I began by writing a simple Rust program that extracts the text contents of a PDF, to speed things up. Then I used it to export a bunch of sets to .txt files (see the scibowlsetsannotated repo).
My first thought was a simple pipeline: an LLM generates around 3 questions per page, those get passed into a filtering and classification algorithm/model, and then go back into an LLM for option generation (if the question is MCQ). However, I realized that a fixed per-page quota could go horribly if pages were overly dense or overly sparse: a dense page would have most of its content skipped, and a sparse page would get padded with filler questions. My current idea is something like:
- use LLM to parse the textbook and chunk it by subject
- store chunk embeddings in a FAISS index or something similar for nearest-neighbor search
- select level from vector graph to use for overarching topics + randomly select subnodes? or use LLM to generate a plan for the packet from the vector graph
- use LLM to generate targeted questions, again from vector graph sublevels
- verification by LLM? bro why do i have so many LLMs omfg
- MCQ generation by looking at nearest embeddings for distractors?
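
To make the last bullet concrete, here is a minimal sketch of picking distractors from the nearest embeddings. Everything in it is an assumption for illustration: the `pick_distractors` name, the `max_sim` cutoff, and the hand-made 3-dimensional `TOY` vectors, which stand in for real embeddings from whatever model ends up being used. It does a plain linear scan with cosine similarity rather than calling FAISS.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_distractors(answer, embeddings, k=3, max_sim=0.98):
    """Return the k terms closest to `answer`, skipping near-duplicates.

    `max_sim` is a hypothetical cutoff: anything at or above it is treated
    as a synonym or duplicate of the answer and would make a bad wrong option.
    """
    target = embeddings[answer]
    scored = []
    for term, vec in embeddings.items():
        if term == answer:
            continue
        sim = cosine(target, vec)
        if sim < max_sim:
            scored.append((sim, term))
    # Highest similarity first: plausible-but-wrong options come from nearby terms.
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]


# Toy vectors, made up by hand purely for this example.
TOY = {
    "mitochondrion": [0.9, 0.1, 0.0],
    "mitochondria": [0.9, 0.1, 0.0],   # same direction: filtered as a duplicate
    "chloroplast": [0.8, 0.3, 0.0],
    "ribosome": [0.7, 0.1, 0.3],
    "nucleus": [0.6, 0.2, 0.4],
    "tectonic plate": [0.0, 0.1, 0.9],  # unrelated subject, ranks last
}

if __name__ == "__main__":
    print(pick_distractors("mitochondrion", TOY))
```

With real embeddings, the linear scan would be swapped for a search against the stored index, and the cutoff would need tuning so that synonyms of the correct answer don't show up as options.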