Battling Open Book Exams with Open Source LLMs
Disclaimer: This is not a cheat or a hack for any examination. It is just a tool to help you prepare better for your course exams. Use it wisely.
Hi, I am Jubayer Hossain, a master's student at FAU-Erlangen. My program in ElectroMobility consists of courses in mechanical studies, AI, and programming. This semester I took two courses with open-book examinations, for which all of the lecture content is provided.
Since we are allowed to use any resources, and LLMs are a big thing now, I plan to build a RAG pipeline around an open-source LLM with the help of LangChain to help me search the content and prepare better for the exam.
So, without further ado, let's get on with the project plan.
Project Plan
First, I need all the lecture slides provided on the course portal. There are 16 PDF slide decks, which I downloaded manually. Of course, I could write a script to download all the slides automatically, but for only 16 files, we can agree that doing it manually is faster.
Second, we need to pick some open-source models: one embedding model and one powerful LLM to act as the answer generator. I chose nomic-embed-text from Ollama as my embedding model. For the generator, let's get our hands on the biggest free model we can use, the Llama 3 70B model.
But you might ask how I am going to run this 70-billion-parameter model on my GPU. Do I own multiple A6000s? No! I will use Groq Cloud as my server to run the llama3-70b model.
To use Groq Cloud, open an account and get a free API key from Groq. You don't need a card or any payment to use Groq Cloud's free tier.
Third, let's talk about the retrieval process. It starts with dividing the documents into small, manageable chunks. The best chunking technique for my case is simply dividing the slides into pages. Each page explains one basic concept, and if a concept needs more room, the information spans several consecutive pages. Chunking by page still works, because we retrieve multiple chunks per query, so related pages can be recovered together.
The text is extracted from each page and passed through the embedding model, which creates an embedding vector for the information on that page. One page's embedding can then be compared for similarity with the next page's embedding: if the embeddings are far apart, the pages are talking about different things and therefore belong in different chunks.
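For illustration, here is a minimal sketch of that comparison. The two page texts below are made up, and it assumes Ollama is already running locally with the nomic-embed-text model pulled (covered later in this article).

from langchain_community.embeddings import OllamaEmbeddings

embedder = OllamaEmbeddings(model="nomic-embed-text")

# Two example page texts (invented for illustration)
page_1 = "Floating point numbers are stored as a sign, an exponent and a mantissa."
page_2 = "The IEEE 754 standard defines single and double precision formats."

v1 = embedder.embed_query(page_1)
v2 = embedder.embed_query(page_2)

# Cosine similarity between the two page embeddings
dot = sum(a * b for a, b in zip(v1, v2))
norm1 = sum(a * a for a in v1) ** 0.5
norm2 = sum(b * b for b in v2) ** 0.5
print(f"Cosine similarity: {dot / (norm1 * norm2):.3f}")  # high value -> the pages likely cover the same concept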
We index the chunks into a vector database. For this project, I am using ChromaDB.
Finally, when the vectorstore is ready for retrieval, we set up the LLM to receive the query together with the retrieved context.
Imports and file structure
I am keeping all my PDF slides in one folder named "pdfs". As this is a research and development project, I am doing all the coding in a Jupyter notebook. Here is the folder structure below.
.
├── RnD.ipynb
└── pdfs
    ├── 01_motivation_and_logistics.pdf
    ├── 02_computers.pdf
    ├── 03_data_representation (1).pdf
    ├── 04_floating_point_numbers.pdf
    ├── 05_memory_organisation.pdf
    ├── 06_branching_and_iterations.pdf
    ├── 07_abstraction_functions_tupels_lists.pdf
    ├── 08_recursion_and_dictionaries.pdf
    ├── 09_testing_debugging.pdf
    ├── 10_Object_Oriented_Programming.pdf
    ├── 11_Classes and Inheritance.pdf
    ├── 12_Program_efficiency_1.pdf
    ├── 13_Program_efficiency_2.pdf
    ├── 14_Searching_and_sorting.pdf
    ├── 15_trees.pdf
    └── 16_version_management_and_git.pdf
Now let's list the packages we will need for this project. Install them from the CLI, or put a ! before the command if running it in a Jupyter cell.
pip install langchain chromadb langchain_core langchain_groq tqdm pypdf langchain_community matplotlib
import os

from tqdm import tqdm
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
Choosing and Setting up the LLM
As mentioned above, I am using the "nomic-embed-text" model from Ollama as the embedding model for the vectorstore. For that, you need to install Ollama on your local machine.
What is Ollama? Ollama is a free, open-source project that serves as a platform for running LLMs on your local machine. You can install it from the official website; it is available for all three major operating systems (macOS, Linux, and Windows).
I have Ollama installed in WSL (Windows Subsystem for Linux), so whenever my Ubuntu terminal is running, I know that Ollama is running as well.
After installing Ollama on your OS, you can pull the embedding model with the command:
ollama pull nomic-embed-text
This downloads the 274 MB model from the Ollama library to your local machine. This model is not an LLM in the sense that you cannot chat with it. It is called an embedding model because it returns an embedding vector for an input string. The embedding holds semantic information about the string and is normally used to find similarities between multiple texts.
For our generative LLM, we will be using the llama3-70b model served from Groq Cloud. Get the API key, install langchain_groq, and you can use it in your project.
Let's set up the models in our notebook.
embedding_model = OllamaEmbeddings(model="nomic-embed-text")

# Read the Groq API key from an environment variable (or paste your key here)
GROQ_API_KEY = os.getenv("GROQ_API_KEY")

chat = ChatGroq(
    temperature=0,
    model="llama3-70b-8192",
    api_key=GROQ_API_KEY,
)
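As a quick sanity check (assuming Ollama is running and the GROQ_API_KEY environment variable is set), we can embed a string and send a trivial message to the chat model; this is just my own smoke test, not part of the pipeline.

# Embedding model: returns a vector of floats
vector = embedding_model.embed_query("What is recursion?")
print(len(vector))  # dimensionality of the nomic-embed-text embedding

# Chat model: should answer through Groq Cloud
print(chat.invoke("Reply with the single word 'ready'.").content)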
Prompting is important in a RAG pipeline and should be handled with care. Using ChatPromptTemplate from LangChain, we can format the prompt below so that the retrieved context from the vectorstore and the user query are inserted into the {context} and {question} fields respectively. Let's set up the prompt template.
PROMPT_TEMPLATE = """
You are a helpful university professor teaching an Algorithms, Programming and Data Structures course.
Try to keep the answers short and to the point unless the question starts with explain, describe, or define.
Answer the question based only on the following context:
{context}
- -
Answer the question based on the above context: {question}
"""
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
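To see what the final prompt looks like, we can format the template with toy values (these are invented for illustration, not taken from the slides).

# Fill the template with an example context and question
example_prompt = prompt_template.format(
    context="The exam takes place in lecture hall H4.",
    question="Where does the exam take place?",
)
print(example_prompt)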
Chunking (not normal)
The most basic way to chunk a text is to use the RecursiveCharacterTextSplitter, which divides the document into fixed-size chunks with some overlap. This is not ideal, as the semantic meaning of each chunk gets mixed up.
As stated above, I plan to chunk each PDF page by page. This means the chunks will have different lengths, which is fine as long as each chunk's semantic meaning stays isolated.
Let's do some data visualization to see the variation in page content lengths. Below is the number of characters per page for one of the slide decks.
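Here is a minimal sketch of how such a plot can be produced with matplotlib (the deck chosen below is just one example from my pdfs folder).

import matplotlib.pyplot as plt
from langchain_community.document_loaders import PyPDFLoader

# Count the characters per page for one slide deck
pages = PyPDFLoader("pdfs/04_floating_point_numbers.pdf").load_and_split()
char_counts = [len(page.page_content) for page in pages]

plt.bar(range(1, len(char_counts) + 1), char_counts)
plt.xlabel("Page number")
plt.ylabel("Characters per page")
plt.show()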

From the image, we can see a wide range of character lengths. Pages longer than 600 characters are, in my opinion, too big to keep as a single chunk. For those pages, I recursively split the content into equal chunks with 50% overlap.
Let's initialize the vectorstore and write the code for chunking.
pdf_directory = "pdfs"

vectorstore = Chroma(persist_directory="./apdr_lectures", embedding_function=embedding_model)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=150
)

for pdf in tqdm(os.listdir(pdf_directory)):
    loader = PyPDFLoader(f"{pdf_directory}/{pdf}")
    pages = loader.load_and_split()
    # iterate over the pages but skip the first page
    for page in tqdm(pages[1:]):
        # if the page is empty, skip it
        if not page.page_content:
            continue
        # if the page is longer than 600 characters, split it into smaller chunks
        if len(page.page_content) > 600:
            chunks = text_splitter.create_documents([page.page_content])
            vectorstore.add_documents(chunks)
        else:
            vectorstore.add_documents([page])
I am skipping the first page of each slide deck, which only contains the title of the deck and isn't relevant for answering specific queries.
The code iterates through the list of PDFs and loads each one with the PyPDF loader. The slides are split into pages, and a page is stored as a single chunk if its character length does not exceed 600. If it does, the content is further split into 300-character chunks with a 150-character overlap. Simple enough.
Another technique that came to mind later is agent-based chunking, similar to Greg Kamradt's idea but a tad different. The plan is to add a one-line summary or heading of the entire page as metadata. Our main LLM will generate the heading, which will also be embedded using the embedding model. Maybe I will write about this in the next article.
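As a rough sketch of that idea (not implemented in this article; it reuses the chat model, the pages of one deck, and the vectorstore from the earlier cells, and the heading prompt wording is my own):

heading_prompt = "Summarize the following lecture slide page as one short heading:\n\n{page}"

for page in pages[1:]:
    # Ask the generator LLM for a one-line heading and attach it as metadata
    heading = chat.invoke(heading_prompt.format(page=page.page_content)).content
    page.metadata["heading"] = heading
    vectorstore.add_documents([page])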
Querying responses using the vectorstore and the LLM
The time has come to finally test the pipeline with queries about the subject. The process is to ask specific questions about a particular topic and see whether chunk retrieval works correctly. We also have to check whether, given the correct context, the LLM can generate correct answers to the question.
Let's define a query and get similar documents according to the embeddings.
query_text = 'When is the exam?'
similar_docs = vectorstore.similarity_search(query_text, 3)
I forgot the date of the exam, so let's ask the model, as I am sure it is mentioned in the organizational slides.
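To check that retrieval is working, we can take a quick look at what came back (a small inspection snippet of my own):

for doc in similar_docs:
    # Show the source file and the first few characters of each retrieved chunk
    print(doc.metadata.get("source"), "->", doc.page_content[:80])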
The three most similar chunks are retrieved for the query. The contents of the chunks are combined into a single string. In the PROMPT_TEMPLATE defined above, the combined string is passed into the context field while the query is passed into the question field.
# Combine context from matching documents
context_text = "\n\n - -\n\n".join([doc.page_content for doc in similar_docs])

# Create the prompt using the context and query text
prompt = prompt_template.format(context=context_text, question=query_text)
The last part is simply to invoke our LLM with the prompt we created.
response = chat.invoke(prompt)
print(response.content)
The exam is on Tuesday, 23.07.2024, 14:00-16:00.
Improvement plans
The first would be to change the chunking technique to an agent-based approach. This is what I plan to do next.
I also believe that a knowledge graph would be more appropriate for this use case, because knowledge graphs can answer questions that require global information, while vector databases normally answer questions locally. In other words, asking our model to summarize an entire slide deck would not work, since only 3 chunks are retrieved per query. The information in one deck is split into far more than 3 chunks, so the LLM never sees all of it and cannot produce an effective answer.
To study each slide deck effectively, I would have to write a script that sends a summarization prompt to the LLM. Depending on the prompt, this can produce whatever I need from the list of slides, so prompt engineering plays an important role as well.
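A rough sketch of such a script, reusing the chat model from above (the prompt wording is my own, and very long decks may exceed the model's 8k-token context window):

summary_prompt = "Summarize the following lecture slides as concise bullet points for exam revision:\n\n{slides}"

# Load one deck and join its pages (skipping the title page) into a single string
deck = PyPDFLoader("pdfs/15_trees.pdf").load_and_split()
full_text = "\n\n".join(page.page_content for page in deck[1:])

summary = chat.invoke(summary_prompt.format(slides=full_text)).content
print(summary)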
If you have a paid OpenAI plan, using the latest GPT-4o model can also improve the generation performance of the RAG pipeline.
Techniques such as prompt rewriting, summarization, and reranking of the retrieved documents can also be applied to feed the LLM the best possible context for our questions.
I will use some of these techniques to test my RAG pipeline in the future. So be sure to follow me if you want to know about it first.
Last words
Evaluation is important when building a new RAG pipeline. However, evaluating a specific use case is not so easy with open-source frameworks, as they are geared toward general use cases.
Each of the improvements mentioned above needs to be evaluated with quantifiable metrics. It cannot just be, "I feel like the model is doing better".
I am reviewing some evaluation methods to better understand the effectiveness of the changes I am making to my RAG pipeline. I will write about that as soon as I have some good results to show you.
Consider clapping and commenting on this article as it helps a lot as a new writer on Medium.
Resources
- LangChain: https://www.langchain.com/
- Ollama: https://ollama.com/
- Greg Kamradt (Data Indy): The 5 Levels Of Text Splitting For Retrieval (youtube.com)
- Groq Cloud: https://console.groq.com/
- GitHub: zubu007 (Zubayer Hossain) (github.com)