How to Make a RAG System to Gain Powerful Access to Your Data


A RAG (retrieval-augmented generation) system is an innovative approach to information retrieval. It combines traditional information retrieval techniques, such as vector similarity search, with state-of-the-art large language model technology. Together, these technologies make up a robust system that can access vast amounts of information from a simple prompt.

ChatGPT image of a RAG system. Image by ChatGPT. "can you make an illustration of a rag system, illustrate the point of a computer accessing a knowledge base" prompt. ChatGPT, 4, OpenAI, 17 Mar. 2024. https://chat.openai.com.

Motivation

My motivation for this article is my frustration when trying to find an old email. I typically have some information about it, like who the correspondent was or vaguely what the topic was, but a direct word search in Gmail requires more specific terms, which makes finding the particular email challenging. I would like a RAG system that lets me search my emails with a prompt. So, if I needed an old email from my university about a subject, I could prompt something like "What technical course did I enroll in during my second year at NTNU?". An equivalent direct word search is difficult since I would need more specific keywords. A RAG system, on the other hand, could find the email, given that it has access to the required data.

This tutorial follows a pipeline. First, I will show you how to retrieve some data for RAG. Then, you will pre-process the data before implementing RAG and testing the system. Image by the author.

Table of Contents

· Motivation
· Retrieve the data
· Pre-process data
· Other options
· Implementing RAG
  ∘ Prepare the data
  ∘ Load the LLM
  ∘ Use the LLM
· Testing
  ∘ Test 1
  ∘ Test 2
· Future work
· Conclusion

Retrieve the data

The first step in making a RAG system is finding the data you want it to use. In my case, I want to search emails, though the approach I will discuss also applies to any other data source. I downloaded all my emails from takeout.google.com. There, you can select to download only your mail and press export. Google then sends a link to your email, from which you can download all your mail information. If you use another email client, there are similar ways to download all your email information.

Furthermore, there are other data sources you could use for this tutorial rather than emails; any collection of documents containing textual information will work.

However, you could also apply the approach to other types of information, like images or audio. The challenge for the RAG system is not searching the data, since you can easily vectorize different modalities of data like text, photos, and audio. The more significant issue is generating text, audio, or images in response, though there are methods for that as well. This article, however, will only discuss generating textual answers from textual information.

Pre-process data

After downloading your data, it is time to pre-process it, an essential step in most machine learning pipelines. Since I downloaded my emails, my pre-processing step shows how to handle downloaded emails, though you can apply the same thought process to other types of data.

After downloading the data from takeout.google.com, you will have a zip archive in your downloads folder. Extract it and find the file called All mail Including Spam and Trash.mbox, which contains all the necessary mail information. Move this file to the folder you are programming in, and then you can use the following code, partly from StackOverflow, to read out the most essential information from the file:

First, install and import packages:

# install packages
!pip install beautifulsoup4
!pip install pandas
!pip install tqdm

# import packages
import pandas as pd
import email
from email.policy import default
from tqdm import tqdm
from bs4 import BeautifulSoup #to clean the payload

Then make a class to read the mbox file:

# code for class from https://stackoverflow.com/questions/59681461/read-a-big-mbox-file-with-python/59682472
class MboxReader:
    def __init__(self, filename):
        self.handle = open(filename, 'rb')
        # an mbox file starts every message with a "From " separator line
        assert self.handle.readline().startswith(b'From ')

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.handle.close()

    def __iter__(self):
        return iter(self.__next__())

    def __next__(self):
        # collect lines until the next "From " separator, then yield the parsed message
        lines = []
        while True:
            line = self.handle.readline()
            if line == b'' or line.startswith(b'From '):
                yield email.message_from_bytes(b''.join(lines), policy=default)
                if line == b'':
                    break
                lines = []
                continue
            lines.append(line)

Then, you can extract the mail information to a pandas dataframe with:

path = r"/All mail Including Spam and Trash.mbox"
mbox = MboxReader(path)

MAX_EMAILS = 5
current_mails = 0

all_mail_contents = ""
mail_from_arr, mail_date_arr, mail_body_arr = [],[],[]
for idx,message in tqdm(enumerate(mbox)):
    # print(message.keys())
    mail_from = f"{str(message['From'])}\n".replace('"','').replace('\n','').strip()
    mail_date = f"{str(message['Date'])}\n".replace('"','').replace('\n','').strip()
    payload = message.get_payload(decode=True)
    if payload:
        current_mails += 1
        if current_mails > MAX_EMAILS:
            break
        soup = BeautifulSoup(payload, 'html.parser')
        body_text = soup.get_text().replace('"','').replace("\n", "").replace("\t", "").strip()
        mail_from_arr.append(mail_from)
        mail_date_arr.append(mail_date)
        mail_body_arr.append(body_text)
        all_mail_contents += body_text + " "

Where:

  • path is the path to the mbox file referenced earlier
  • MAX_EMAILS is the number of emails you want to extract. Note that I am extracting the body (payload) of each email, which not all emails contain. Emails without a body are therefore skipped and do not count toward the maximum number of emails to extract

This code will save the mail information to arrays and a string of all mail contents. You can save this to file with:

df = pd.DataFrame({'From':mail_from_arr, 'Date':mail_date_arr, 'Body':mail_body_arr})
df.to_pickle("df_mail.pkl")

# write all mail contents to txt
with open("all_mail_contents.txt", "w", encoding="utf-8") as f:
 f.write(all_mail_contents)

All the information you need is now in files and ready for a RAG system to retrieve.

Other options

Before I start implementing the RAG system, I would also like to mention some less fancy but still reliable alternatives to RAG. RAG is a search system from the information retrieval field. Information retrieval is essential as it allows us to access the vast amounts of available data, and its technology is used in every search engine, such as Google. TF-IDF was introduced in 1972 and is still a reliable information search algorithm. I have implemented a search engine with it myself, which you can see in the article below, and you may be impressed by how well the algorithm still works.

TF-IDF in Python, index your data, and do inference!

BM25 is an improvement on TF-IDF that can be implemented with a slight modification to the TF-IDF code. My point in mentioning other information search options like TF-IDF and BM25 is that a full RAG system might not always be necessary. If you simply need direct information access, simpler options still work well.
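
To make this concrete, here is a minimal, hand-rolled BM25 scoring sketch. It is my own illustration, not code from this article's pipeline; the corpus, query, and parameter values are made up:

# minimal BM25 scoring sketch; corpus, query, and parameters are illustrative
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    N = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / N
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

corpus = ["the cat sat on the mat", "dogs and cats", "a mat for the dog"]
print(bm25_scores("cat mat".split(), [doc.split() for doc in corpus]))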

Implementing RAG

It is now time to implement the RAG system. I will use the Langchain framework, an excellent tool for quickly setting up a customizable RAG system. To create this section, I followed the Langchain documentation and customized the code for my own data.

First, you must install and load all required packages:

# install packages
!pip install langchain
!pip install gpt4all
!pip install chromadb
!pip install llama-cpp-python
!pip install langchainhub

# import packages
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import LlamaCpp
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain.docstore.document import Document
from langchain import hub
from langchain_core.runnables import RunnablePassthrough, RunnablePick
import pandas as pd

Prepare the data

Then, load the dataset you made in the pre-processing step. Note that I am only using the mail contents in this case, but you can also add additional information like the sender and the date of each email.

# load custom dataset
with open("all_mail_contents.txt", "r", encoding="utf-8") as f:
 all_mail_contents = f.read()

You then have to convert the string into a format Langchain can read. The code below first converts your text to Langchain's Document format and then initializes a text splitter, which splits the string into overlapping chunks. These chunks are essential: when you ask the RAG system a question, the most relevant chunks are retrieved and given to the LLM as context for answering it. The overlap between chunks helps ensure that each one contains enough information to answer a given question. After splitting, each chunk is vectorized (converted to numbers). This is a typical information retrieval step: the data you want to search is vectorized, the question you ask is vectorized too, and the most relevant data is chosen by taking the data vectors with the smallest distance to the question vector.

# convert to langchain document format
doc = Document(page_content=all_mail_contents, metadata={"source": "local"})
# split into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
all_splits = text_splitter.split_documents([doc])
# vectorize the chunks and store them in a Chroma vector store
vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())

Note: I had a problem running the last line above on Python 3.8, but it works out of the box with Python 3.11. The issue is due to the version of sqlite3 (a package bundled with Python) in Python 3.8.

Vectorization

The vectorization process works as follows. I am giving an example here with randomly chosen numbers. First, you vectorize each document:

Image showcasing vectorization of documents. Image by the author.

Then you vectorize the question:

Image showcasing vectorization of a question. Image by the author

Then you find the similarity between the question and each document, for example with cosine similarity; this outputs an array of numbers between 0 and 1, like below:

Image showing similarities between a question and documents. Image by the author.

You then choose the documents that are most similar to the question. If you want two documents, you would select documents 1 and 3.
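
As a minimal sketch of this idea, you can compute the cosine similarities yourself. The vectors below are made up, in the spirit of the illustrations above:

# toy example of retrieval by cosine similarity; the vectors are made up
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

question_vec = np.array([0.9, 0.1, 0.4])
doc_vecs = [np.array([0.8, 0.2, 0.5]),  # document 1
            np.array([0.1, 0.9, 0.2]),  # document 2
            np.array([0.7, 0.0, 0.6])]  # document 3

sims = [cosine_similarity(question_vec, d) for d in doc_vecs]
# pick the two most similar documents
top_two = np.argsort(sims)[::-1][:2]
print(sims, top_two)  # documents 1 and 3 come out on top for these vectors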

Load the LLM

After preparing the data, you must load your large language model. There are plenty of options for language models, but I will focus on free models. Langchain is a wrapper that enables easy use of LLMs, but you still have to download the LLM yourself. However, this does not have to be a complicated process, and I will show you two ways of doing it.

The first option, and my preferred choice, is Llama2, which you can learn to download and use with my tutorial below:

Downloading and running Llama2 for Windows

You must download the model and convert it to a .gguf format, which Langchain can easily use. You can then use the model with:

n_gpu_layers = 1  # number of model layers to offload to the GPU
n_batch = 512  # tokens processed in parallel; should be between 1 and n_ctx

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path=r"ggml-model-f16_q4_1.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=1024,
    f16_kv=True,  # MUST set to True, otherwise you will run into problem after a couple of calls
    verbose=True,
)

Here, model_path is the path to your model, and n_ctx is the context size you want to use. A larger context size is better but also requires more compute, a trade-off you must consider.

The second, more accessible option is to use GPT4All, which the Langchain tutorial mentions. Despite what the name suggests, GPT4All offers open-source models you can use. You can, for example, download TinyLlama from HuggingFace (the tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf model). You can also look for larger Llama models if you want increased performance, though you can still get decent results with a smaller model.

After you have downloaded the model, move it to your local folder. You can then load the model with:

from langchain_community.llms import GPT4All

llm = GPT4All(
    model=r"tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf",
    max_tokens=2048,
)

You are then ready to go with your LLM!

Use the LLM

If you want to use your language model, you can invoke it like this:

question = "What is Medium"
llm.invoke(question) 

To which my Llama2 model gives the response:

Medium is a platform that allows writers to share their thoughts, ideas, and stories with the world. It was founded by Evan Williams, one of the co-founders of Twitter, and has become a popular place for writers to publish long-form content, including articles, essays, personal stories, and more. Here are some things you can do on Medium:
1. Publish your writing: The most obvious thing you can do on Medium is to publish your writing. Whether you want to write about a personal experience, share your expertise in a particular field, or tell a story, Medium provides a platform for you to do so.
2. Follow other writers: Medium has a large community of writers, and you can follow other writers whose work interests you. This way, you'll see their new publications in your feed and can discover new voices and perspectives.
3. Read and engage with publications: Medium publishes a wide variety of content, including articles, essays, personal stories, and more. You can read and engage with these publications by commenting, liking, or sharing them with others.

Another exciting function you can use is the similarity search, which works as a standard search algorithm, returning the data points most similar to your question. It vectorizes both the question and the data points and chooses the ones with the smallest distance to the question vector. You can do a similarity search with:

# use this code to check whether the retrieval works: the documents returned should
# relate to your question, and they are the context the LLM uses to answer it
question = "What is Google"
docs = vectorstore.similarity_search(question)

This can be a valuable tool for debugging whether the RAG system retrieves documents as intended; the documents returned should be relevant to the question you asked.
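
For example, here is a small snippet (my own debugging addition, not part of the Langchain tutorial) that prints a short preview of each retrieved chunk:

# print the first 200 characters of each retrieved chunk
for i, d in enumerate(docs):
    print(f"--- chunk {i} ---")
    print(d.page_content[:200])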

You then need a helper function to format the retrieved documents:

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

And form a Langchain chain to help you ask questions. The rag_prompt adds additional text to your question to ensure the LLM uses the retrieved documents to answer it.

# retrieve relevant docs
rag_prompt = hub.pull("rlm/rag-prompt")
rag_prompt.messages

retriever = vectorstore.as_retriever()
qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

Finally, you can then have the LLM answer a question using retrieved documents with the following:

qa_chain.invoke(question)

This line of code invokes the chain you created: it retrieves the documents most relevant to your question and then prompts the LLM to answer the question, given those documents.

Testing

Testing the RAG system quantitatively is difficult, as creating a labeled dataset is labor-intensive. I will therefore test the system qualitatively in this article. Note that the implementation above is simple, so you should not expect state-of-the-art performance.

I tested my RAG system by first finding a few emails I wanted to retrieve and then seeing if the system could surface the information I needed. It is important to note that many of my emails are written in Norwegian, which makes some tasks more difficult for the RAG system, though still possible. I will not share the emails themselves here since I want to keep them private, but you can try this out yourself. For these tests, I gathered around 100 emails containing a body and asked questions relating to them.

Test 1

I have an email from Digipost, a Norwegian digital postal service. I want to see if the RAG system can determine whether I received such an email.

I ask the question:

question = "Have I gotten a message from Digipost?"

And the RAG system answers:

'Yes, you have received a message from Digipost.'

So, the LLM passed the test. This test is interesting, as the document retrieval manages to find the relevant emails because Digipost is an uncommon word. What makes it more impressive is that the email from Digipost is in Norwegian.

Test 2

I have also received an email from Microsoft about the Cloud Skill Challenge, so I asked the RAG system:

question = "What is the topic of my last Microsoft email I have gotten?"

And it answers:

'The topic of your last Microsoft email is learning content.'

As you can see, the RAG system is correct again, though this time the answer is vague and not optimal.

Since I thought the answer was a bit vague, I tried to make the LLM answer in more detail:

question = "What is the topic of my last Microsoft email I have gotten? Answer in detail"

In which it answered:

'Based on the provided context, the topic of your last Microsoft email is likely related to learning paths and modules in Microsoft Learn, with a mention of trending content and contact information for feedback or help.'

So the system works!

Future work

Though the system works as intended, there are many features I would like to add to this RAG system. Here are a few future improvements I am thinking about working on:

  • Adding more information about each email, like the sender and the date. This will give the RAG system more information to work with
  • Adding an option for the RAG system to return a specific email. Given a question like "Can you find the email regarding … that arrived around June 23?", I want it to return the specific email (or a list of candidate emails)
  • Better splitting of the data. Right now, I combine the text of all emails into one string; instead, I could make each email its own chunk and access the information that way (see the sketch after this list)
  • Larger context windows. Increasing the LLM's context window would allow follow-up questions, for example when the RAG system does not give a detailed enough answer
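
As a sketch of the first and third points (my own untested idea, not code from this article's pipeline), you could build one Langchain Document per email, carrying the sender and date as metadata, reusing the dataframe saved earlier:

# sketch: one Document per email, with sender and date as metadata
import pandas as pd
from langchain.docstore.document import Document

df = pd.read_pickle("df_mail.pkl")
docs = [
    Document(
        page_content=row["Body"],
        metadata={"from": row["From"], "date": row["Date"]},
    )
    for _, row in df.iterrows()
]
# these documents could then be split and vectorized as before:
# all_splits = text_splitter.split_documents(docs)
# vectorstore = Chroma.from_documents(documents=all_splits, embedding=GPT4AllEmbeddings())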

Conclusion

In this article, I have discussed how you can create a RAG system using your own data. Given a prompt, it can be a powerful search engine to retrieve the information you seek. To build the system, you first retrieve the data, for example by downloading your mail from Gmail. You then pre-process the data to make it accessible to the RAG system: it is converted to Langchain's document format, split into chunks, and vectorized. An LLM can then answer questions about the data using the retrieved chunks. I also included two tests showing how the RAG system responds to queries, demonstrating that it works as intended.

You can find the code used in this article on my GitHub.

You can find part 2 of this article below:

How to improve your RAG system for more efficient question-answering

You can also read my articles on WordPress.
