How I Built My First RAG Pipeline
LLM hallucinations have been an issue even for tech giants like Google (simply ask Gemini how many rocks are recommended to eat per day… spoiler alert, it's one per day). While we still don't know how to teach LLMs common-sense knowledge, what we can do is give them enough context for a specific use case. This is where Retrieval-Augmented Generation (RAG) comes in! In this article, I will walk you through how I implemented a RAG pipeline that can read my resume and talk to recruiters for me!
What is Retrieval-Augmented Generation (RAG)?
First, let's cover our bases and ensure we understand what RAG is and how it works. In a nutshell, Retrieval-Augmented Generation (RAG) is a technique where an LLM's answer generation is augmented with additional relevant information retrieved from a collection of domain knowledge. The RAG pipeline picks the most relevant chunks of text from your private data and lets the LLM read them along with the prompt to generate an answer. For example, in this article, I am building a bare-bones chatbot that answers recruiters' questions for me. For the LLM to accurately do its job, I must "tell" it who I am. Using a RAG pipeline, I can let it retrieve the most relevant parts of my resume for every recruiter's question, augmenting the LLM's answer generation.
How a Simple RAG Pipeline Works
Now that we've covered the high-level theory, let's dive into the details. This is what a simple RAG pipeline looks like:
[Diagram: a simple RAG pipeline, showing the data indexing stage and the data retrieval and generation stage]
As shown in the diagram, there are 2 stages in building a simple RAG pipeline:
- Data Indexing
- Data Retrieval and Generation
Data Indexing
It starts with data indexing, which means converting text data into a searchable database of vector embeddings. First, during the data indexing stage, the collection of documents is split into smaller chunks of text. This way, smaller, more precise chunks of text can be fed to the LLM when needed, instead of overwhelming it with too much information. Then the chunks of text are transformed into vector embeddings. Vector embeddings encode the meaning of natural language text into numbers that computers can work with. And finally, the vector embeddings are stored in a vector database, allowing them to be easily searched.
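To make the chunking idea concrete, here is a toy word-based chunker with a small overlap between consecutive chunks. This is only an illustrative sketch of my own; the actual pipeline below uses LlamaIndex's SentenceSplitter, and the chunk_text function, window size, and overlap here are hypothetical choices.
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 10) -> list[str]:
    # Split the text into words and group them into overlapping windows
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[start:start + chunk_size]) for start in range(0, len(words), step)]

chunks = chunk_text("Diana is a data scientist with experience in Python, NLP, and machine learning...")
print(len(chunks), "chunks created")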
Data Retrieval and Generation
Now that chunks of context data are stored in a searchable database, data retrieval and generation starts. First, the user's query (or prompt) is transformed into a vector embedding, just like the context data in the vector database. Then the query vector is compared against all the vectors for context data in the vector database to select the top k chunks of context data that are most similar to the user's query. Finally, the user query and the selected chunks of context are fed to the LLM and the answer is generated. That's it!
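Purely to illustrate the retrieval step (this is not part of the pipeline we build below), a toy top-k search with cosine similarity might look like the sketch below, assuming the query and the context chunks have already been turned into vectors; top_k_chunks is a made-up helper name.
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray, chunks: list[str], k: int = 3) -> list[str]:
    # Cosine similarity between the query vector and every chunk vector
    sims = chunk_vecs @ query_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
    # Return the k chunks with the highest similarity, most similar first
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in best]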
How I Built a Simple RAG Pipeline
Now that we understand the theory behind a RAG pipeline, let's put it into practice!
These are the steps we'll follow:
- Set up the environment
- Import an LLM
- Import an embedding model
- Prepare the data
- Prompt Engineering
- Create the query engine
- Run the RAG pipeline
Setting Up the Environment
First, we need to import all the necessary libraries. We'll use the following ones:
- Chroma – an AI-native open-source vector database. Chroma will allow us to create a vector database for the vector embeddings.
- LlamaIndex – a framework for building context-augmented generative AI applications with LLMs. LlamaIndex will handle everything from reading the context data to creating vector embeddings, creating a prompt template, and prompting Llama LLM locally.
import chromadb
from llama_index.core import PromptTemplate
from llama_index.core import Settings
from llama_index.core import SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore
To install these libraries, you can run the following commands:
pip install chromadb
pip install llama-index
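Note that recent versions of LlamaIndex (0.10 and later) ship the Ollama LLM, Hugging Face embedding, and Chroma vector store integrations as separate packages, so if the imports above fail you may also need:
pip install llama-index-llms-ollama
pip install llama-index-embeddings-huggingface
pip install llama-index-vector-stores-chroma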
Importing Llama LLM
Now that all the necessary libraries are imported, we can bring in an LLM. I opted for Llama because it allows me to run it locally, which means it's free and private! The Ollama integration makes it super easy: simply specify which model you want to use, then prompt it by calling .complete.
llm = Ollama(model="llama3")
response = llm.complete("Who is Laurie Voss? write in 10 words")
print(response)
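This assumes the Ollama application is installed and running locally, and that the model has already been downloaded, for example with:
ollama pull llama3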
Importing an Embedding Model
Next, we import an embedding model that handles the transformation of the context data and the prompt from text to vector embeddings. There is a huge variety of embedding models to choose from. I used "BAAI/bge-small-en-v1.5" from Hugging Face because it is a small model. The smaller the model, the faster the implementation, but at the expense of the model's capability. Since my RAG pipeline is only a proof of concept (POC), I don't mind the suboptimal performance, and the speed boost is a plus.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = llm
Settings.embed_model = embed_model
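As a quick, optional sanity check, you can embed a sample sentence and inspect the resulting vector; bge-small-en-v1.5 produces 384-dimensional embeddings (the sentence below is just an arbitrary example):
sample_vec = embed_model.get_text_embedding("Diana is a data scientist.")
print(len(sample_vec))  # 384 dimensions for bge-small-en-v1.5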
Preparing the Data
The embedding model is used as part of data preparation. To prepare the data, we first read the file that contains the context using SimpleDirectoryReader. In this case, it is a PDF of my one-page resume. Then we create a vector database using Chroma. Finally, we chunk the context data, transform it into vector embeddings, and store them in the vector database using VectorStoreIndex.
documents = SimpleDirectoryReader(input_files=["./resume.pdf"]).load_data()
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("ollama")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    transformations=[SentenceSplitter(chunk_size=256, chunk_overlap=10)],
)
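One thing to keep in mind: EphemeralClient keeps the vectors in memory only, so the index is rebuilt on every run. If you want the database to survive restarts, Chroma also offers a persistent client; here is a minimal sketch of that alternative, assuming a local ./chroma_db directory:
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("ollama")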
Prompt Engineering
Now that the inner workings of the RAG pipeline are set up, we write a prompt template that assigns the LLM a task and a persona, provides the relevant context, and plugs in the recruiter's question.
template = (
    "Imagine you are a data scientist's assistant and "
    "you answer a recruiter's questions about the data scientist's experience. "
    "Here is some context from the data scientist's "
    "resume related to the query:\n"
    "-----------------------------------------\n"
    "{context_str}\n"
    "-----------------------------------------\n"
    "Considering the above information, "
    "please respond to the following inquiry:\n\n"
    "Question: {query_str}\n\n"
    "Answer succinctly and ensure your response is "
    "clear to someone without a data science background. "
    "The data scientist's name is Diana."
)
qa_template = PromptTemplate(template)
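If you want to see exactly what the LLM will receive, you can render the template with placeholder values (the context string below is just a made-up example):
print(qa_template.format(
    context_str="Diana, Data Scientist at Accenture. Skills: Python, NLP, machine learning.",
    query_str="Do you have experience with Python?",
))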
Creating the Query Engine
Lastly, we create a query engine that assembles all the Lego pieces! Setting similarity_top_k=3 tells the retriever to pass the three most relevant resume chunks to the LLM for every question.
query_engine = index.as_query_engine(
    text_qa_template=qa_template,
    similarity_top_k=3,
)
Running the RAG Pipeline
Now, to the fun part of building an AI application – seeing it work! To run the RAG pipeline, prompt the query engine with a question, and voila!
response = query_engine.query("Do you have experience with Python?")
print(response.response)
'Yes, I can confirm that Diana Morales has extensive experience working
with Python as a Data Scientist at Accenture. According to her resume,
she listed Python as one of her core skills, indicating a strong
proficiency in the programming language. Additionally, her projects and
achievements highlight her ability to leverage Python for various data
science tasks, such as natural language processing (NLP), machine learning,
and data visualizations.'
Pretty neat, huh?
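You can also peek at which resume chunks the retriever actually handed to the LLM through the response's source nodes, which is handy for debugging:
for node in response.source_nodes:
    # Each source node carries a similarity score and the retrieved chunk of text
    print(round(node.score, 3), node.node.get_content()[:80])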
In upcoming articles, I will cover RAG pipelines in more depth, touching on more advanced topics. Don't forget to clap and comment if you enjoyed this article, and share the awesome RAG pipelines you build by following along! Till next time!