Augmenting LLMs with RAG


I've written quite a few blogs on Medium around different technical topics, especially around Machine Learning (ML) Model Hosting on Amazon SageMaker. I've also lately developed an interest in the growing Generative AI/Large Language Model ecosystem (like everyone else in the industry lol).

These two different verticals led me to an interesting question. How good are my Medium articles at teaching Amazon SageMaker? To answer this I decided to implement a Generative AI solution that utilizes Retrieval Augmented Generation (RAG) with access to some of my articles to see how well it could answer some SageMaker related questions.

In this article we'll take a look at building an end-to-end Generative AI solution, utilizing a few popular tools to operationalize the workflow:

  • LangChain: LangChain is a popular Python framework that simplifies Generative AI applications by providing ready-made modules for Prompt Engineering, RAG implementation, and LLM workflow orchestration.
  • OpenAI: LangChain will take care of the orchestration of our Generative AI app; the brains, however, are still the model. In this case we use an OpenAI provided LLM, but LangChain also integrates with other model sources such as SageMaker Endpoints, Cohere, etc.

NOTE: This article assumes an intermediate understanding of Python and a basic understanding of LangChain specifically. I would suggest following this article to better understand LangChain and building Generative AI applications.

DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.

Problem Overview

Large Language Models (LLMs) by themselves are incredibly powerful and can often answer many questions without assistance from fine-tuning or additional knowledge/context.

This, however, becomes a bottleneck when you need access to other, specific sources of data, especially recent data. For example, while OpenAI's models have been trained on a large corpus of data, they have no knowledge of the Medium articles I have recently written.

In this case we want to check how well my Medium articles can help assist with answering questions about Amazon SageMaker. OpenAI's models already have some knowledge of Amazon SageMaker from the corpus they have been trained on. What we want to see is how much performance we can gain by providing these LLMs with access to my Medium articles. These can serve almost as a cheat sheet of sorts for LLMs that already have a large knowledge bank.

How do we provide these LLMs access to this additional knowledge and information?

Why We Need RAG

This is where Retrieval Augmented Generation (RAG) comes into play. With RAG we provide an information retrieval system that gives us access to the additional data that we need. This will help us answer more advanced SageMaker questions and augment our LLM's knowledge base. To implement a basic RAG system we need a few components:

  • Embeddings Model: The data we provide access to can't simply be raw text or images; it needs to be captured in a numeric/vector format that NLP models (including LLMs) can work with. To transform our data we utilize the OpenAI Embeddings Model, but there are a variety of other choices, such as Cohere and Amazon Titan, that you can evaluate for performance.
  • Vector Store: Once we have our embeddings, we need a vector datastore that not only stores these vectors but also provides an efficient way to index and retrieve relevant data. When a user has a query, we want to return any relevant context that is similar to this input. Most of these vector stores are powered by KNN and other nearest neighbor algorithms to provide relevant context for the initial question. In this solution we utilize FAISS, a Facebook library that can be used for efficient similarity search and clustering of dense vectors. A small sketch of this nearest neighbor lookup follows the RAG flow diagram below.
  • LLM Model: We have two models in this solution: the embeddings model that creates the embeddings, and the main LLM that takes the user input along with the retrieved context and returns an output. In this case we use the default ChatOpenAI model.
RAG Flow (Created by Author)
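
As a minimal sketch of the kind of nearest neighbor lookup a vector store performs under the hood, here is FAISS used directly, with random vectors standing in for real document embeddings (the dimension of 1536 is just illustrative):

import numpy as np
import faiss

dimension = 1536  # illustrative embedding dimensionality
document_vectors = np.random.rand(10, dimension).astype("float32")  # stand-ins for document embeddings
query_vector = np.random.rand(1, dimension).astype("float32")       # stand-in for a query embedding

index = faiss.IndexFlatL2(dimension)  # exact L2 (Euclidean) distance index
index.add(document_vectors)           # store the document vectors

distances, indices = index.search(query_vector, 3)  # retrieve the 3 closest documents
print(indices)                                      # positions of the most similar documents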

Essentially you can think of RAG as a performance enhancer for LLMs: it provides extra knowledge that the base LLM might not already have. In the next section we'll take a look at how we can implement these concepts utilizing LangChain and OpenAI.

Generative AI Application & Sample Inference

To get started you need an OpenAI API Key, which you can create at the following link. Note the pricing and rate/API limits so that you understand the cost structure. For development we work in a SageMaker Classic Notebook Instance, but any environment with OpenAI and LangChain installed should be sufficient.

import os
os.environ['OPENAI_API_KEY'] = 'Enter your API Key here'

# necessary LangChain imports
import langchain
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings.cache import CacheBackedEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# the openai client is used later for the vanilla (non-RAG) completion calls
import openai
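
As a side note, if you'd rather not hardcode the key in the notebook, a minimal alternative is to prompt for it at runtime (this is just a convenience sketch, not part of the original setup):

import os
from getpass import getpass

# prompt for the key interactively instead of hardcoding it
os.environ['OPENAI_API_KEY'] = getpass('Enter your OpenAI API key: ')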

After setting up LangChain and OpenAI, we create a local directory with ten of my popular Medium articles stored as PDFs. This will be the additional data/information we are making available to my LLM.

Medium Articles (Screenshot by Author)

As a next step, we need to load this data and create a directory where we can store the embeddings we generate. LangChain has many utilities that will automatically load and also split/chunk your data for you. Chunking is especially important: we don't want overly large chunks of text behind each embedding, since the larger the chunk, the more potential noise is introduced.

In this case we use the LangChain provided PDF loader to load and split our data.

# where our embeddings will be stored
store = LocalFileStore("./cache/")

# instantiate a loader: this loads our data, use PDF in this case
loader = PyPDFDirectoryLoader("sagemaker-articles/")

# by default the PDF loader both loads and splits the documents for us
pages = loader.load_and_split()
print(len(pages))
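
If you want finer control over chunking than load_and_split's defaults, you can split explicitly with one of LangChain's text splitters. A minimal sketch, with illustrative (untuned) chunk sizes:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # maximum characters per chunk
    chunk_overlap=100  # overlap between consecutive chunks to preserve context
)
pages = splitter.split_documents(loader.load())
print(len(pages))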

We then instantiate our OpenAI embeddings model and wrap it in a cache-backed embedder, which will create our embeddings and populate the local cache directory we created above.

# instantiate embedding model
embeddings_model = OpenAIEmbeddings()

embedder = CacheBackedEmbeddings.from_bytes_store(
    embeddings_model,
    store
)
Embeddings Generated (Screenshot by Author)
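
As a quick sanity check, you can embed a single query with the underlying model and inspect the vector's dimensionality (the question here is just an example):

# embed one query directly against the OpenAI embeddings model
sample_vector = embeddings_model.embed_query("What is Amazon SageMaker?")
print(len(sample_vector))  # e.g. 1536 for OpenAI's default text embedding model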

We then create our FAISS Vector Store and push our embedded documents.

# create vector store, we use FAISS in this case
vector_store = FAISS.from_documents(pages, embedder)
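
Before wiring up the chain, you can optionally query the store directly to confirm that relevant chunks come back, and persist the index for reuse. The query string and index path below are illustrative:

# retrieve the most similar chunks for a sample query straight from the vector store
docs = vector_store.similarity_search("What is SageMaker Serverless Inference?", k=3)
for doc in docs:
    print(doc.metadata.get("source"), doc.page_content[:100])

# the index can also be saved locally and reloaded later
vector_store.save_local("faiss_index")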

We then use a RetrievalQA Chain to bring all these moving parts together. We specify the vector store we created above and also pass in the default ChatOpenAI LLM as the model that will receive both the input and the relevant documents for context.

# this is the entire retrieval system
medium_qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    retriever=vector_store.as_retriever(),
    return_source_documents=True,
    verbose=True
)
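
Before running the full comparison, it can help to sanity-check the chain with a single question and inspect which articles were retrieved as context (the question is just an example):

# single test query: the chain returns the answer along with the retrieved source chunks
result = medium_qa_chain({"query": "What is SageMaker Serverless Inference?"})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata.get("source"))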

We can then compare the performance of the model without RAG against our RAG-based chain by passing in the same prompts and observing the results. Let's run a loop over sample prompts of varying difficulty.

sample_prompts = ["What does Ram Vegiraju write about?",
                 "What is Amazon SageMaker?",
                 "What is Amazon SageMaker Inference?",
                 "What are the different hosting options for Amazon SageMaker?",
                 "What is Serverless Inference with Amazon SageMaker?",
                 "What's the difference between Multi-Model Endpoints and Multi-Container Endpoints?",
                 "What SDKs can I use to work with Amazon SageMaker?"]

for prompt in sample_prompts:

    # vanilla OpenAI response (no access to the Medium articles)
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        max_tokens=500)

    # RAG augmented response
    response_rag = medium_qa_chain({"query": prompt})

    # print both answers side by side for comparison
    print(f"Prompt: {prompt}")
    print(f"Vanilla OpenAI: {response['choices'][0]['text'].strip()}")
    print(f"RAG chain: {response_rag['result']}")

We see the first question itself is very specific to my writing. We know the OpenAI model does not have any access or knowledge of my articles, thus it outputs a pretty random and inaccurate description.

OpenAI Response (Screenshot by Author)

Alternatively, our RAG chain has had access to some of my Medium articles and produces a somewhat accurate summary of my writing.

RAG Response (Screenshot by Author)

We can then test both approaches by asking some SageMaker specific questions. We start with a very basic question: What is Amazon SageMaker? As the OpenAI LLM has knowledge of this topic, it responds with a fairly accurate answer, comparable to our RAG-based approach.

OpenAI vs RAG Response (Screenshot by Author)

We start seeing the real benefits of RAG as the questions get more specific and difficult. An example of this is the prompt comparing the two advanced hosting options: Multi-Model Endpoints (MME) and Multi-Container Endpoints (MCE).

MME vs MCE (Screenshot by Author)

Here we see that the vanilla OpenAI response gives a completely inaccurate answer; it has no knowledge of these two recent capabilities. My Medium article on MME vs MCE, however, gives the model context around these offerings, and it is thus able to answer the query accurately.

With RAG we can augment the basic knowledge our LLM already has around SageMaker. In the next section we can look at different methods to improve this prototype that we have built.

How Can We Improve Performance?

While this is a neat solution, there's still a lot of room for improvement to scale this up. A few potential methods you can use to improve RAG based performance include the following:

  • Data Size and Quality: In this case we've provided only ten Medium articles and still see solid performance. To boost this we could also provide access to my entire set of Medium articles, or anything with the tag "SageMaker", for instance. We've also copied my articles directly without any formatting, and the PDFs themselves are very unstructured; cleaning up the data format can make chunking, and therefore performance, more optimal. NOTE: It's also essential that you rely only on resources/articles that you are allowed to use for your purpose. In this example there's no issue with my Medium articles as the source, but always ensure you are using data in an authorized manner.
  • Vector Store Optimization: In this case we've utilized the default FAISS vector store setup. Items you can tune include the speed of vector store indexing as well as the number of documents to retrieve and provide to your LLM; see the sketch after this list.
  • Fine-Tuning vs RAG: While RAG helps attain domain-specific knowledge, fine-tuning is another method to help an LLM attain a specific knowledge set. You want to evaluate your use case to see whether fine-tuning, RAG, or a combination of both makes more sense. Generally fine-tuning is very performant if you have quality data available. In this case with RAG we didn't even format or reshape our data, yet we were still able to yield strong results. With fine-tuning, data availability AND quality are essential. For a full breakdown of both options please refer to the following article.
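
As referenced above, here is a minimal sketch of what tuning the retriever could look like. The search type, k value, and temperature below are illustrative starting points rather than recommendations:

# illustrative retriever tuning: adjust how many chunks are retrieved and the search strategy
tuned_retriever = vector_store.as_retriever(
    search_type="mmr",      # maximal marginal relevance for more diverse context
    search_kwargs={"k": 5}  # number of chunks passed to the LLM
)

tuned_qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),  # lower temperature for more deterministic answers
    retriever=tuned_retriever,
    return_source_documents=True
)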

Additional Resources & Conclusion

LangChain-Samples/Medium-SageMaker-Analyzer at master · RamVegiraju/LangChain-Samples

The code for the entire example can be found at the link above. This was a fun project to evaluate the value of my articles while also showing how to integrate RAG into your Generative AI applications. In coming articles we'll continue to explore more Generative AI and LLM driven capabilities.

As always thank you for reading and feel free to leave any feedback.


If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter.
