Improving RAG Performance Using Rerankers

Introduction
Retrieval-augmented generation (RAG) is one of the first tools an engineer will reach for when building an LLM application. It's easy to understand and simple to use. The primary goal of the vector search step is to gather enough relevant context that the output of the LLM is of higher quality.
Although vector search can perform quite well out of the box, there are still many cases where the results don't hit the mark. For example, the top k results returned by vector search might not contain all the relevant information. To compensate for this, top k can be set to a larger value. However, this comes with a new set of problems.

Even though LLMs support larger context windows, there is only so much information you can fit. The higher the top k value, the more difficult it becomes to fit everything into the context. And although the retrieved chunks are sorted by cosine similarity, this doesn't guarantee that the most pertinent content will be at the top. This is partly because vector search relies on pre-computed embeddings, which may not fully capture query-specific relevance.
These limitations highlight the need for a method to further refine and re-order the retrieved results based on their relevance to the specific query at hand. This is where more sophisticated ranking techniques come into play.
Purpose of rerankers
To address the limitations of vector search, rerankers can be incorporated as an additional data filter in your RAG pipeline. At its core, a reranker is a cross-encoder model designed to compute the similarity between two pieces of text with high precision. Rerankers are trained on pairs of text with ranking-specific loss functions that directly optimize for ranking performance.
Rerankers output a score that represents the similarity between texts. This score is calculated by directly comparing the pieces of text, so it tends to be more accurate than computing semantic similarity between vectors. Rerankers can consider factors like semantic meaning, context, and even subtle implications that might be missed by vector embeddings alone.

So if rerankers are so good, why even use vector embeddings in the first place?
Rerankers, while more accurate, are significantly slower than vector similarity calculations. For this reason, rerankers are often used as a second pass after the vector similarity has been computed. This two-stage approach combines the speed of vector search with the precision of rerankers. Here is what the flow would look like:
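As a self-contained sketch of that flow, here is a minimal two-stage example using the sentence-transformers library (pip install sentence-transformers). The models and example texts below are illustrative choices, not the ones used later in this tutorial.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

documents = [
    "Calypso kept Odysseus on her island for many years.",
    "Nausicaa found Odysseus on the shore and led him to the city.",
    "Telemachus set out to find news of his father.",
]
query = "Why was Odysseus stuck with Calypso?"

# Stage 1: fast but approximate vector search with a bi-encoder.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=3)[0]

# Stage 2: slower but more precise; a cross-encoder scores each (query, chunk) pair.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, documents[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)

# Re-order the candidates by cross-encoder score, highest first.
for score, (_, doc) in sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True):
    print(f"{score:.2f}  {doc}")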

Using rerankers in practice
In this section, I'll show how to use an open-source reranker (BGE Reranker Large) on some documents, in this case, stories from the Odyssey. We'll do the following steps:
- Extract text from the document and break it into multiple chunks
- Create vector embeddings for each chunk of text
- Retrieve similar embeddings to the input query
- Rerank the embeddings to get the most relevant content
Installing and importing Python requirements
pip install PyMuPDF
pip install pytesseract
pip install pillow
pip install langchain
pip install transformers
pip install torch
pip install scikit-learn
import fitz
from PIL import Image
import pytesseract
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
import torch
For this tutorial, we'll be using the pytesseract Python package to perform OCR on the pages (pytesseract is a wrapper around the Tesseract engine, which needs to be installed on your system separately). Then we'll use LangChain's text splitter to create chunks of text.
Document parsing and chunking
def parse_document(document_path: str):
    texts = []
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=40)
    pdf_document = fitz.open(document_path)
    # Process a fixed page range from this particular PDF (skipping the front matter).
    page_numbers = list(range(1, 39))
    for page_number in page_numbers:
        page = pdf_document.load_page(page_number)
        # Render the page to an image and run OCR on it.
        pix = page.get_pixmap(dpi=300)
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        text = pytesseract.image_to_string(img)
        # Split the page text into overlapping chunks.
        chunked_texts = text_splitter.split_text(text)
        texts.extend(chunked_texts)
    return texts
What's happening here:
- We create an empty list called texts, which will hold our chunked text. We also instantiate a RecursiveCharacterTextSplitter from LangChain to make chunking easier.
- Next, using fitz, we open the PDF file.
- Then, we process the PDF page by page, rendering each page to an image. Here pytesseract is used to extract the text from each image.
- Lastly, the extracted text is chunked using the LangChain text splitter (a quick check of the splitter settings is shown below).
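If you want to sanity-check the splitter settings before running OCR on the whole document, you can call the splitter on a short sample string (the text here is just an illustration):
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=40)
sample = "Calypso of the braided tresses was a goddess feared by all men. " * 20
chunks = splitter.split_text(sample)
# Each chunk is at most about 400 characters, with roughly 40 characters of overlap.
print(len(chunks), [len(c) for c in chunks])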
Setting up the embedding model
An embedding model lets us quickly find the pieces of text that are most similar to a query. Open-source embedding models such as BGE-large can provide great results.
def setup_embedding_model():
    tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5')
    model = AutoModel.from_pretrained('BAAI/bge-large-en-v1.5')
    model.eval()
    model.to("cuda")  # assumes a CUDA-capable GPU is available
    return tokenizer, model

def create_embedding(texts, tokenizer, model):
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt').to("cuda")
    with torch.no_grad():
        model_output = model(**encoded_input)
        # Use the [CLS] token representation as the sentence embedding.
        sentence_embeddings = model_output[0][:, 0]
    # Normalize so that cosine similarity reduces to a dot product.
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.tolist()
What's happening:
- The setup_embedding_model function downloads the bge-large-en-v1.5 model and tokenizer from Hugging Face and loads them into GPU memory.
- The create_embedding function takes a list of texts as input and outputs a list of vector embeddings by passing the texts through the model (a quick check of the output shape is shown below).
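As a quick sanity check (assuming a CUDA-capable GPU, since the setup code moves the model to "cuda"), you can embed a single sentence and inspect the shape of the result:
embedding_tokenizer, embedding_model = setup_embedding_model()
embedding = create_embedding(["The wine-dark sea."], embedding_tokenizer, embedding_model)
# One embedding with 1024 dimensions (the hidden size of bge-large-en-v1.5).
print(len(embedding), len(embedding[0]))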
Testing the output with just embeddings
Before we implement the reranker portion, let's see what results we get with just embeddings. This will serve as a baseline and will allow us to compare the results with the reranker.
chunked_texts = parse_document("./odyssey_stories.pdf")
embedding_tokenizer, embedding_model = setup_embedding_model()
embeddings = create_embedding(chunked_texts, embedding_tokenizer, embedding_model)

query = "Why was Odysseus stuck with Calypso?"
query_embedding = create_embedding([query], embedding_tokenizer, embedding_model)

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity between the query and every chunk.
similarity = cosine_similarity(query_embedding, embeddings)
similarity = similarity[0]

# Sort the chunk indices from most to least similar.
indexed_numbers = list(enumerate(similarity))
sorted_indexed_numbers = sorted(indexed_numbers, key=lambda x: x[1], reverse=True)
sorted_indices = [index for index, number in sorted_indexed_numbers]

top_k = 10
print(f"Original query: {query}\n")
for i in sorted_indices[:top_k]:
    print(chunked_texts[i])
    print("\n")
What's happening:
- Using the functions defined in the previous steps, we chunk the text and create a vector embedding for each chunk.
- The input query, "Why was Odysseus stuck with Calypso?", also gets converted to an embedding so that we can check its semantic similarity against the rest of the chunks.
- Using cosine similarity, the top k chunks closest to the input query are printed out (a more compact way to do the sorting is shown right after this list).
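As an aside, the same descending sort can be written more compactly with numpy's argsort, producing an identical sorted_indices list:
sorted_indices = np.argsort(similarity)[::-1].tolist()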
Here is the output from running the code above:
Original query: Why was Odysseus stuck with Calypso?
Calypso of the braided tresses was a goddess feared by
all men. It was to her island that the piece of wreckage to
which Odysseus clung drifted on the ninth dark night after his
ship was wrecked.
At night the island looked black and gloomy, but at
morning light, when Odysseus felt life and strength coming
back to him, he saw that it was a beautiful place.
When he had reached the island of Calypso, he walked
through the meadows of violets to the cave. But Odysseus was
not there. Down by the rocky shore he sat, looking wistfully
over the wide sea, while the tears rolled down his face and
dripped on the sand. Calypso was in the cave, weaving with
her golden shuttle, and singing a sweet song. Food and wine
Odysseus knew that Calypso was a goddess that all
men feared, but he soon found that he had nothing to fear from
her, save that she should keep him in her island for evermore.
She tended him gently and lovingly until his weariness and
weakness were gone and he was as strong as ever.
For nine days and nights he was tossed by the waves.
On the night of the ninth day the mast drifted to the shores of
an island, and Odysseus, little life left in him, crawled on to
the dry land.
Then Odysseus told her of his imprisonment in the
island of Calypso, of his escape, of the terrible storm that
shattered his raft, and of how at length he reached the shore
and met with Nausicaa.
"It was wrong of my daughter not to bring thee to the
palace when she came with her maids," said the king.
But Odysseus told him why it was that Nausicaa had
bade him stay behind.
But although he lived by the meadows where the
violets and wild parsley grew, and had lovely Calypso to give
him all that he wished, Odysseus had a sad and heavy heart.
"Stay with me, and thou shalt never grow old and never
die," said Calypso.
While all of the chunks are about Odysseus and Calypso, the top 5 chunks don't really contain the information required to answer the question. Vector search finds the most similar content, which is not necessarily the most relevant content. When embeddings are created, some information in the original text gets lost. Because of this, there is a chance that the top k vectors don't contain all of the relevant information.
Adding the second stage: rerankers
To address the relevance issue, we could increase top k to a larger number. But then there would be too much information, and it might not fit inside the context window of an LLM.
To filter the most relevant content out of the top k chunks, we can use a reranker. To pair nicely with the embedding model, BAAI has also released a reranker model, BAAI/bge-reranker-large.
def setup_reranker():
    tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
    model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
    model.eval()
    model.to("cuda")
    return tokenizer, model

def run_reranker(text_pairs, tokenizer, model):
    with torch.no_grad():
        inputs = tokenizer(text_pairs, padding=True, truncation=True, return_tensors='pt', max_length=512).to("cuda")
        # Each logit is a relevance score for the corresponding (query, chunk) pair.
        scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    return scores.tolist()
reranker_tokenizer, reranker_model = setup_reranker()

# Build (query, chunk) pairs for the top k chunks returned by the vector search.
pairs = []
for index in sorted_indices[:top_k]:
    pairs.append([query, chunked_texts[index]])

scores = run_reranker(pairs, reranker_tokenizer, reranker_model)

# Sort the chunk indices by their reranker scores, highest first.
paired_list = list(zip(sorted_indices[:top_k], scores))
sorted_paired_list = sorted(paired_list, key=lambda x: x[1], reverse=True)
reranked_indices = [index for index, value in sorted_paired_list]
reranked_values = [value for index, value in sorted_paired_list]

print(f"Original query: {query}\n")
for i in reranked_indices:
    print(chunked_texts[i])
    print("\n")
What's happening:
- Similar to how the embedding model was set up, the reranker is also instantiated.
- The reranker expects pairs of text as input. pairs is a list in which each element contains the input query and the text of one of the top k chunks.
- The reranker outputs a list of scores representing the relevance between the input query and each chunk of text (see the note right after this list if you want scores in a 0-1 range). The scores and chunk indices are zipped together and sorted so that the most relevant chunks are at the top of the list.
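The raw scores are logits from the classification head, so they are unbounded. If you prefer relevance scores between 0 and 1, one common option (an optional step, not required for the ranking itself) is to pass them through a sigmoid:
probabilities = torch.sigmoid(torch.tensor(scores)).tolist()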
Here is the output after reranking:
Original query: Why was Odysseus stuck with Calypso?
But a great homesickness was breaking the heart of
Odysseus. He would rather have had one more glimpse of his
rocky little kingdom across the sea, and then have died, than
have lived for ever and for ever young in the beautiful,
flowery island.
"Stay with me, and thou shalt never grow old and never
die," said Calypso.
Day after day he would go down to the shore and stare
with longing eyes across the water. But eight years came and
went, and he seemed no nearer escape.
Then the god of the sea stirred up against him a wave
more terrible than any that had gone before, and with it smote
the raft. Like chaff scattered by a great wind, so were the
planks and beams of the raft scattered hither and thither. But
Odysseus laid hold on a plank and bestrode it, as he might
have ridden a horse. He stript off his wet clothes and wound
Calypso of the braided tresses was a goddess feared by
all men. It was to her island that the piece of wreckage to
which Odysseus clung drifted on the ninth dark night after his
ship was wrecked.
At night the island looked black and gloomy, but at
morning light, when Odysseus felt life and strength coming
back to him, he saw that it was a beautiful place.
Odysseus knew that Calypso was a goddess that all
men feared, but he soon found that he had nothing to fear from
her, save that she should keep him in her island for evermore.
She tended him gently and lovingly until his weariness and
weakness were gone and he was as strong as ever.
But although he lived by the meadows where the
violets and wild parsley grew, and had lovely Calypso to give
him all that he wished, Odysseus had a sad and heavy heart.
After reranking, we have the same chunks that the semantic search returned, but the position of each chunk has changed. The top 5 reranked chunks are much more relevant to the input query than the top results from the semantic search earlier.
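To close the loop, here is a minimal sketch of how the top reranked chunks might be placed into an LLM prompt. The prompt wording and the top_n value are arbitrary choices for illustration, not part of the reranker itself:
top_n = 3
context = "\n\n".join(chunked_texts[i] for i in reranked_indices[:top_n])
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
print(prompt)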
Conclusion
Going back to the big picture, the main reason to use RAG is to improve the output quality of the LLM. Using vector embeddings with semantic similarity is a quick and effective way to boost RAG performance. However, when working at scale with millions of documents, vector search can often fall short, and gathering the relevant context becomes difficult.
To further improve the performance of the RAG pipeline, rerankers can be easily integrated to ensure the most relevant content is fed to the LLM. Rerankers give you improved accuracy but come at the cost of additional latency. For this reason, the two-stage RAG pipeline efficiently uses both semantic search and rerankers to get a good balance of both speed and accuracy.
That's it for now, peace out.
Images
If not otherwise stated, all images are created by the author.
Works Cited
- Stories From The Odyssey: https://www.gutenberg.org/cache/epub/13725/pg13725-images.html (Project Gutenberg License)