5 Proven Query Translation Techniques To Boost Your RAG Performance


You can't be more wrong than assuming users will ask the LLM perfect questions. Rather than executing the user's query directly, what if we refine it first? That's query translation.

We built an app that lets users query all the documents our company has ever produced: PPTs, project proposals, progress updates, deliverables, documentation, and so on. It was remarkable because many such attempts in the past had fallen short. Thanks to RAG, this time it looked very promising.

We did a demo, and everyone was excited to use it. The initial rollout was for a small, selected batch of staff. But what we noticed wasn't very exciting to us.

We expected this to be a game-changer in the way we work. But most users tried the app only a few times and never came back. They dropped it as if it were a school kid's toy project.

The logs showed satisfactory results. However, we spoke to the users themselves to determine the real issue. The lessons we learned led us to query translation techniques to overcome ambiguity in user inputs.

Here's an example situation.

One user was interested in the fashion-related businesses we've advised one of our clients, "XYZ," to acquire. His input was, "What are the fashion-related acquisitions made by XYZ partners?" The app searched through our deliverable PPTs and came up with a list of a dozen companies. Yet the list was far from what the user expected: XYZ partners have acquired (say) seven fashion stores, but only four of them showed up in the list. The user, who was also a tester, knew exactly how many acquisitions there had been.

No wonder people stopped using the tool. But thanks to the phased rollout, the lost trust was recoverable.

We've made a series of changes to the app to fix this issue. One crucial update was query translation.

This post introduces the different query translation techniques we use, but it isn't a deep dive. For instance, some of these techniques can be combined with prompting techniques such as few-shot prompting and chain-of-thought to get better results. Those combinations are a topic for a future post.


Let's explore the techniques one by one. But before that, here's a basic RAG example.


Basic RAG Example

Every RAG application has at least one database, often a vector store, and a language model. The basic idea behind RAG is simple: before the LLM answers the user's question from its prior knowledge alone, the app retrieves contextual information from the database and passes it to the model so it can generate a more accurate response.

The following diagram illustrates the simplest RAG app possible.

Basic RAG application Workflow – Image by the author.

In a simple RAG application, there is only one call to your LLM. The model can be an OpenAI GPT model, Cohere, or even a locally hosted model.
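Swapping the model backend is usually a one-line change. Here's a minimal sketch, assuming the relevant integration packages are installed and, for the local option, an Ollama server is running; the model names are only examples.

# Hosted model via OpenAI
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0.5)

# Or a locally hosted model served by Ollama (assumes the server is running)
# from langchain_community.chat_models import ChatOllama
# llm = ChatOllama(model="llama3")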

The following code implements the steps in the diagram. We will use this as our base to build on the other techniques in this post.

Python"># This is to securely load our secrets
from dotenv import load_dotenv
load_dotenv()

# 1. Load the content
# -----------------------------------------
import bs4
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_paths=("https://docs.djangoproject.com/en/5.0/topics/performance/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            id="docs-content"
        )
    ),
)
doc_content = loader.load()

# 2. Indexing
# -----------------------------------------
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=200
)
docs = text_splitter.split_documents(doc_content)

vector_store = Chroma.from_documents(documents=docs, embedding=OpenAIEmbeddings())
retriever = vector_store.as_retriever()

# 3. LLM 
# -----------------------------------------
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0.5)

# 4. RAG Chain
# -----------------------------------------
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

prompt = """
Answer the question in the below context:
{context}

Question: {question}
"""

prompt_template = ChatPromptTemplate.from_template(prompt)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)

# 5. Invoking the chain
# -----------------------------------------
response = chain.invoke(
    "How can I improve site speed?",
)

print(response)

In the above code, we've used a web-based loader to load a page from Django's documentation and store it in a Chroma vector store. Instead of that page, you can play around with different web pages, local text files, PDFs, and much more.
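For example, here's a hedged sketch of swapping the web loader for a local text file or a PDF; the file paths are hypothetical, and the PDF loader needs the pypdf package installed. The rest of the indexing code stays the same.

# Hypothetical alternatives to the WebBaseLoader above
from langchain_community.document_loaders import TextLoader, PyPDFLoader

# A local text file
doc_content = TextLoader("internal_notes.txt").load()

# Or a PDF (requires the pypdf package)
# doc_content = PyPDFLoader("project_proposal.pdf").load()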

Since we don't use a sophisticated retrieval technique here, we pass the retriever straight to the final RAG chain. In the subsequent techniques, we'll pass a retrieval chain instead of the retriever itself. The rest of this article is about how we construct those retrieval chains.



Step-back prompting

Generate consistent answers that don't contradict the broader context.

Step-back prompting is very similar to basic RAG. Instead of retrieving documents with the user's initial question, we retrieve them with a broader, "stepped-back" version of that question.

The broader question captures more contextual information than the specific one. As a result, the final LLM can give the user an answer that is more helpful and doesn't contradict the broader context.

This is often useful when the initial queries are too specific and detailed but lack the overall view.

Step-back prompting workflow – image by the author.

Here's the code implementation for step-back prompting. Notice that we now build a retrieval chain, where we passed the retriever itself in the basic RAG example.

# 4. RAG Chain
# -----------------------------------------
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# 4.1 Step back prompting
step_back_prompt = """
You are an expert software engineer. 
Your task is to rephrase the given question into a more general form that is easier to answer.

# Example 1
Question: How to improve Django performance?
Output: what factors impact web app performance?

# Example 2
Question: How to optimize browser cache in Django?
Output: What are the different caching options?

Question: {question}
Output:
"""

step_back_prompt_template = ChatPromptTemplate.from_template(step_back_prompt)

retrieval_chain = step_back_prompt_template | llm | StrOutputParser() | retriever

# 4.2 RAG chain
prompt = """
Answer the question in the below context.
Your response should be comprehensive and not contradicted with the following context.
If the context is not relevant to the question, say "I don't know":
{context}

Question: {question}
"""

prompt_template = ChatPromptTemplate.from_template(prompt)

rag_chain = (
    {"context": retrieval_chain, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)

Step-back prompting is helpful for applications where the broader context is critical. The LLM would provide consistent answers to related questions.
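To sanity-check the translation step, you can run the query-rewriting part of the chain on its own. A minimal sketch, reusing the objects defined above; the sample question is only an illustration.

# Inspect the broader question the LLM generates before retrieval
step_back_question = (step_back_prompt_template | llm | StrOutputParser()).invoke(
    {"question": "How to optimize browser cache in Django?"}
)
print(step_back_question)

# The full chain is invoked exactly like the basic example
print(rag_chain.invoke("How to optimize browser cache in Django?"))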

HyDE (Hypothetical Document Embedding)

Generate contextually rich answers with relevant sources

HyDE is a recent and popular technique for document retrieval. The idea is to have the LLM generate a hypothetical document, an answer written from its prior knowledge, and then use that document, rather than the raw question, to retrieve relevant context from the vector store.

A good use case for HyDE is when users describe their problem in layman's terms while the information in the vector store is very technical. The hypothetical document generated by the LLM contains more of the technical keywords needed to retrieve the relevant information.

For instance, a query like 10 ways to improve Django's performance would provide an all-rounded answer that includes the cost implications, caching, compression, etc.

HyDE document retrieval process – image by the author.

Here's the code implementation for the above diagram. This time, I've only provided the snippet that re-creates the retrieval chain with HyDE.

# 4.1 HyDE prompting
hyde_prompt = """
You're an AI language assistant.
Your task is to write a short, hypothetical passage that answers the question below, as if it were taken from technical documentation.
The passage is only used to retrieve relevant documents, so focus on concrete technical details and keywords.
Don't explain what you're doing. Only provide the passage.
Question: {question}
Passage:
"""

hyde_prompt_template = ChatPromptTemplate.from_template(hyde_prompt)

retrieval_chain = hyde_prompt_template | llm | StrOutputParser() | retriever
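The final RAG chain is unchanged from the basic example; only the retrieval step differs. As a rough sketch, assuming prompt_template is the answer prompt defined earlier:

# 4.2 RAG chain: same as before, but the context now comes from documents
# retrieved with the hypothetical answer rather than the raw question
rag_chain = (
    {"context": retrieval_chain, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("My Django site feels slow. How do I fix it?"))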


Multi-query

Retrieve more documents by overcoming the limits of distance-based similarity search and get more relevant answers.

Multi-query is a technique that helps overcome the issues with distance-based search in vector stores. Most vector stores use cosine similarity to retrieve vectorized documents. This works well as long as the query and the documents are reasonably similar. However, the retrieval process falls short when that distance-based similarity isn't there.

In the multi-query approach, we ask the LLM to create multiple versions of the same query. For instance, a query like How to speed up Django apps will be translated to another version, "How to improve Django-based web apps' performance?" These queries together will retrieve more relevant documents from the vector store.

As an intermediary step, we must deduplicate the documents before passing them to the final RAG-LLM. There's a good chance that more than one query retrieves the same documents, and passing duplicates would eat into the LLM's token limit for no good reason.

Multi-query retrieval workflow – image by the author.

The code implementation also has an additional function that deduplicates the documents. The rest is the same as in the other methods.

from langchain.load import loads, dumps

def get_unique_documents(documents: list[list]) -> list:
    # Flatten list of lists, and convert each Document to string
    flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]
    # Get unique documents
    unique_docs = list(set(flattened_docs))
    # Return
    return [loads(doc) for doc in unique_docs]

# 4.1 Multi-query prompting
multi_query_prompt = """
You are an AI language model assistant. 
Your task is to create five versions of the user's question to fetch documents from a vector database. 
By offering multiple perspectives on the user's question, your goal is to assist the user in overcoming some of the restrictions of distance-based similarity search. 
Give these alternative questions, each on a new line.
Question: {question}
Output:
"""

multi_query_prompt_template = ChatPromptTemplate.from_template(multi_query_prompt)

retrieval_chain = (
    multi_query_prompt_template
    | llm
    | StrOutputParser()
    | (lambda x: x.split("\n"))
    | retriever.map()
    | get_unique_documents
)
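Because get_unique_documents returns a plain list of documents, the retrieval chain's output has the same shape as a regular retriever's and plugs into the final RAG chain exactly as before. Here's a quick, illustrative check of how many unique documents the generated query variants pull back; the sample question is just an example.

# How many unique documents do the generated query variants retrieve?
unique_docs = retrieval_chain.invoke({"question": "How to speed up Django apps?"})
print(len(unique_docs))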


RAG-Fusion

More relevant documents play a more significant role in the answer.

RAG fusion is similar to multi-query on the document retrieval front. Once again, we ask the LLM to generate different versions of the initial query. We then retrieve documents individually for these versions and combine them.

However, while combining and deduping the documents, we also rank them according to their relevancy. Here's a diagrammatic representation of the RAG-fusion process.

RAG-fusion workflow – image by the author.

Instead of only deduping, we sort the documents using a ranking system. Reciprocal rank fusion (RRF) is a clever approach to ranking them.

If multiple query versions retrieve the same document near the top of their results, RRF ranks it high. If a document appears in only one query version's results, and far down at that, RRF ranks it low. Each appearance at (zero-based) rank r adds 1/(r + k) to a document's score; with k=60, a document that two variants both rank first scores 2/60 ≈ 0.033, while one ranked fourth by a single variant scores only 1/63 ≈ 0.016. This way, the most relevant information is prioritized.

def reciprocal_rank_fusion(results: list[list], k=60):
    """ Reciprocal_rank_fusion that takes multiple lists of ranked documents 
        and an optional parameter k used in the RRF formula """

    fused_scores = {}

    for docs in results:
        for rank, doc in enumerate(docs):
            doc_str = dumps(doc)
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            previous_score = fused_scores[doc_str]
            fused_scores[doc_str] += 1 / (rank + k)

    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]

    return reranked_results

# 4.1 RAG-fusion prompting
multi_query_prompt = """
You are an AI language assistant. 
Your task is to generate 5 different versions of the user question.
By doing so, you're helping the user to overcome the limitations of distance-based similarity search.
Provide these alternative questions separated by newlines.
Question: {question}
Output:
"""

multi_query_prompt_template = ChatPromptTemplate.from_template(multi_query_prompt)

retrieval_chain = (
    multi_query_prompt_template
    | llm
    | StrOutputParser()
    | (lambda x: x.split("\n"))
    | retriever.map()
    | reciprocal_rank_fusion
)
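One detail to watch: reciprocal_rank_fusion returns (document, score) tuples rather than a plain document list. Here's a minimal sketch of stripping the scores before building the final prompt context; format_fused_context is a helper I've added for illustration, and prompt_template is the answer prompt from earlier.

def format_fused_context(reranked_results: list) -> str:
    # Drop the RRF scores and join the document texts into one context string
    return "\n\n".join(doc.page_content for doc, _score in reranked_results)

rag_chain = (
    {"context": retrieval_chain | format_fused_context, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)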


Decomposition

When answering complex problems, let the LLM break them down into pieces and construct the answer step by step.

There are situations where it's better not to dive straight into answering. An excellent approach to solving more complex tasks is to break the problem into pieces and answer them individually.

This isn't just an LLM technique, right?

Yes, we're trying to break the initial question into multiple sub-questions in query decomposition. Answering these sub-questions will provide bits and pieces to answer the initial query.

Query decomposition in RAG – image by the author.

As you can see in the diagram, we retrieve relevant documents for each sub-question and answer them separately. We then pass these question-and-answer pairs to the final RAG-LLM, which now has more granular information to solve a complex problem.

# 4.1 Decomposition prompting
decomposition_template = """You are an AI language assistant. 
Your task is to break the following question into five sub-questions.
By doing so, you're helping the user construct the final answer progressively.
Provide these sub-questions separated by newlines.
Original question: {question}
Output: 
"""

decomposition_prompt_template = ChatPromptTemplate.from_template(decomposition_template)

def query_and_combine(questions: list[str]) -> str:
    # Answer each sub-question with the basic RAG chain from the first example,
    # then combine the results into question-and-answer pairs.
    qa_pairs = []
    for q in questions:
        r = basic_rag_chain.invoke(q)
        qa_pairs.append((q, r))

    qa_pairs_str = "\n".join([f"Q: {q}\nA: {a}" for q, a in qa_pairs]).strip()

    return qa_pairs_str

retrieval_chain = (
    {"question": RunnablePassthrough()}
    | decomposition_prompt_template
    | llm
    | StrOutputParser()
    | (lambda x: x.split("\n"))
    | query_and_combine
)
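The retrieval chain above ends with the Q&A pairs string, which then becomes the context for one final synthesis call. Here's a rough sketch; the synthesis prompt is my own wording, not a fixed recipe.

# 4.2 Final synthesis: answer the original question using the sub-question Q&A pairs
synthesis_template = """Use the following question-and-answer pairs as context to answer the original question.

{context}

Original question: {question}
"""
synthesis_prompt_template = ChatPromptTemplate.from_template(synthesis_template)

rag_chain = (
    {"context": retrieval_chain, "question": RunnablePassthrough()}
    | synthesis_prompt_template
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("How can I improve my Django site's performance?"))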

Final thoughts

There are many steps between a demo app and production, and query translation is one you can't skip.

The problems we solve differ in complexity, users' imperfect querying needs to be taken into account, and the retrieval process's drawbacks have to be addressed. That's a lot to consider.

There is no single right way to pick the best query translation technique. In real apps, you might have to combine multiple techniques to get the desired output.


Thanks for reading, friend! Say Hi to me on LinkedIn, Twitter, and Medium.
