How to Build Helpful RAGs with Query Routing.

A single prompt cannot handle everything, and a single data source may not be suitable for all the data.
Here's something you often see in production but not in demos:
You need more than one data source to retrieve information. More than one vector store, graph DB, or even an SQL database. And you need different prompts to handle different tasks, too.
If so, we have a problem. Given unstructured, often ambiguous, and poorly formatted user input, how do we decide which database to retrieve data from?
If, for some reason, you still think it's too easy, here's an example.
Suppose you have a tour-guiding chatbot, and one traveler asks for an optimal travel schedule between five places. If you let the LLM answer on its own, it may hallucinate, as LLMs aren't good with location-based calculations.
Instead, if you store this information in a graph database, the LLM can generate a query to fetch the shortest travel path between the points. Executing this query gives the LLM the correct information, which it can then use to make helpful comments.
This example is complex, but even simpler production apps may need multiple vector stores. For instance, your app may be a multi-modal RAG. You may deal with different data types (text, images, audio) and use a different vector store for each.
I hope I've convinced you that multiple data sources and routing are crucial. This article will discuss two fundamental techniques often used to route queries.
In real-life apps, query routing is often combined with query translation techniques such as query decomposition. This post will also touch on this. Here are some of my other posts on this topic for a refresher.
Why Does Position-Based Chunking Lead to Poor Performance in RAGs?
5 Proven Query Translation Techniques To Boost Your RAG Performance
Advanced Recursive and Follow-Up Retrieval Techniques For Better RAGs
An example to start…
Before that, let's establish a fictitious example. Suppose you've built a chatbot that answers employees' questions about administration, for instance, their salary or performance-related queries.
We need to route our queries to the HR vector store if they are about employee benefits, performance evaluations, leave policies, or any topic directly related to human resources. On the other hand, if the query pertains to salary, payroll details, expense reimbursements, or other financial matters, it should be directed to the accounts vector store.
Here's the setup to test the rest of our work.
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma


def create_retriever_from_file(file_name, collection_name):
    # Load the text file and split it into small, overlapping chunks
    data = TextLoader(file_name).load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
    splits = text_splitter.split_documents(data)
    # Use a separate collection per file so the two stores don't share documents
    vectorstore = Chroma.from_documents(
        splits, embedding=OpenAIEmbeddings(), collection_name=collection_name
    )
    return vectorstore.as_retriever()


hr_retriever = create_retriever_from_file("HR_Docs.txt", "hr_docs")
accounts_retriever = create_retriever_from_file("Accounts_Docs.txt", "accounts_docs")
In the above code, we create two vector stores, one for HR and the other for Accounts. Since we don't work with vector stores directly but use them as retrievers, we make the function return the vector store as a retriever object. I've used a text file for this example, but this could even be a data pipeline in real apps.
A Layman's Approach for Query Routing
A simple approach for routing is keyword filtering. You could also use a pre-trained SVM to predict the correct vector store for retrieval based on the query. But you get the point, right? We try to find some words, and our existing knowledge tells us where the query should go.
Here's a code implementation.
# Define keywords for HR and Accounts queries.
# Keep them lowercase, because the query is lowercased before matching.
HR_KEYWORDS = [
    "benefits",
    "performance",
    "evaluations",
    "leave",
    "policies",
    "human resources",
    "hr",  # substring match: this also fires inside words such as "three"
]
ACCOUNTS_KEYWORDS = [
    "salary",
    "payroll",
    "expense",
    "reimbursements",
    "finance",
    "financial",
    "pay",
]


# Function to route query
def route_query(query: str):
    # Convert query to lowercase for case-insensitive matching
    query_lower = query.lower()
    # Check if any HR keywords are in the query
    if any(keyword in query_lower for keyword in HR_KEYWORDS):
        return hr_retriever
    # Check if any Accounts keywords are in the query
    elif any(keyword in query_lower for keyword in ACCOUNTS_KEYWORDS):
        return accounts_retriever
    # If no keywords are matched, return None so the caller can handle it
    else:
        return None


# Example queries
queries = [
    "What are the leave policies?",
    "How do I apply for performance evaluations?",
    "Can I get a breakdown of my salary?",
    "Where do I submit expense reimbursements?",
    "Tell me about the HR benefits available.",
]

# Route each query and retrieve the response
for query in queries:
    retriever = route_query(query)
    if retriever is None:
        print(f"Query: {query}\nResponse: Unknown category, please refine your query.\n")
        continue
    response = retriever.invoke(query)[0].page_content
    print(f"Query: {query}\nResponse: {response}\n")
The above code gets the job done but falls short in many ways.
First, it only looks for keyword matches. What if the user words their concern differently? Second, if we use an ML model to predict the routes instead, we need a large enough training set.
This is why we use more advanced techniques, such as LLM-based routing and semantic-similarity search, which we discuss in this post.
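The first limitation is easy to reproduce without any vector store. The sketch below is a simplified stand-in for the router above: it returns route names instead of retrievers, and the `ROUTES` dictionary and `keyword_route` function are hypothetical names used only for illustration. It matches whole words rather than substrings, which avoids accidental hits such as "pay" inside "payload", yet it still misses a paraphrased question.

```python
import re

# Hypothetical keyword sets, loosely mirroring the lists used above
ROUTES = {
    "hr": {"benefits", "performance", "leave", "hr"},
    "accounts": {"salary", "payroll", "expense", "pay"},
}


def keyword_route(query):
    """Return the first route whose keywords share a whole word with the query."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    for route, keywords in ROUTES.items():
        if words & keywords:
            return route
    return None  # no keyword matched


# A direct keyword hit is routed as expected
print(keyword_route("Can I get a breakdown of my salary?"))
# "payload" contains "pay", but whole-word matching does not fire on it
print(keyword_route("How do I upload a payload to the test server?"))
# A leave-related question phrased without any keyword is not routed at all
print(keyword_route("How much vacation time do I have left?"))
```

The last query is clearly about leave, but no amount of keyword tuning covers every way a person might phrase it.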
Let the LLM decide the route.
By replacing the keyword search or the ML model with an LLM, we can gain a massive advantage over the above approach.
The LLM's general knowledge is usually sufficient to direct the query to the correct retriever. It should handle differently worded queries, misspellings, and ambiguity very well.
Here's a diagram that sums it up:

One best practice to consider here is using a structured output. A structured output gives us unambiguous responses and tells the LLM exactly which options it can choose from.
Let's look at a code implementation.
from pydantic import BaseModel, Field
from typing import Optional, Literal
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough


# Section 1: Setup LLM and Configure Structured Output
class DataSource(BaseModel):
    datasource: Optional[Literal["hr", "accounts"]] = Field(
        title="Organization data source",
        description="Our organization bot has two data sources: HR and accounts",
    )


llm = ChatOpenAI()
structured_routing_llm = llm.with_structured_output(DataSource)

# Section 2: Routing Prompt Template
routing_prompt_template = ChatPromptTemplate.from_template("""
You are good at routing questions to either accounts or HR departments.
Which is the best department to answer the following question?
If you can't determine the best department, respond with "I don't know".
question: {question}
department:
""")
routing_chain = routing_prompt_template | structured_routing_llm

# Section 3: Define Retriever Based on the Routed Department
hr_prompt_template = ChatPromptTemplate.from_template("""
You are a human resources professional at Fictional, Inc.
Respond to the following employee question in the context provided.
If you can't answer the question with the given context, please say so.
context: {context}
question: {question}
""")
accounts_prompt_template = ChatPromptTemplate.from_template("""
You are an accounts professional at Fictional, Inc.
Respond to the following employee question in the context provided.
If you can't answer the question with the given context, please say so.
context: {context}
question: {question}
""")


def get_retriever(datasource):
    # Map the routed label to its retriever and prompt
    if datasource == "hr":
        print("HR")
        return hr_retriever, hr_prompt_template
    elif datasource == "accounts":
        print("Accounts")
        return accounts_retriever, accounts_prompt_template
    # The router could not decide on a department
    return None, None


# Section 4: Answer the Question Using the Appropriate Chain
def answer_the_question(question: str) -> str:
    # Route once, then pass only the chosen label on
    routing_output = routing_chain.invoke({"question": question})
    retriever, prompt_template = get_retriever(routing_output.datasource)
    if retriever is None:
        return "I don't know which department can answer this question."
    chain = (
        {"question": RunnablePassthrough(), "context": retriever}
        | prompt_template
        | llm
        | StrOutputParser()
    )
    return chain.invoke(question)


# Example usage
answer_the_question("How do I change my salary deposit information?")
>> 'Accounts'
>> 'You can change your salary deposit information by logging into the accounts portal and navigating to the payroll section. From there, you can enter your new bank account information and save the changes. Your salary will then be deposited into the new account each pay period.'
The above code has four sections and a usage example. The first section defines a Pydantic model to tell the LLM the needed output structure. This time, the output is a DataSource object rather than a regular text response.
Section two is where we define the router. In the prompt, we ask the model to say "I don't know" so that it doesn't attempt to route any random question. Because the datasource field is optional, an undecided answer can come back as an empty value instead of a department.
Sections three and four pick the correct retriever object, fetch relevant documents, and answer the user's question in the retrieved context.
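Since the router may come back undecided, it helps to normalize its label before choosing a retriever. Here is a minimal sketch under a few assumptions: `Route` and `resolve_route` are hypothetical names, the labels match the two departments used in this post, and what you do with an unresolved route (ask the user to rephrase, fall back to a default store) is up to your app.

```python
from typing import Literal, Optional, get_args

# Hypothetical label type, matching the two departments used in this post
Route = Literal["hr", "accounts"]


def resolve_route(label: Optional[str], default: Optional[str] = None) -> Optional[str]:
    """Normalize a router label; return `default` when it is missing or unknown."""
    if label is None:
        return default
    cleaned = label.strip().lower()
    return cleaned if cleaned in get_args(Route) else default


print(resolve_route("HR "))                # a valid label, cleaned up
print(resolve_route("I don't know"))       # not a known department
print(resolve_route(None, default="hr"))   # undecided, so use the fallback
```

Keeping this check in one place means the rest of the chain only ever sees a known department or an explicit "no route" value.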
Drawbacks of LLM-based logical query routing
LLM-based logical routing is robust when the user's question is ambiguous. However, we also need to address its drawbacks.
The most significant issue with using an LLM for routing is that the LLM's prior knowledge might not be helpful for niche use cases. Most publicly available LLMs are trained on general knowledge. They might not understand organization-specific acronyms, proprietary software, etc.
Also, an LLM's output may not be consistent. Although an LLM can route ambiguous queries more effectively, it gets confused sometimes. It might also route the same query to different sources on different runs, which calls its reliability into question.
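One way to make this inconsistency visible is to ask the router several times and only accept a route that wins a clear majority. This is a sketch of that idea, not something the code above does: `majority_route` and `flaky_router` are hypothetical names, and the flaky router is a stand-in for a real LLM call so the example runs offline. Note that every extra vote is an extra LLM call, so this trades cost for confidence.

```python
from collections import Counter
from itertools import cycle


def majority_route(route_fn, question, n=5, min_votes=3):
    """Call route_fn n times; return the winning label, or None without enough agreement."""
    votes = Counter(route_fn(question) for _ in range(n))
    label, count = votes.most_common(1)[0]
    return label if count >= min_votes else None


# Stand-in for an LLM router that flips between answers (hypothetical)
_answers = cycle(["accounts", "hr", "accounts", "accounts", "hr"])


def flaky_router(question):
    return next(_answers)


# Three of the five votes agree, so that route is accepted
print(majority_route(flaky_router, "Who approves my travel budget?"))
```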
Semantic query routing
This approach is pretty straightforward. We have a passage that represents each data source. Using a distance-based approach, we compare the user's input and the passage and find the most similar data source.
As you might have guessed, the passage must accurately represent the data source for this to succeed. We often use the prompt as the passage against which the user's question is compared.
However, the most interesting thing about semantic routing is that we can use organization-specific terms in the prompt. Thus, semantic routing can be perfect for private chatbots.
For instance, suppose you have proprietary software, "MySecret," which allows employees to talk about their concerns privately. The LLM-based approach has no clue what it means. But if the HR passage mentions MySecret, semantic similarity can route such questions correctly.
Here's an example workflow:

As we see in the diagram, the similarity-based prompt picker compares the question and the prompts and selects the prompt closest to the question. Depending on the chosen prompt, the vector store for retrieval is selected.
Here's the full code implementation for semantic query routing for the same scenario.
from langchain_community.utils.math import cosine_similarity
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Section 1: Define the prompts for each data source and embed them
hr_template = """You're a human resources professional at Fictional, Inc.
Use the context below to answer the question that follows.
If you need more information, ask for it.
If you don't have enough information in the context to answer the question, say so.
context: {context}
question: {query}
Answer:
"""
accounts_template = """You're an accounts manager at Fictional, Inc.
Use the context below to answer the question that follows.
If you need more information, ask for it.
If you don't have enough information in the context to answer the question, say so.
context: {context}
question: {query}
Answer:
"""
prompt_templates = [hr_template, accounts_template]
openai_embeddings = OpenAIEmbeddings()
prompt_embeddings = openai_embeddings.embed_documents(prompt_templates)


# Section 2: Create the similarity-based prompt picker
def find_most_similar_prompt(inputs):
    # Embed the question
    query_embedding = openai_embeddings.embed_query(inputs["query"])
    # Pick the most similar prompt
    similarity = cosine_similarity([query_embedding], prompt_embeddings)[0]
    best_match = prompt_templates[similarity.argmax()]
    print(
        "Directing to the Accounts Department"
        if best_match == accounts_template
        else "Directing to the HR Department"
    )
    # Also pick the retriever
    retriever = accounts_retriever if best_match == accounts_template else hr_retriever
    # Fetch the relevant documents and join them into a context string
    docs = retriever.invoke(inputs["query"])
    context = "\n\n".join(doc.page_content for doc in docs)
    # Create the prompt template with the chosen prompt and the retrieved context
    prompt_template = PromptTemplate.from_template(
        best_match, partial_variables={"context": context}
    )
    return prompt_template


# Section 3: Define the full RAG chain
chain = (
    {"query": RunnablePassthrough()}
    | RunnableLambda(find_most_similar_prompt)
    | ChatOpenAI()
    | StrOutputParser()
)

# Execute the chain
print(
    chain.invoke(
        """
        I need more budget to buy the software we need.
        What should I do?
        """
    )
)
>> Directing to the Accounts Department
>> As an accounts manager at Fictional, Inc., you should create a budget proposal outlining the software needed, its cost, and the potential benefits to the company. Present this proposal to the appropriate department or upper management for their review and approval. Additionally, you can also explore cost-saving options or negotiate with the software provider for a better deal.
In the above code, we define a function that does the routing. We keep a copy of the embedded prompts separately. As new user input comes in, we embed that, too, and calculate the cosine similarity between it and each of the embedded prompts. The most similar prompt and its retriever are used to create the prompt template.
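The similarity step is easier to follow in isolation. The sketch below uses made-up three-dimensional vectors in place of real embeddings, which typically have hundreds or thousands of dimensions, and a hand-written `cosine` helper instead of a library call. The names `route_vectors` and `pick_route` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" of the two prompts (made-up numbers for illustration)
route_vectors = {
    "hr": [0.9, 0.1, 0.2],
    "accounts": [0.1, 0.9, 0.3],
}


def pick_route(query_vector):
    """Score the query against every route and return the best one with all scores."""
    scores = {name: cosine(query_vector, vec) for name, vec in route_vectors.items()}
    return max(scores, key=scores.get), scores


# A query vector that points in roughly the same direction as the accounts prompt
best, scores = pick_route([0.2, 0.8, 0.1])
print(best, scores)
```

Cosine similarity only compares direction, not length, so a short question and a long prompt can still score highly as long as they point the same way in embedding space.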
Drawbacks of Semantic Routing
The main drawback of semantic routing is that the embedding model's maximum input size limits us. For OpenAI's embedding models, this is roughly 8K tokens. This isn't a problem for smaller tasks, but a big organization might have many private acronyms. So, if we have more private apps like "MySecret," which we discussed, describing them all will take up more tokens in the prompt.
Besides the token limit, larger prompts have another problem. Since we calculate the similarity between user input and the prompt, the similarity scores may not be accurate if the prompt is too large.
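This dilution effect can be illustrated with a deliberately crude model. The sketch below uses word counts instead of real embeddings, so treat it as an analogy rather than a measurement: embedding models behave differently, and the passages, query, and helper names here are all made up for illustration. It compares one query against a short passage and against the same passage padded with generic instructions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase words; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


short_passage = "salary payroll expense reimbursements"
# The same passage, padded with generic instructions that say nothing about the topic
long_passage = short_passage + " " + "use the context below to answer the question that follows " * 3
query = "where do i submit expense reimbursements"

short_score = cosine(bag_of_words(query), bag_of_words(short_passage))
long_score = cosine(bag_of_words(query), bag_of_words(long_passage))
print(round(short_score, 3), round(long_score, 3))
```

Both passages share the same two topic words with the query, yet the padded one scores far lower because the generic text dominates its vector. One option worth testing is to embed a short description of each data source for routing and keep the full prompt only for answering.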
Also, semantic routing's ability to handle complex queries about private terms is questionable. Most MySecret-related queries should go to HR, since the app is a channel for employee concerns. But if someone asks why the MySecret app loads slowly, the query should go to the IT team. A semantic similarity approach might fail to route such queries.
Final thoughts
A single prompt cannot handle everything, and a single data source may not be good for all the data.
RAG apps often need different vector stores and prompts. A router that routes queries to the correct vector store is also needed.
Logical and semantic routing are the two popular approaches. We've walked through both with code examples and covered each method's drawbacks.
We can do a few things to address the drawbacks, such as combining them with query decomposition and generating multiple answers. But those are for a future post.
Thanks for reading, friend! Say Hi to me on LinkedIn, X, and Medium.