How I Used Clustering to Improve Chunking and Build Better RAGs
Semantic chunking crumbled when sentences far apart in the text shared the same meaning.
My first attempt at semantic chunking collapsed when sentences far apart in the text shared too much meaning. The agentic alternative, however, is costly and too slow for most cases. I wanted a sweet spot between the two: reasonably accurate, yet cheaper and faster.
RAG apps depend heavily on chunking strategies. Better chunks lead to better responses. There are many ways to chunk a text. The easiest and most popular is recursive character chunking; the slightly more complicated yet helpful one is semantic chunking. The most human-like approach is agentic chunking.
If you're new, look at my prior posts about chunking.
How to Achieve Near Human-Level Performance in Chunking for RAGs
Why Does Position-Based Chunking Lead to Poor Performance in RAGs?
In this post, I'll share why I use clustering as an effective alternative to semantic and agentic chunking. Let's start by understanding how semantic chunking works.
Semantic Chunking, in short
Semantic chunking splits the document wherever the semantic meaning of two consecutive sentences differs significantly.
For instance, when you talk about climate change, you may start with rising temperatures, then you could speak about the whale population, and then go to politics. A position-based chunking method like recursive character split doesn't care about the themes in your text; it splits it at a fixed token length, no matter what.
Semantic chunking, in contrast, splits the text quite nicely. It keeps a chunk going as long as you stay on one topic and starts a new chunk only when the topic changes.
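To make that concrete, here's a minimal sketch of how semantic chunking decides where to split. The embed callable and the 0.8 similarity threshold are placeholder assumptions, not a real library API; real implementations tune the threshold.

import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.8) -> list[str]:
    # Embed every sentence, then compare each consecutive pair
    vectors = [np.array(v) for v in embed(sentences)]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = vectors[i - 1], vectors[i]
        similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if similarity < threshold:  # the topic shifted: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks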
The struggle comes when you revert to an older topic. In the same example, suppose you once again talk about rising temperatures. Semantic chunking creates a new chunk, even though an existing chunk already covers this theme.
Agentic chunking, in short
We often suggest using an agentic technique—or, more precisely, an agent-like approach—to overcome this problem.
In an agentic approach, we process the text sentence by sentence. An LLM decides whether the sentence belongs to an existing group of similar sentences. If no suitable group exists, it creates a new one.
One necessary tweak before you start processing is propositioning: converting each sentence into a standalone statement. Put simply, propositioning replaces pronouns such as "he," "she," and "they" with their actual referents. For example, in "Whales are struggling. They can't find food," the second sentence becomes "Whales can't find food."
However, as you might have guessed, this approach depends heavily on LLM calls. If you've followed my previous post, there are at least two LLM calls per sentence: one to decide which chunk best matches the sentence, and another to update the chunk's summary and title after the sentence is added. Sometimes there's an additional call to create a new chunk.
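In rough Python, the loop looks something like the sketch below. Here find_best_chunk and update_summary are hypothetical stand-ins for those LLM calls, not functions from any library.

def agentic_chunking(propositions: list[str]) -> list[dict]:
    chunks = []
    for prop in propositions:
        # LLM call 1: which existing chunk, if any, fits this proposition?
        match = find_best_chunk(prop, chunks)
        if match is None:
            # Occasional extra LLM call: create a brand-new chunk
            chunks.append({"title": prop, "sentences": [prop]})
        else:
            match["sentences"].append(prop)
            # LLM call 2: refresh the chunk's summary and title
            update_summary(match)
    return chunks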
This isn't good for two reasons. The first, obviously, is cost: more LLM calls mean more money. You can reduce the cost by using smaller models (such as GPT-4o-mini) for these tasks, but that's a trade-off you must weigh against your specific needs.
The other unfavorable reason is latency. Unless you use a locally hosted model, network latency makes chunking even a decent-sized document take a long time.
The clustering approach
I wanted to use a cheap and quick technique that overcomes the issues of semantic chunking. It's a kind of middle ground between semantic and agentic chunking.
I tried using K-Means clustering on the vector versions of the sentences—sorry, propositions—and it worked!
To start with, here's a synthetic text blob to play with.
The Earth is warming, and the consequences are becoming increasingly dire. Rising temperatures are disrupting ecosystems, with the oceans being among the hardest hit. The warming seas are threatening whale populations, as their migratory patterns shift and their food sources dwindle. Entire species are at risk as the delicate balance of marine life is thrown off by the relentless heat.
On land, the effects are just as devastating. Communities are being displaced by the increasing frequency of extreme weather events and rising sea levels. These climate refugees are forced to leave their homes behind, seeking shelter in regions ill-prepared to accommodate them.
Despite the mounting evidence, political leaders remain divided on climate action. Some push for immediate change, advocating for aggressive policies to curb emissions, while others downplay the severity of the crisis, prioritizing short-term economic gains over long-term survival.
Yet, the temperature keeps rising, driving home the urgency of the situation. Each fraction of a degree brings us closer to irreversible damage, and the window to act is closing rapidly. The future depends on the choices we make now.
As we now know, the first step is to convert these sentences into self-explanatory propositions. The following lines of code can do this.
from langchain import hub
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

# Any capable chat model works here; I use an OpenAI one
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def create_propositions(paragraph: str) -> list[str]:
    # A well-tested propositioning prompt hosted on the LangChain hub
    propositioning_prompt = hub.pull("wfh/proposal-indexing")

    # Structured output: force the LLM to return a clean list of sentences
    class Sentences(BaseModel):
        sentences: list[str]

    propositioning_llm = llm.with_structured_output(Sentences)
    propositioning_chain = propositioning_prompt | propositioning_llm
    return propositioning_chain.invoke(paragraph).sentences

# `text` holds the climate passage above; process it paragraph by paragraph
props = [prop for para in text.split("\n\n") for prop in create_propositions(para)]
In the above code, we use an LLM prompt to convert the sentences into propositions. You can write your own prompt, but there's an excellent one hosted on the LangChain hub.
We also use a Pydantic model to extract the sentences as structured output. This is the most reliable way to get a clean list of sentences out of the LLM's response.
Finally, we split the text into paragraphs and pass each one to our create_propositions function. Processing paragraph by paragraph keeps every proposition grounded in the context of its own paragraph.
The next step is to create embeddings for our sentences. Embedding converts our text into vectors that preserve its semantic meaning. You can use many embedding models. I use the OpenAI embedding model here.
from langchain_openai import OpenAIEmbeddings

# Embed every proposition; each becomes a vector that captures its meaning
embeddings = OpenAIEmbeddings()
prop_embeddings = embeddings.embed_documents(props)
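If OpenAI isn't an option, any model with an embed_documents-style interface will do. As one possibility, here's a local sentence-transformers model; this assumes the langchain-huggingface package is installed.

from langchain_huggingface import HuggingFaceEmbeddings

local_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
prop_embeddings = local_embeddings.embed_documents(props)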
Now that we have vector embeddings of our propositions, we can cluster them. Once again, there are many clustering techniques, but the first one anyone would try is K-means. It's simple to understand, easy to implement, and fast to execute, and in most cases it's sufficient. For those same reasons, I use K-means here.
The following code creates K-means clusters of our embeddings and builds a list of dictionaries holding each proposition, its embedding, and its cluster assignment.
Python">num_clusters = 3
# Cluster the embeddings and assign a cluster to each proposition
kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(prop_embeddings)
cluster_assignments = kmeans.labels_
# Create a list of dict to store the embeddings, the text, and the cluster assignment
props_clustered = [
{"text": prop, "embeddings": emb, "cluster": cluster}
for prop, emb, cluster in zip(props, prop_embeddings, cluster_assignments)
]
# Display clusters and their propositions
for cluster in range(num_clusters):
print(f"Cluster {cluster}:")
for prop in props_clustered:
if prop["cluster"] == cluster:
print(f" - {prop['text']}")
print()
Here's the output.
Cluster 0:
- Communities are being displaced by the increasing frequency of extreme weather events and rising sea levels.
- These displaced communities are climate refugees.
- Climate refugees are forced to leave their homes behind.
- Climate refugees seek shelter in regions ill-prepared to accommodate them.
Cluster 1:
- The Earth is warming.
- There is mounting evidence supporting the need for climate action.
- The future depends on the choices humanity makes now.
Cluster 2:
- Political leaders remain divided on climate action.
- Some political leaders push for immediate change.
- These political leaders advocate for aggressive policies to curb emissions.
- Other political leaders downplay the severity of the climate crisis.
- These political leaders prioritize short-term economic gains over long-term survival.
Cluster 3:
- The oceans are among the hardest hit ecosystems.
- The warming seas are threatening whale populations.
- The migratory patterns of whale populations are shifting.
- The food sources of whale populations are dwindling.
- Entire species are at risk.
- On land, the effects are just as devastating.
These chunks are very relevant and accurate. The best part is that creating these clusters didn't take much time. If it were an agentic technique, each LLM call would have taken considerable time, even for this short text.
The first cluster discusses the impact of climate change on communities, the third discusses the political perspective, and the last discusses climate change's impact on oceans and whales.
This is impressive for a cheap and fast chunking method.
Let's now create the chunks from these clusters.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

chunk_maker_prompt = PromptTemplate.from_template(
    """
    Summarize the following text into a concise paragraph.
    It should preserve any key information and statistics.

    Text: {text}
    """
)
output_parser = StrOutputParser()
chunk_maker_chain = chunk_maker_prompt | llm | output_parser

# Group the proposition texts by their cluster assignment
clusters = [
    [prop["text"] for prop in props_clustered if prop["cluster"] == cluster]
    for cluster in range(num_clusters)
]

# Summarize each cluster into one standalone chunk
for i, c in enumerate(clusters):
    print(f"Cluster {i}:")
    print(chunk_maker_chain.invoke({"text": "\n".join(c)}))
    print()
The above code uses an LLM to summarize each cluster into an independent chunk. These chunks are meaningful paragraphs that capture as much related information as possible about a given theme, and they are now ready to be vectorized for retrieval.
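As a sketch of that last step, here's one way to index the summarized chunks in a FAISS vector store. This assumes the faiss-cpu package is installed and reuses the embeddings object from earlier.

from langchain_community.vectorstores import FAISS

# Collect each cluster's summary as its final chunk text
chunks = [chunk_maker_chain.invoke({"text": "\n".join(c)}) for c in clusters]

# Index the chunks so a retriever can fetch them by semantic similarity
vector_store = FAISS.from_texts(chunks, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 2})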
Final thoughts
I wanted to find a technique between the more human-like agentic chunking and its next-best alternative, semantic chunking. Semantic chunking can split the text into meaningful segments, but agentic chunking can take it further by picking sentences with similar meanings anywhere in the text.
However, both methods are slow and costly. Agentic chunking calls an LLM for every sentence it sees, while semantic chunking must embed every sentence and compute the distance between each consecutive pair.
The clustering technique, apart from the propositioning step, needs only an embedding model. Hence, it offers a significant cost advantage and reduces latency.
Yet choosing the number of clusters is still a manual task. We can use the Silhouette or Elbow method to arrive at an optimal cluster count, but automating this part is a challenge.
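If you want to experiment, here's a minimal sketch that picks the cluster count with the best silhouette score. The 2-to-9 search range is an arbitrary choice, and the best score doesn't always match the chunking you'd pick by hand.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_cluster_count(embeddings, k_range=range(2, 10)) -> int:
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, random_state=0).fit_predict(embeddings)
        # Silhouette rewards tight, well-separated clusters (higher is better)
        scores[k] = silhouette_score(embeddings, labels)
    return max(scores, key=scores.get)

num_clusters = best_cluster_count(prop_embeddings)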