Text Tiling Done Right: Building Solid Foundations For Your Personal LLM

It seems like everybody wants their own Large Language Model (LLM) these days, tuned to work on their private collection of documents. Privacy plays a big role here, amplifying the demand for private GPT-style models. However, the journey to crafting a personal chatbot isn't straightforward, and you essentially have two main options.
Firstly, you could build a custom question-answer dataset from scratch and use it to fine-tune your LLM. But let's face it: this isn't feasible for most of us, given the high cost and the significant time commitment it requires. The alternative, more affordable approach is to generate context on the fly, retrieving relevant sections of your documents based on the user's query with the help of embeddings. While there's no shortage of tutorials explaining how to do this, few highlight the critical importance of appropriately chunking, or 'tiling', your documents.
Here's why it's essential: if your document tiles aren't well cut, your context could be off, leading your LLM to provide answers that either completely miss the point or, worse, state false information, a phenomenon known as 'hallucination' in machine learning. This is where the art of text tiling comes into play: breaking a document down into coherent, meaningful chunks that enable precise, relevant context retrieval. Done well, it improves the overall performance of your LLM, making it better at understanding queries and providing accurate responses.
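To make the retrieval step concrete, here's a minimal, self-contained sketch of retrieving chunks by similarity to the query. It stands in a toy bag-of-words 'embedding' for a real sentence-embedding model, and all the names here (`embed`, `cosine`, `retrieve`) and the sample chunks are my own illustrations, not from any particular library:

```python
import re
from math import sqrt

def embed(text):
    """Toy bag-of-words 'embedding': a word-count dictionary.
    A real pipeline would call a sentence-embedding model here."""
    counts = {}
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        counts[word] = counts.get(word, 0) + 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

chunks = [
    "The billing API returns an invoice object with line items.",
    "Our office dog enjoys long walks in the park.",
    "Authentication uses an API key passed in the request header.",
]
print(retrieve("how do I pass the api key for authentication", chunks)[0])
```

Note how the quality of the answer depends entirely on how `chunks` was cut: if the authentication sentence had been split mid-thought across two tiles, neither tile would match the query well.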
Now, it may surprise you (it certainly surprised me) that in the Python world there aren't many options for text tiling. The primary tool available is nltk.tokenize.texttiling, which isn't even particularly well documented. Seeing this gap and the potential for improvement, I decided to develop my own text-tiling model, leveraging Natural Language Processing (NLP) and transformers.
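To give a feel for what that algorithm does under the hood, here is a simplified, pure-Python sketch of the core TextTiling idea: compare the vocabulary of a small window of sentences on each side of every gap, and treat the gap with the lowest similarity as a likely topic boundary. The function names are my own, and this is only a sketch; the real nltk implementation also smooths the score curve and computes 'depth' scores at each valley:

```python
import re
from math import sqrt

def bow(tokens):
    """Bag-of-words counts for a list of tokens."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gap_scores(sentences, window=2):
    """Score each gap between sentences by comparing the `window`
    sentences on either side; a low score hints at a topic shift."""
    tokens = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    scores = []
    for i in range(1, len(sentences)):
        left = bow(sum(tokens[max(0, i - window):i], []))
        right = bow(sum(tokens[i:i + window], []))
        scores.append(cosine(left, right))
    return scores

sentences = [
    "Cats are small domesticated felines.",
    "A cat spends much of the day sleeping.",
    "Cats communicate by purring and meowing.",
    "Python is a popular programming language.",
    "Python code is easy to read.",
    "Many data tools are written in Python.",
]
scores = gap_scores(sentences)
boundary = scores.index(min(scores)) + 1  # index of the sentence starting a new topic
print(boundary)  # 3 -- the cut lands exactly where cats give way to Python
```

A transformer-based model replaces the brittle word-overlap comparison with semantic embeddings, which is precisely the improvement this article is after.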
Evaluation Mechanism for Text Tiling Models
Whenever I embark on developing a new model, I try to begin with the end in mind and work backwards from there. In this case, our "end" is assessing the model's output. Without a means of evaluation, we can't measure performance, and without measurement we can't improve. For this reason, creating an evaluation mechanism is essential before even attempting to develop a model. Evaluating text tiling, however, poses unique challenges, because it depends on the topics found within the document(s) at hand. This presents us with two major hurdles:
- We don't have a dataset of documents along with their corresponding tiling.
- Even if we had such a dataset, it would be exceptionally difficult to utilize, given that partitioning a document by topic is highly subjective.
To navigate these issues, we'll adopt a straightforward approach: create a synthetic document by concatenating several diverse documents. Since we know exactly where one source document ends and the next begins, we know the thresholds our model should identify. In this article I'll use a single composite document as an example (which can be found here), but the same methodology could be applied to assemble a multitude of documents for a comprehensive model test. The composite document is crafted from the concatenation of the following Medium articles (to their respective authors, consider this a bit of free promotion, you can thank me later
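The synthetic-document idea can be sketched in a few lines. This is a toy illustration under my own naming, assuming we work at the sentence level; the scoring helper is just one possible metric (a tolerance-based hit rate on the known thresholds), not a standard segmentation metric like Pk or WindowDiff:

```python
def build_synthetic(docs):
    """Concatenate documents (each a list of sentences) into one text,
    recording the sentence indices where each new source document begins."""
    sentences, boundaries = [], []
    for doc in docs:
        if sentences:
            boundaries.append(len(sentences))
        sentences.extend(doc)
    return sentences, boundaries

def boundary_score(predicted, actual, tolerance=1):
    """Fraction of true boundaries matched within `tolerance` sentences."""
    hits = sum(any(abs(p - a) <= tolerance for p in predicted) for a in actual)
    return hits / len(actual) if actual else 1.0

doc_a = ["Cats sleep a lot.", "Cats purr when content."]
doc_b = ["Python is widely used.", "Python has clear syntax."]
doc_c = ["Coffee is brewed from beans.", "Espresso is concentrated coffee."]

sentences, truth = build_synthetic([doc_a, doc_b, doc_c])
print(truth)                          # [2, 4] -- the known thresholds
print(boundary_score([2, 5], truth))  # 1.0 with the default tolerance
```

Because the ground-truth thresholds come for free from the concatenation itself, this sidesteps both hurdles above: no hand-labeled dataset is needed, and no subjective judgment about where one topic ends and another begins.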