What to Study if you Want to Master LLMs
Most of the code we use to interact with LLMs (Large Language Models) is hidden behind several APIs – and that's a good thing.
But if you are like me and want to understand the ins and outs of these magical models, there's still hope for you. Currently, apart from the researchers developing and training new LLMs, there are mostly two types of people playing with these models:
- Users, who interact via applications such as ChatGPT or Gemini.
- Data scientists and developers, who work with libraries such as LangChain, LlamaIndex, or the Gemini and OpenAI APIs, which simplify the process of building on top of these models.
The problem is – and you may have felt it – that there is fundamental knowledge in text mining and natural language processing that is completely hidden away in consumer products and APIs. And don't get me wrong – they are great for developing cool use cases around these technologies. But if you want to have a deeper knowledge to build complex use cases or manipulate LLMs a bit better, you'll need to revisit the fundamentals – particularly when the models don't behave as you expect.
In this blog post, I'm going to detail some of the most fundamental concepts you should study if you want to master large language models on a technical level!
Let's start!
Basic NLP / NLTK
Basic natural language processing (NLP) is the first concept you should learn. Working with traditional NLP pipelines is a great lesson in how much computers struggle to understand written text naturally. With NLTK (Natural Language Toolkit), you can get your first contact with manipulating text in a machine learning context.
Poking around the NLTK library is awesome. It is one of the first open source Python libraries to specialize in text mining, and it contains the most basic techniques you need to develop some nice prototypes, such as tokenization, stemming, lemmatization, part-of-speech tagging and named entity recognition.
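As a quick taste, here's a minimal sketch of a first contact with NLTK; the sentence is made up, and the exact resource names you need to download can vary slightly between NLTK versions:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the resources used below (names may differ slightly by NLTK version)
for resource in ["punkt", "averaged_perceptron_tagger", "wordnet"]:
    nltk.download(resource)

text = "The cats were chasing mice across the old wooden floor."

tokens = word_tokenize(text)                                  # tokenization
stems = [PorterStemmer().stem(t) for t in tokens]             # stemming
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]   # lemmatization
pos_tags = nltk.pos_tag(tokens)                               # part-of-speech tagging

print(tokens)
print(stems)
print(lemmas)
print(pos_tags)
```

Comparing the stems with the lemmas on even a toy sentence like this one is a nice first look at how rough these classical techniques can be.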
With its extensive documentation and community, NLTK is a great way to start getting your hands dirty with Natural Language Processing.
Word2Vec
By working with NLTK, you will understand that advanced AI use cases are impossible to build with classical machine learning alone. While you can create basic sentiment analysis or text generation pipelines, their performance will fall significantly short as you add more complexity.
So... how did we get here, to the stage where we have generalist tools that can crush the Turing test?
One of the most famous papers that revolutionized the NLP industry was the word2vec paper. Although some research was already rolling before it, this paper brought word vectors to mainstream attention. It was an absolute game changer.
After word2vec, humans found a way to represent words mathematically, while keeping two important features:
- The vectors represent words according to their meaning, not according to how the words are written.
- The vectors have a fixed length that does not depend on the size of the vocabulary.
How are these vectors built? Mostly by training a neural network to predict a word within its context. The weights the network learns for a specific word become its vector, and these vectors capture relationships such as the classic vec("king") - vec("man") + vec("woman") ≈ vec("queen"), illustrated below:
Note: 2-dimensional simplification of word vectors
The fact that we can build mathematical representations of words that preserve the semantic relationships of language was a much-needed breakthrough in the NLP field. Word vectors are one method of word embeddings, one of the building blocks of Large Language Models.
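To get a feel for word vectors in practice, here's a minimal sketch using the gensim library (my choice for the example, not the only option) on a toy corpus; real training uses millions of sentences, or you can simply load pre-trained vectors:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more text)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# vector_size fixes the length of every word vector, regardless of vocabulary size
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=200)

print(model.wv["king"].shape)                  # (50,) - same length for every word
print(model.wv.most_similar("king", topn=3))   # words whose vectors are closest
```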
Text Classification
The next stop is combining embeddings and simple machine learning pipelines to understand how we can transform text into features for machine learning models.
In text classification, you normally start with logistic regression, Naive Bayes classifiers or tree-based models. This is where you should experiment with different tokenizers, pre-processing methods and embeddings – and where you will see performance increase or decrease based on your choices.
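As a starting point, here's a minimal sketch of such a pipeline using scikit-learn (one common choice, not the only one), with a TF-IDF vectorizer and logistic regression on a tiny made-up spam dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset, just for illustration
texts = [
    "win a free prize now",
    "meeting at 10am tomorrow",
    "cheap pills free offer",
    "project report attached",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Swap the vectorizer or the classifier to see how your choices affect performance
pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
pipeline.fit(texts, labels)

print(pipeline.predict(["free prize offer"]))        # likely [1] (spam)
print(pipeline.predict(["see you at the meeting"]))  # likely [0] (not spam)
```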
Some of the most common text classification projects you can experiment with are:
- E-mail spam classification: detecting whether an e-mail is spam or not;
- Sentiment analysis: checking the polarity of a specific piece of text;
- Topic categorization: detecting the topics of different documents;
- Language detection: checking the language of a specific text.
Check out some of the following competitions on Kaggle:
- Sentiment analysis: https://www.kaggle.com/competitions/sentiment-analysis-on-movie-reviews
- Disaster tweet categorization: https://www.kaggle.com/competitions/nlp-getting-started
Text Generation
Another area to explore is text generation. This is a very important part of Large Language Models, particularly as most of the applications we use today rely on some level of next-word prediction.
When it comes to text generation, there are two main routes to study:
- Traditional natural language processing, using words as-is and building systems that rely on conditional probabilities.
- Using Neural Networks, leveraging embeddings or using models such as recurrent neural networks.
Markov chains are a great way to learn the intuition behind text generation. Although the text they produce tends to fall into repetitive patterns, they are a great place to start.
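Here's a minimal sketch of the idea, a first-order Markov chain built with plain Python on a toy corpus; each next word is sampled from the words that followed the current word in the training text:

```python
import random
from collections import defaultdict

# Toy corpus; a real example would use a much larger body of text
corpus = "the cat sat on the mat and the cat slept on the sofa".split()

# Record, for every word, which words followed it
transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

# Generate text by repeatedly sampling a follower of the current word
word = "the"
generated = [word]
for _ in range(10):
    followers = transitions.get(word)
    if not followers:
        break
    word = random.choice(followers)
    generated.append(word)

print(" ".join(generated))
```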
Naturally, as time goes by, you'll want to learn about recurrent neural networks and embedding-based methods, which will considerably improve the coherence and quality of your generated text.

Attention and Transformers
Finally, only after studying the foundational concepts I've laid out here should you tackle the attention mechanism.
The attention paper ("Attention Is All You Need") came out in 2017 and completely transformed the NLP field. If it weren't for the attention mechanism, the applications we see today would be completely impossible.
The attention mechanism also relies on a deep knowledge of neural networks, so, while you study them, it will be very helpful to understand how this mechanism fits into overall neural network theory.
Based on the attention mechanism, the Transformer was born. It dethroned recurrent neural networks as the standard for text generation and understanding, and helped consolidate attention as the standard mechanism for NLP applications. Learning attention and Transformers may be challenging, but it will be much easier if you are armed with the basic NLP knowledge first.
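To make the core idea concrete, here's a minimal NumPy sketch of scaled dot-product attention, the operation at the heart of the Transformer: each position builds its output as a weighted sum of value vectors, with the weights given by softmax(QK^T / sqrt(d_k)).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V - the core operation of the Transformer."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # query/key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted sum of values

# Toy example: 3 token positions with vectors of size 4 (random, just for the shapes)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```

In a real Transformer, Q, K and V are linear projections of the token embeddings and several attention heads run in parallel, but the operation above is the kernel everything else is built around.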