NLP Illustrated, Part 1: Text Encoding


Welcome back to the corner of the internet where we take complex-sounding machine learning concepts and illustrate our way through them – only to discover they're not that complicated after all!

Today, we're kicking off a new series on Natural Language Processing (NLP). This is exciting because NLP is the backbone of all the fancy Large Language Models (LLMs) we see everywhere – think Claude, GPT, and Llama.

In simple terms, NLP helps machines make sense of human language – whether that means understanding it, analyzing it, or even generating it.


If you've been following along on our Deep Learning journey, we've learned that, at their heart, neural networks operate on a simple principle: they take an input, work their mathematical magic, and spit out an output.

For neural networks to do this, though, both the input and the output must be in a format they understand: numbers.

This rule applies whether we're working with a straightforward model…

…or a highly sophisticated one like GPT.

Now here's where it gets interesting. We interact with models like GPT using text. For instance, we might ask it: "what is the capital of India?" and the model is able to understand this text and provide a response.

asking ChatGPT a question

But wait – didn't we just say that neural networks can't directly work with text and need numbers instead?

we can't input text into a neural network

That's exactly the challenge. We need a way to translate text into numbers so the model can work with it.

we need to convert the text into numbers before inputting it in a neural network

This is where text encoding comes in, and in this article, we'll explore some straightforward methods to handle this text-to-number translation.

One Hot Encoding

One of the simplest ways to encode text is through one-hot encoding.

Let's break it down: imagine we have a dictionary containing 10,000 words. Each word in this dictionary has a unique position.

The first word in our dictionary, "about," is at position 1, and the last word, "zoo," sits at position 10,000. Similarly, every other word has its unique position somewhere in between.

Now, let's say we want to encode the word "dogs". First, we look up its position in the dictionary…

…and find that "dogs" is at the 850th position. To represent it, we create a vector with 10,000 zeros and then set the 850th position to 1 like so:

It's like a light switch: if the word's position matches, the switch is on (1) and if it doesn't, the switch is off (0).
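Here's a minimal sketch of that idea in code, assuming the 10,000-word dictionary and the position of "dogs" from the running example:

import numpy as np

VOCAB_SIZE = 10_000   # size of our hypothetical dictionary
DOGS_POSITION = 850   # position of "dogs" in the dictionary (1-indexed, from the example above)

# One-hot vector: all zeros, with a single 1 at the word's position
dogs_vector = np.zeros(VOCAB_SIZE)
dogs_vector[DOGS_POSITION - 1] = 1   # arrays are 0-indexed, so shift by one

print(dogs_vector.shape)   # (10000,)
print(dogs_vector.sum())   # 1.0 -> exactly one "switch" is on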

Now, suppose we want to encode this sentence:

Along with the word vector of "dogs", we find the word vector of "barks"

…and "loudly":

Then to represent the full sentence, we stack these individual word vectors into a matrix, where each row corresponds to one word's vector:

This forms our sentence matrix. While simple and intuitive, one-hot encoding comes with a big downside: inefficiency.

Each word vector is massive and mostly filled with zeros. For example, with a dictionary of 10,000 words, each vector contains 10,000 elements, with 99.99% of them being zeros. If we expand to a larger dictionary – like the Cambridge English Dictionary, which has around 170,000 words – the inefficiency becomes even more pronounced.

Now imagine encoding a sentence by stacking these 170,000-sized word vectors into a sentence matrix – it quickly becomes huge and difficult to manage. To address these issues, we turn to a more efficient approach: the Bag of Words.

Bag of Words

Bag of Words (BoW) simplifies text representation by creating a single vector for an entire sentence, rather than separate vectors for each word.

Imagine we have these four sentences we want to encode:

Brownie points if you know where this quote is from. And if you don't, let's just pretend this is a normal thing people say.

The first step is to create a dictionary of all the unique words across these four sentences.

BoW dictionary

Each sentence is represented as a vector with a length equal to the number of unique words in our dictionary. And each element in the vector represents a word from the dictionary and is set to the number of times that word appears in the sentence.

For example, if we take the first sentence, "onions have layers," its vector would look like this:

BoW encoding of sentence 1

"onions" appears once, "have" appears once, and "layers" appears once. So, the vector for this sentence would have 1 in those positions.

Similarly, we can encode the remaining sentences:

BoW encoding of all four sentences

Let's encode one last example:

For this sentence, the words "layers" and "have" each appear twice, so their corresponding positions in the vector will have the value 2.

BoW encoding of the sentence

Here's how we can implement BoW in Python:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Onions have layers",
    "Ogres have layers",
    "You get it?",
    "We both have layers"
]

bag_of_words = CountVectorizer()
X = bag_of_words.fit_transform(sentences)

print("BoW dictionary:", bag_of_words.get_feature_names_out())
print("BoW encoding:n", X.toarray())
BoW dictionary: ['both' 'get' 'have' 'it' 'layers' 'ogres' 'onions' 'we' 'you']
BoW encoding:
 [[0 0 1 0 1 0 1 0 0]
 [0 0 1 0 1 1 0 0 0]
 [0 1 0 1 0 0 0 0 1]
 [1 0 1 0 1 0 0 1 0]]

While BoW is simple and effective for counting words, it doesn't capture the order or context of words. For example, consider the word "bark" in these two sentences:

The word "bark" in "dogs bark loudly" versus "the tree's bark" has entirely different meanings. But BoW would treat "bark" the same in both cases, missing the differences in meaning provided by the surrounding words.

Bi-grams

This is where bi-grams come in handy. They help capture more context by looking at adjacent words. Let's illustrate this with these two sentences:

Just like in the BoW approach, we start by creating a dictionary:

However, this time, in addition to individual words, we include word pairs (bi-grams). These bi-grams are formed by looking at directly adjacent words in each sentence.

For example, in the sentence "dogs bark loudly," the bi-grams would be:
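"dogs bark" and "bark loudly"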

And in "the tree's bark", these are the bigrams:

We add these to our dictionary to get our bi-gram dictionary:

bi-gram dictionary

Next, we represent each sentence as a vector. Similar to BoW, each element in this vector corresponds to a word or bi-gram from the dictionary, with the value indicating how many times that word or bi-gram appears in the sentence.

bi-gram encoding

Using bi-grams allows us to retain context by capturing relationships between adjacent words. So, if one sentence contains "tree's bark" and another "dogs bark," these bi-grams will be represented differently, preserving their meanings.

Here's how we can implement bi-grams in Python:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "dogs bark loudly",
    "the tree's bark"
]

bigram = CountVectorizer(ngram_range=(1, 2))  #(1, 2) specifies that we want single words and bigrams
X = bigram.fit_transform(sentences)

print("Bigram dictionary:", bigram.get_feature_names_out())
print("Bigram encoding:n", X.toarray())
Bigram dictionary: ['bark' 'bark loudly' 'dogs' 'dogs bark' 'loudly' 'the' 'the tree' 'tree'
 'tree bark']
Bigram encoding:
 [[1 1 1 1 1 0 0 0 0]
 [1 0 0 0 0 1 1 1 1]]

N-grams

Just as bi-grams group two consecutive words, we can extend this concept to n-grams, where n represents the number of words grouped together. For instance, with n=3 (tri-grams), we would group three consecutive words, such as "dogs bark loudly." Similarly, with n=5, we would group five consecutive words, capturing even more context from the text.

This approach enables us to capture even richer relationships and context in text data, but it also increases the size of the dictionary and computational complexity.
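To try this in code, the only thing that changes from the bi-gram example is the ngram_range argument. Here's a sketch that keeps single words, bi-grams, and tri-grams:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "dogs bark loudly",
    "the tree's bark"
]

# (1, 3) keeps single words, bi-grams, and tri-grams
trigram = CountVectorizer(ngram_range=(1, 3))
X = trigram.fit_transform(sentences)

print("Tri-gram dictionary:", trigram.get_feature_names_out())
print("Tri-gram encoding:\n", X.toarray())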

TF-IDF

While Bag of Words and Bi-grams are effective for counting words and capturing basic context, they don't consider the importance or uniqueness of words in a sentence or across multiple sentences. This is where TF-IDF (Term Frequency-Inverse Document Frequency) comes in. It weighs words based on:

  • Term Frequency (TF): how often a word appears in a sentence
  • Inverse Document Frequency (IDF): how rare or unique a word is across all sentences

This weighting system makes TF-IDF useful for highlighting important words in a sentence while downplaying common ones.

To see this in action, let's apply TF-IDF to our familiar set of four sentences.

Like before, we create a dictionary of unique words across our sentences.

Term Frequency (TF)

To calculate TF of a word, we use the formula:
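TF(word) = (number of times the word appears in the sentence) / (total number of words in the sentence)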

For instance, for the word "onions" in the first sentence…

…the TF is:
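TF("onions") = 1 / 3, since "onions" appears once in a three-word sentence.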

Similarly, let's calculate the TF of "both" in the first sentence:
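"both" doesn't appear in "onions have layers" at all, so its TF is 0 / 3 = 0.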

Using this same logic, we can get the TFs of all the words in the dictionary across all four sentences like so:

TF of all words in each sentence

Note that the TF of a word can vary across different sentences. For example, the word "both" doesn't appear in the first three sentences, so its TF for those sentences is 0. However, in the last sentence, where it appears once out of four total words, its TF is 1/4.

Inverse Document Frequency (IDF)

Next, we calculate the IDF of each word. IDF gives a higher value to words that appear in fewer sentences, emphasizing rare, distinctive words over common ones.
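The standard formula is:

IDF(word) = log(total number of sentences / number of sentences containing the word)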

For example, we see the word "both" appears in only one of the four sentences:

So its IDF is:
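With four sentences and "both" appearing in just one of them, that works out to IDF("both") = log(4 / 1) = log 4.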

Similarly, we can get IDF for the rest of the words in the dictionary:

IDF of all words in the dictionary

Here, the word "both" appears only in sentence 4, giving it a higher IDF score compared to common words like "have," which appears in multiple sentences.

Unlike TF, the IDF of a word remains consistent across all sentences.

TF-IDF

The final TF-IDF score for a word is the product of its TF and IDF:
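TF-IDF(word, sentence) = TF(word, sentence) × IDF(word)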

This results in sentence vectors where each word's score reflects both its importance within the sentence (TF) and its uniqueness across all sentences (IDF).

Plugging the TF and IDF terms into our formula, we get our final TF-IDF sentence vectors:

TF-IDF encodings of all sentences

Here's how we calculate TF-IDF in Python:

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Onions have layers",
    "Ogres have layers",
    "You get it?",
    "We both have layers"
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(sentences)

print("TF-IDF dictionary:", tfidf.get_feature_names_out())
print("TF-IDF encoding:n", X.toarray())

Note: The Python results might differ slightly from manual calculations because of:

1. L2 Normalization: Scikit-learn's TfidfVectorizer normalizes each sentence vector to unit length by default.
2. Adjusted IDF Formula: Scikit-learn uses a smoothed IDF (it adds one to the document counts and to the final IDF value), so the weights differ slightly from the plain log formula and words that appear in every sentence don't end up with a weight of exactly zero.

Read more about this here.
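If you'd like to reproduce the hand-calculated (unsmoothed, unnormalized) values instead, here's a minimal sketch; the tf and idf helpers are our own and simply implement the plain formulas above:

import math

# The same four sentences, already lowercased and stripped of punctuation
# so that a simple split() gives us the words
sentences = [
    "onions have layers",
    "ogres have layers",
    "you get it",
    "we both have layers"
]

docs = [sentence.split() for sentence in sentences]
vocab = sorted(set(word for doc in docs for word in doc))

def tf(word, doc):
    # Term frequency: count of the word divided by the sentence length
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Plain IDF: log(total sentences / sentences containing the word)
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

print(vocab)
for doc in docs:
    print([round(tf(word, doc) * idf(word, docs), 3) for word in vocab])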


While the methods we've discussed are essential building blocks in NLP, they come with significant limitations.

1 – these methods lack semantic understanding. They fail to grasp the meaning of words or to recognize relationships between synonyms like "fast" and "quick." While bi-grams can provide some local context, they still miss deeper connections and subtle nuances in meaning.

2 – these approaches rely on rigid representations, treating words as isolated entities. For example, we intuitively understand that "king" and "queen" are related, but these methods represent "king" and "queen" as being just as unrelated as "king" and "apple," completely ignoring their similarities.

3 – they face scalability challenges. They depend on sparse, high-dimensional vectors, which grow more unwieldy and inefficient as the dictionary size increases.

What if we could represent words in a way that captures their meanings, similarities, and relationships? That's exactly what word embeddings aim to do. Word embeddings revolutionize text encoding by creating dense, meaningful vectors that retain both context and semantic relationships.

In the next article, NLP Illustrated, Part 2: Word Embeddings, we'll explore how these embeddings go beyond basic word counts to capture the complex, nuanced relationships between words!


Connect with me on LinkedIn or shoot me an email at [email protected] if you have any questions/comments!

NOTE: All illustrations are by the author unless specified otherwise
