NLP Illustrated, Part 2: Word Embeddings

Welcome to Part 2 of our NLP series. If you caught Part 1, you'll remember that the challenge we're tackling is translating text into numbers so that we can feed it into our machine learning models or neural networks.

NLP Illustrated, Part 1: Text Encoding

Previously, we explored some basic (and pretty naive) approaches to this, like Bag of Words and TF-IDF. While these methods get the job done, we also saw their limitations – mainly that they don't capture the deeper meaning of words or the relationships between them.

This is where word embeddings come in. They offer a smarter way to represent text as numbers, capturing not just the words themselves but also their meaning and context.

Let's break it down with a simple analogy that'll make this concept super intuitive.

Imagine we want to represent movies as numbers. Take the movie [Knives Out](https://www.imdb.com/title/tt8946378/?ref=tt_mvclose) as an example.

source: Wikipedia

We can represent a movie numerically by scoring it across different features, such as genres – Mystery, Action, and Romance. Each genre gets a score between -1 and 1 where: -1 means the movie doesn't fit the genre at all, 0 means it somewhat fits, and 1 means it's a perfect match.

So let's start scoring Knives Out! For Romance, it scores -0.6 – there's a faint hint of romance, but it's subtle and not a big part of the movie, so it gets a low score.

Moving on to Mystery, it's a strong 1 since the entire movie revolves around solving a whodunit. And for Action, the movie scores 0.3. While there is a brief burst of action toward the climax, it's minimal and not a focal point.

This gives us three numbers that attempt to encapsulate Knives Out based on these features: Romance, Mystery, and Action.

Now let's try visualizing this.

If we plot Knives Out on just the Romance scale, it would be a single point at -0.6 on the x-axis:

Now let's add a second dimension for Mystery. We'll plot it on a 2D plane, with Romance (-0.6) on the x-axis and Mystery (1.0) on the y-axis.

Finally, let's add Action as a third dimension. It's harder to visualize, but we can imagine a 3D space where the z-axis represents Action (0.3):

This vector (-0.6, 1, 0.3) is what we call a movie embedding of Knives Out.
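If it helps to see this in code, here's a minimal sketch of the same idea – the feature order (Romance, Mystery, Action) is just a convention chosen for this example:

```python
import numpy as np

# Knives Out scored on our three hand-picked features,
# in the order: Romance, Mystery, Action
knives_out = np.array([-0.6, 1.0, 0.3])

print(knives_out.shape)  # (3,) – a 3-dimensional "movie embedding"
```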

Now let's take another movie as an example: [Love Actually](https://www.imdb.com/title/tt0314331/?ref=tt_mvclose).

source: Wikipedia

Using the same three features – Romance, Mystery, and Action – we can create a movie embedding for it like so:

And we can plot this on our movie embeddings chart to see how Love Actually compares to Knives Out.

From the graph, it's obvious that Knives Out and Love Actually are super different. But what if we want to back this observation with some numbers? Is there a way to math-ify this intuition?

Luckily, there is!

Enter cosine similarity. When working with vectors, a common way to measure how similar two of them are is cosine similarity. Simply put, it measures the cosine of the angle between the two vectors. The formula looks like this:
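$$\text{cosine similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$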

Here, A and B are the two vectors we're comparing. A⋅B is the dot product of the two vectors, and ∥A∥ and ∥B∥ are their magnitudes (lengths).

Cosine similarity gives a result between -1 and 1, where:

  • 1 means the vectors point in the same direction (maximum similarity)
  • 0 means they are perpendicular (no relationship)
  • -1 means they point in opposite directions (maximum dissimilarity)

From the graph, we've already observed that Knives Out and Love Actually seem very different. Now, let's quantify that difference using cosine similarity. Here, vector A represents the embedding for Knives Out and vector B represents the embedding for Love Actually.

Plugging the values into the cosine similarity formula, we get:

And this result of -0.886 (very close to -1) confirms that Knives Out and Love Actually are highly dissimilar. Pretty cool!
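If you'd like to check this kind of number yourself, here's a minimal NumPy sketch. The Love Actually values below are illustrative stand-ins for the embedding plotted above, so the result won't match -0.886 exactly, but it lands in the same near-minus-one neighborhood:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: (A·B) / (‖A‖‖B‖)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Feature order: Romance, Mystery, Action
knives_out    = np.array([-0.6, 1.0, 0.3])
love_actually = np.array([0.9, -0.6, -0.1])  # hypothetical stand-in values

print(round(cosine_similarity(knives_out, love_actually), 3))  # -0.894 – very close to -1
```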

Let's test this further by comparing two movies that are extremely similar. The closest match to Knives Out is likely its sequel, The Glass Onion.

source: Wikipedia

Here's the movie embedding for The Glass Onion:

The embedding is slightly different from Knives Out. The Glass Onion scores a little higher in the Action category than its predecessor, reflecting the increased action sequences of the sequel.

Now, let's calculate the cosine similarity between the two movie embeddings:

And voilà – almost a perfect match! This tells us that Knives Out and The Glass Onion are extremely similar, just as we expected.
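Reusing the `cosine_similarity` sketch from earlier, with a hypothetical embedding for The Glass Onion that keeps Romance and Mystery the same and nudges Action up slightly:

```python
# Hypothetical embedding: same Romance and Mystery, slightly more Action
glass_onion = np.array([-0.6, 1.0, 0.4])

print(round(cosine_similarity(knives_out, glass_onion), 3))  # 0.997 – nearly identical
```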

This movie embedding is a great start, but it's far from perfect because we know movies are much more complex than just three features.

But we could make the embedding better by expanding the feature set, capturing significantly more nuance and detail about each film. For example, along with Romance, Mystery, and Action, we could include genres like Comedy, Thriller, Drama, and Horror, or even hybrids like RomCom.

Beyond genres, we could include additional data points like Rotten Tomatoes Score, IMDb Ratings, Director, Lead Actors, or metadata such as the film's release year or popularity over time. The possibilities are endless, giving us the flexibility to design features as detailed and nuanced as needed to truly represent a movie's essence.
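As a toy sketch (every feature name and score here is made up purely for illustration), a richer embedding is just a longer vector – one number per feature:

```python
# A made-up, richer movie embedding: more features simply mean more dimensions
knives_out_features = {
    "romance": -0.6, "mystery": 1.0, "action": 0.3,
    "comedy": 0.6, "thriller": 0.7, "drama": 0.2, "horror": -0.8,
    "critic_score": 0.9,   # e.g. a review-aggregator score rescaled to [0, 1]
    "recency": 0.9,        # e.g. release year rescaled to [0, 1]
}

knives_out_embedding = list(knives_out_features.values())  # now a 9-dimensional vector
```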


Let's switch gears and see how this concept applies to word embeddings. With movies, we at least had a sense of what our features could be – genres, ratings, directors, and so on. But when it comes to all words, the possibilities are so vast and abstract that it's virtually impossible for us to define these features manually.

Instead, we rely on our trusted friends – the machines – to figure it out for us. We'll dive deeper into how machines create these embeddings soon, but for now, let's focus on understanding and visualizing the concept.

Each word can be represented as a set of numerical values (aka a vector) across several hidden features or dimensions. These features capture patterns such as semantic relationships, contextual usage, or other language characteristics learned by machines. These vectors are what we call word embeddings.
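We'll get to how these embeddings are actually learned later in the series. But if you want to peek at real word embeddings right away, one low-effort option (assuming you have the gensim library installed and an internet connection for the one-time download) is to load a small set of pre-trained GloVe vectors:

```python
import gensim.downloader as api

# One-time download (~66 MB) of 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-50")

print(glove["movie"].shape)                  # (50,) – each word is a 50-dimensional vector
print(glove.most_similar("movie", topn=3))   # nearest words, ranked by cosine similarity
```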
