Working with Embeddings: Closed versus Open Source
Working with Retrieval

Embeddings are a cornerstone of natural language processing. You can do quite a lot with embeddings, but one of the more popular uses is semantic search used in retrieval applications.
Although the entire tech community is abuzz with understanding how knowledge graph retrieval pipelines work, using standard vector retrieval isn't out of style.
You'll find multiple articles showing you how to filter out irrelevant results from semantic searches, something we'll also be focusing on here using techniques such as clustering and re-ranking.
The main focus of this article, though, is to compare **open source and closed source embedding models** of various sizes.

We will compare up to 9 different embedding models that are high on the MTEB leaderboard. This will give you an idea of how a large versus a small model can perform and what the costs would be as you scale.
If you've ever used OpenAI's models to generate embeddings, you've probably been curious to see if they are competitive enough.
Just as a quick recap on embeddings, if they're new to you: when we create an embedding – an array of numbers, i.e., a vector – for each text, we translate that text into something a computer can understand.

Specifically for semantic search, we compare the embeddings for different texts to see how much semantic similarity there is between them. This allows us to use a kind of fuzzy search with a query – i.e., searching for relationships – rather than an exact keyword match.

I will go through embeddings and how they work in the introduction section, especially focusing on how we calculate semantic similarity.
I always build around a custom use case when I write, and this time is no different. The idea came from a consultancy owner who asked whether he could create an application that matches job descriptions to LinkedIn profiles.
If we were doing this for real, we would be using millions of user profiles, but for this piece, I have created 6,900 synthetic LinkedIn profiles.
You can find the dataset here.
This is generally an easy use case as we're not chunking up documents in a large file; the profiles will fit into one chunk each. The domain is not a difficult one, as it is easy for us to understand if a model is finding the right relationships.
But it will give you an idea of how to think about solving a similar problem.
If you want to go straight to experimenting with the different models with the LinkedIn dataset, you can scroll past the introduction.
Introduction
For this article, as we have so few profiles, we can use clustering as a kind of unsupervised classification method for the entire dataset.
See an illustration of what clustering looks like below.

Clustering will also let us understand how the different models perceive connected relationships.
Depending on the model, we can then isolate the correct group before performing semantic search within the cluster.

This should allow us to filter out irrelevant results, such as the model confusing product managers with product marketing managers.
You can add re-ranking with an LLM as a last step to make sure the top results wind up on top.
To make things very easy and less pricey, I have already added the embeddings for each model we'll be evaluating to this dataset. I have also created embeddings for our queries, i.e., our anonymous job descriptions.
Remember, if you want to go straight to experimenting, you can scroll down to the use case, although don't skip the economics part.
Embeddings
I mentioned that embeddings are numerical representations of texts that capture their meaning, allowing computers to process and understand natural language.

With the more modern transformer models, a model can take the entire context into account and thus understand the several meanings words and sentences can have – something that just wasn't true a few years ago.
We can actually visualize embeddings on a graph by representing them as points in geometric space. Semantic relationships between embeddings thus translate into geometric closeness.

Different models are built for different tasks, but most of the larger ones are generalist enough to perform various tasks, such as retrieval, clustering, and classification.
Semantic search, used in retrieval, uses this closeness on the graph to figure out where a query would match with the other embeddings, i.e., it computes the distance of the embeddings on the graph.
To calculate this similarity between embeddings in semantic search, several methods are used, but cosine similarity is the most popular.
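As a quick illustration, here is what cosine similarity looks like on two made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, but the calculation is the same):
import numpy as np

# Two made-up "embeddings" for two similar texts (real embeddings are much longer)
a = np.array([0.2, 0.9, 0.1])
b = np.array([0.25, 0.8, 0.05])

# Cosine similarity: dot product divided by the product of the vector lengths
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # ~0.99 - values close to 1 mean the vectors point in a similar direction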

The model you use will directly affect the results you get from semantic search, and this comes down to how it has been trained.
The datasets, objectives, and architectures a model is trained with all influence how well it understands and links various texts.
Clustering, on the other hand, organizes data into groups (or clusters) where items are more similar to each other than to those in other groups. It is better at identifying and matching similarities between embeddings, allowing us to effectively isolate the group.

This process allows us to first filter out any irrelevant matches before performing semantic search and thus acts as a noise reduction tactic.
This is the idea, at least.
Not all models will be able to use clustering in the way we need them to; some will be better at it and some worse based on how they have been built.
Embedding Models
So, how do you know which model to pick? In comes the MTEB leaderboard that ranks embedding models based on their performance across various tasks.
I have picked out a few of these models to test: the more popular models from OpenAI, a fine-tuned Mistral-7B to compare against, and smaller, newer models such as mxbai from Mixedbread AI.
They have all been released in the last two years, more or less.

If you're new to open source models, you may be surprised to see that many open source models rank quite highly. If you're not new to trying these models, it may still be interesting to see which one did best for this task.
Look at the table below to see the size, max tokens, and the ranking for retrieval and clustering for each model.

Many have used Ada-002 from OpenAI, which sits at the bottom of our list compared to all the other models. OpenAI has since released text-embedding-3 in both small and large sizes, which performs better and is cheaper as well.
So, you may ask yourself, why would someone use a commercial model when they can just use an open-source model that's high on the leaderboard?
Economics of Open Source Models
Using an open source model certainly sounds good and is the preferred choice for privacy. Many are high on the leaderboard, but you do need to consider the economics of hosting a model yourself versus paying for an API.
I looked at the cost of hosting both smaller (around 350M) and larger (7B) open source models on a GPU versus paying per token for a few popular commercial models.

The assumption here is that each text is around 400 tokens. Under that assumption, a 335M model can process roughly 75–90 texts per second on a single L4 GPU, and a 7B model around 30–40 texts per second on a single A100, possibly more.
As you'll observe, using a model like text-embedding-3-large or ada-002 will really add up once you start to embed millions of texts. This is not including storage costs.
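To make the arithmetic concrete, here is a rough back-of-envelope sketch for 2.5 million texts. The throughput, GPU hourly rate, and API price per token below are assumptions for illustration only – check current pricing before relying on these numbers.
# Back-of-envelope cost comparison - all prices and throughputs are assumptions
num_texts = 2_500_000
tokens_per_text = 400

# Self-hosted ~335M model on a single L4 (assumed ~80 texts/sec, ~$0.80/hour)
texts_per_second = 80
l4_price_per_hour = 0.80
gpu_hours = num_texts / texts_per_second / 3600
gpu_cost = gpu_hours * l4_price_per_hour

# Commercial API priced per token (assumed ~$0.13 per 1M tokens for a large model)
api_price_per_million_tokens = 0.13
api_cost = num_texts * tokens_per_text / 1_000_000 * api_price_per_million_tokens

print(f"Self-hosted: ~{gpu_hours:.0f} GPU hours, ~${gpu_cost:.0f}")
print(f"API: ~${api_cost:.0f}")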
If you're an enterprise client and you're looking into Nvidia's embedding models, such as nv-embed-v1, they offer quite a good API that you can tap into. I've used it to test a few of these models.
Using a small model, though – one with fewer than 500M parameters – is certainly the most sound choice. If you can go with a smaller open source model, you should do so, as you can slash your compute costs by up to 90%.
I also went ahead and calculated the processing times for smaller and larger models, if you were to host them on a single GPU.

Calling an API will also take time, and they have inference limits, so regardless of what choice you make, you'll have to consider the amount of time it takes to fully embed an entire dataset.
For the open source models, you can always add more GPUs to speed up processing, but this gives you an idea of how a smaller model can be more energy-efficient.
It's always good to test a few models to see which performs well for your task, which is what we will do in a bit.
Quantization
As you saw above, using larger models, such as those around 7B, is still quite expensive and energy-intensive. We only calculated costs for 2.5 million embeddings, but once you start to scale further, it may be worthwhile to look into quantization.
Quantization compresses a model by using fewer bits to represent its parameters, which decreases the model's size. The idea is that quantization techniques such as 4-bit and 8-bit quantization let you run larger models on hardware that normally couldn't handle them.
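As a rough sketch of what this looks like in practice – assuming you're using the transformers and bitsandbytes libraries, and with the model name below only as an example – loading a 7B embedding model in 4-bit could look something like this:
# Minimal 4-bit loading sketch (assumes transformers, accelerate and bitsandbytes are installed)
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "intfloat/e5-mistral-7b-instruct"  # example 7B embedding model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # places the quantized weights on the available GPU(s)
)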
There have been a few people who have tried to measure the performance decrease of various metrics on quantized models; I think the last one I saw argued for a 12% overall drop in performance.
I will have to write a bit about this in the future, I love to look at the economics and performance cuts of these things.
The Use Case
I don't know about you, but I like to test different models rather than just look at metrics. This gives me a sense of how smaller and larger models can interpret relationships between texts.
You can find the dataset with the synthetic LinkedIn profiles here, along with the dataset of job descriptions that should be matched against them.
You can find the Colab notebook we will work in here.
Importing the Data
You need to open the notebook to follow along, but once you have done so, you should see that we're importing two datasets from Hugging Face.
from datasets import load_dataset

# Synthetic LinkedIn profiles with the embeddings
dataset = load_dataset("ilsilfverskiold/linkedin_profiles_synthetic")
profiles = dataset['train']
# Anonymous job descriptions with embeddings
dataset = load_dataset("ilsilfverskiold/linkedin_recruitment_questions_embedded")
applications = dataset['train']
These two datasets will allow us to compare the different embedding models for the 6,900 LinkedIn profiles that have been generated.
The synthetic data is, well, synthetic, so take it with a grain of salt. It was created with Llama 3.1, and it does suffer from heavy alignment-style phrasing, describing profiles with words such as 'results-driven,' 'seasoned,' and 'dedicated.'
The embeddings have already been added, which you'll see if you look into the ‘profiles.'
# profiles dataset
Dataset({
features: [...,'embeddings_nv-embed-v1', 'embeddings_nv-embedqa-e5-v5', 'embeddings_bge-m3', 'embeddings_arctic-embed-l', 'embeddings_mistral-7b-v2', 'embeddings_gte-large-en-v1.5', 'embeddings_text-embedding-ada-002', 'embeddings_text-embedding-3-small', 'embeddings_voyage-3', 'embeddings_mxbai-embed-large-v1 '],
num_rows: 6904
})
_P.S. `embeddings_gte-large-en-v1.5` does not work. I tried to host the model but failed to generate all the embeddings for it, so do not use it._
From here, you need to decide on the job description you are interested in matching to the profiles.
Look at the code below; I have picked the second application, but you can set another number.
application = applications[1] # deciding on the second application - a product marketing manager position
application_text = application['natural_language']
print("application we're looking for: ",application_text)
Check the dataset directly on Hugging Face if that is easier.

From here, you can decide which embedding model you'd like to work with. I have already tested most of them, so I'll use `embeddings_mxbai-embed-large-v1` for this run.
This is the 335M open source model that ranked quite high on the leaderboard for both retrieval and clustering – scroll up to the earlier table to see where it sits.
If you want to try a different model, you simply set another one. Look into the dataset mentioned above to see which ones you have access to.
import numpy as np

# Get the query embeddings for an embedding model - here we're picking mxbai-embed-large-v1
query_embedding_vector = np.array(application['embeddings_mxbai-embed-large-v1'])
embeddings_list = [np.array(emb) for emb in profiles['embeddings_mxbai-embed-large-v1 ']] # note the extra space
texts = profiles['text']
Semantic Search
We can try to perform semantic search before we cluster; this lets us see how it does on its own before adding anything else.
To calculate the semantic similarity between the profiles and our query – the job application – we run the code below.
# Let's first try to calculate the cosine similarity (without clustering)
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = []
for idx, emb in enumerate(embeddings_list):
    sim = cosine_similarity(query_embedding_vector, emb)
    similarities.append(sim)
Then, we can display the similarity scores sorted with the highest on top, limiting the display to the first 30 results.
results = list(zip(range(1, len(texts) + 1), similarities, texts))
sorted_results = sorted(results, key=lambda x: x[1], reverse=True)
# Let's display the results as well
print("nSimilarity Results (sorted from highest to lowest):")
for idx, sim, text in sorted_results[:30]: # adjust if you want to show more
percentage = (sim + 1) / 2 * 100
text_preview = ' '.join(text.split()[:10])
print(f"Text {idx} similarity: {percentage:.2f}% - Preview: {text_preview}...")
The results will look like something below, but it depends on the application you chose.
Similarity Results (sorted from highest to lowest):
Text 3615 similarity: 89.59% - Preview: Product Marketing Manager | Building Go-to-Market Strategies for Growth Results-driven...
Text 6299 similarity: 89.56% - Preview: Product Marketing Manager | Driving Growth & Customer Engagement Results-driven...
Text 3232 similarity: 89.09% - Preview: Product Marketing Manager | Driving Product Growth through Data-Driven Strategies...
Text 5959 similarity: 88.90% - Preview: Product Marketing Manager | Data-Driven Growth Expert Results-driven Product Marketing...
Text 5635 similarity: 88.84% - Preview: Product Marketing Manager | Driving Growth through Data-Driven Marketing Strategies...
Text 5835 similarity: 88.74% - Preview: Product Marketing Manager | Cloud-Based SaaS Results-driven Product Marketing Manager...
Text 139 similarity: 88.66% - Preview: Product Marketing Manager | Scaling Growth through Data-Driven Strategies Experienced...
Text 6688 similarity: 88.48% - Preview: Product Marketing Manager | Driving Business Growth through Data-Driven Insights...
Text 6405 similarity: 88.27% - Preview: Product Marketing Manager | Scaling SaaS Products for Global Markets...
Text 3439 similarity: 88.11% - Preview: Product Manager | Focused on delivering innovative products that drive...
Text 5958 similarity: 88.00% - Preview: Product Manager Office | Growth Driven by Customer Centricity Highly...
Text 5183 similarity: 87.86% - Preview: Product Marketing Manager | B2B SaaS Experienced Product Marketing Manager...
Text 1329 similarity: 87.81% - Preview: Product Marketing Manager | Scaling Growth for Emerging Tech Startups...
Text 130 similarity: 87.81% - Preview: Product Marketing Manager | Growth Strategies & Launches Results-driven Product...
Text 3423 similarity: 87.78% - Preview: Product Marketing Manager | Scaling B2B SaaS Solutions Experienced Product...
Text 4234 similarity: 87.72% - Preview: Product Manager | Leading Cross-Functional Teams to Drive Business Growth...
Using the mxbai embedding model – as with many of the others – we can clearly see that the results mix Product Managers in with the Product Marketing Managers, which is something we do not want.
Look at the 88.11% – Preview: Product Manager and 88.00% – Preview: Product Manager Office above.
Let's introduce clustering to see if it can help.
Clustering
First, we set up the clusters from the profile embeddings; here, we need to decide on the number of clusters.
I picked 10.
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

embeddings_array = np.array(embeddings_list)
num_clusters = 10 # you can pick another number here
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
kmeans.fit(embeddings_array)
cluster_labels = kmeans.labels_
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings_array)
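If you're unsure how many clusters to use, one common heuristic – not part of the notebook, just a sketch – is to compare silhouette scores for a few values of k and pick one that scores well:
# Optional: compare silhouette scores for a few cluster counts
from sklearn.metrics import silhouette_score

for k in range(5, 16):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(embeddings_array)
    print(k, round(silhouette_score(embeddings_array, labels), 3))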
Then, we need to understand what cluster the query – or job application – will fit into.
# Let's now see how query fits into the clustering
query_embedding_array = np.array(query_embedding_vector).reshape(1, -1)
reduced_query_embedding = pca.transform(query_embedding_array)
# Let's also predict which cluster the query would belong to
query_cluster_label = kmeans.predict(query_embedding_array)[0]
print(f"The query belongs to cluster {query_cluster_label}")
After this, we can visualize the clusters on a 2-dimensional graph – remember that the embeddings have been flattened down to two dimensions, so clusters may sit on top of each other.

You can hover over the different embeddings to see the profiles.

We can also isolate the query, our X, on the graph to see the cluster the model thinks it belongs to.
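If you want to reproduce an interactive plot like this yourself, a minimal sketch with Plotly (assuming it is installed – this is not the exact notebook code) could look like this:
# Minimal interactive cluster plot sketch using the PCA-reduced embeddings
import pandas as pd
import plotly.express as px

plot_df = pd.DataFrame({
    "x": reduced_embeddings[:, 0],
    "y": reduced_embeddings[:, 1],
    "cluster": [str(c) for c in cluster_labels],
    "preview": [' '.join(t.split()[:10]) for t in texts],
})
fig = px.scatter(plot_df, x="x", y="y", color="cluster", hover_data=["preview"])

# Mark the query (our X) on the same graph
fig.add_scatter(
    x=[reduced_query_embedding[0, 0]],
    y=[reduced_query_embedding[0, 1]],
    mode="markers",
    marker_symbol="x",
    marker_size=14,
    name="query",
)
fig.show()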

We can clearly see that the model correctly groups the marketing people into one cluster – including SEO specialists and growth hackers – while leaving out the Product Manager and Product Manager Office profiles.
This is great.
From here, we can now combine both clustering and semantic search to get better results.
Remember to check out the different models and see which does better; you'll find that the bigger models are naturally better at grouping similar profiles, but some smaller models do quite well.
Clustering & Semantic Search
Now that we see that it is able to group the query into the right cluster, we can combine our approach.
# Let's now do semantic search but only in the correct cluster
cluster_indices = np.where(cluster_labels == query_cluster_label)[0]
cluster_embeddings = embeddings_array[cluster_indices]
cluster_texts = [texts[i] for i in cluster_indices]
similarities_in_cluster = []
for idx, emb in zip(cluster_indices, cluster_embeddings):
    sim = cosine_similarity(query_embedding_vector, emb)
    similarities_in_cluster.append((idx, sim))

similarities_in_cluster.sort(key=lambda x: x[1], reverse=True)
top_n = 40 # adjust this number if you want to display more matches
top_matches = similarities_in_cluster[:top_n]

print(f"\nTop {top_n} similar texts in the same cluster as the query:")
for idx, sim in top_matches:
    percentage = (sim + 1) / 2 * 100
    text_preview = ' '.join(texts[idx].split()[:10])
    print(f"Text {idx+1} similarity: {percentage:.2f}% - Preview: {text_preview}...")
If we run the code above, the results no longer include Product Managers; instead, we get back Marketing Managers, which is a better fit in general.
Top 40 similar texts in the same cluster as the query:
Text 3615 similarity: 89.59% - Preview: Product Marketing Manager | Building Go-to-Market Strategies for Growth Results-driven...
Text 3232 similarity: 89.09% - Preview: Product Marketing Manager | Driving Product Growth through Data-Driven Strategies...
Text 5959 similarity: 88.90% - Preview: Product Marketing Manager | Data-Driven Growth Expert Results-driven Product Marketing...
Text 5635 similarity: 88.84% - Preview: Product Marketing Manager | Driving Growth through Data-Driven Marketing Strategies...
Text 5835 similarity: 88.74% - Preview: Product Marketing Manager | Cloud-Based SaaS Results-driven Product Marketing Manager...
Text 139 similarity: 88.66% - Preview: Product Marketing Manager | Scaling Growth through Data-Driven Strategies Experienced...
Text 6688 similarity: 88.48% - Preview: Product Marketing Manager | Driving Business Growth through Data-Driven Insights...
Text 6405 similarity: 88.27% - Preview: Product Marketing Manager | Scaling SaaS Products for Global Markets...
Text 5183 similarity: 87.86% - Preview: Product Marketing Manager | B2B SaaS Experienced Product Marketing Manager...
Text 1329 similarity: 87.81% - Preview: Product Marketing Manager | Scaling Growth for Emerging Tech Startups...
Text 130 similarity: 87.81% - Preview: Product Marketing Manager | Growth Strategies & Launches Results-driven Product...
Text 3423 similarity: 87.78% - Preview: Product Marketing Manager | Scaling B2B SaaS Solutions Experienced Product...
Text 5945 similarity: 87.63% - Preview: Marketing Manager | Driving Growth through Data-Driven Strategies Results-driven marketing...
Text 2664 similarity: 87.59% - Preview: Product Marketing Manager | Driving Growth & Innovation Results-driven Product...
Text 3368 similarity: 87.54% - Preview: Product Marketing Manager | Scaling Growth through Data-Driven Strategies Highly...
Text 5794 similarity: 87.48% - Preview: Product Marketing Manager | Driving Growth through Data-Driven Insights Results-driven...
Text 5685 similarity: 86.71% - Preview: Performance Marketing Manager | Driving Business Growth through Data-Driven Strategies...
Text 5818 similarity: 86.37% - Preview: Digital Marketing Manager | Driving Business Growth through Data-Driven Strategies...
For a real case, you'd ideally want to filter and do classification on this dataset before performing semantic search.
The idea here is for you to compare the different models, especially smaller ones to the bigger ones, to see how much quality you are willing to sacrifice for faster and cheaper inference.
Don't go for a bigger model just because, unless you really need it.
If you want to continue to evaluate the models, you can use RAGAs to evaluate how the retrieval application would do based on the different models.
Notes on Model Performance
I needed to pick something to evaluate performance on, so I chose to look at how well the models separated product managers from product marketing managers.
The bigger models are more capable of returning the correct results before clustering, but all of them initially had issues separating the two.
Ada-002, possibly because it is a lot bigger, did well before clustering, whereas OpenAI's smaller and newer model, text-embedding-3-small, did worse.

However, some of the models also struggled to cluster the profiles correctly. Specifically, the fine-tuned 7B Mistral model and E5 did not do well here, which could be a natural consequence of how they were built.
The rest did about the same, for this specific job profile.
I was surprised at how well mxbai performed, being only 335M in size; this goes to show that the bigger models may be overkill for simpler tasks.
This is only a narrow evaluation of one small thing; I suggest you look at other criteria to evaluate performance for your own task.
Nevertheless, we can continue from here and also add on strategies such as re-ranking to give the best results to an LLM to evaluate.
Re-Ranking
There are many strategies to correct for irrelevant results with RAG pipelines; re-ranking is one.
Re-ranking basically means re-ordering the results so the most relevant ones end up on top. One strategy to achieve this is pairwise ranking.
To do this, you give a pair of results to a model – it could be an LLM – and ask it which of the two profiles is more useful for the job description.
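As a minimal sketch of the idea – assuming the openai package, an API key in the environment, and a placeholder model name – a pairwise LLM re-ranker over the top matches could look something like this:
# Pairwise re-ranking sketch - the model name is a placeholder, and this makes
# one LLM call per comparison, so only use it on a short list of top matches
from functools import cmp_to_key
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def compare_profiles(profile_a, profile_b):
    prompt = (
        f"Job description:\n{application_text}\n\n"
        f"Profile A:\n{profile_a}\n\n"
        f"Profile B:\n{profile_b}\n\n"
        "Which profile is the better fit for the job description? Answer only 'A' or 'B'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder - use whichever LLM you prefer
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content.strip().upper()
    return -1 if answer.startswith("A") else 1  # the winner is sorted first

# Re-rank only the top matches from the clustered semantic search
top_texts = [texts[idx] for idx, _ in top_matches[:10]]
reranked = sorted(top_texts, key=cmp_to_key(compare_profiles))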

You'll have to combine methods for your use case to enable it to perform well.
If you are new to embeddings, I hope you learned something, and if it's not new, then I hope you got a bit of intel about the economics of using smaller versus larger embedding models, be they open source or commercial.
For the larger LLMs, many closed-source models are taking the lead; this is not true when it comes to embedding models.
Something to take with you is to give a smaller, more computationally efficient model a chance.
❤