How to Measure Drift in ML Embeddings

Why monitor embedding drift?
When ML systems are in production, you often do not immediately get the ground truth labels. The model predicts or classifies something, but you do not know how accurate it is. You must wait a bit (or a lot!) to get the labels to measure the true model quality.
Of course, nothing beats measuring the actual performance. But if that is not possible due to feedback delay, there are valuable proxies to look at. One is detecting drift in model predictions ("Does the model output look different from before?") and model inputs ("Is the data fed to the model different?").
Detecting prediction and data drift can serve as an early warning. It helps you see if the model is getting new inputs it might be ill-equipped to handle. Understanding that there is a change in the environment can also help identify ways to improve the model.
What type of issues can you detect? For example, if your model is doing text classification, you might want to notice a new topic, a shift in sentiment, a change in class balance, texts in a new language, or an influx of spam or corrupted inputs.

But how exactly to detect this change?
If you work with structured data, there are many methods you can use to detect drift: from tracking the descriptive statistics of the model outputs (min-max ranges, the share of predicted classes, etc.) to more sophisticated distribution drift detection methods, from statistical tests like Kolmogorov-Smirnov to distance metrics like the Wasserstein distance.
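As a minimal sketch, here is how you might run such checks on a single numerical column with SciPy; the data is synthetic and only for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

# Synthetic example: one numerical feature from the reference and current periods
reference_values = np.random.normal(loc=0.0, scale=1.0, size=1000)
current_values = np.random.normal(loc=0.3, scale=1.0, size=1000)

# Kolmogorov-Smirnov test: a small p-value suggests the distributions differ
statistic, p_value = ks_2samp(reference_values, current_values)

# Wasserstein (Earth-Mover) distance, often normalized by the reference
# standard deviation so the score reads as "shift in standard deviations"
wd = wasserstein_distance(reference_values, current_values) / np.std(reference_values)
```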
However, if you are working with NLP or LLM-powered applications, you deal with unstructured data. Often in the form of numerical representations – embeddings. How can you detect drift in these numerical representations?

This is a fairly new topic, and there are no "best practice" methods yet. To help shape the intuition about different embedding drift detection approaches, we ran a set of experiments. In this article, we will give an overview of the methods we evaluated and introduce Evidently, an open-source Python library to detect embedding drift – among other things.
Drift detection basics
The idea behind drift detection is to get an alert when "the data changes significantly." The focus is on the overall distribution.
This is different from outlier detection, when you want to detect individual data points that are different from the rest.
To measure drift, you need to decide what your reference dataset is. Often, you can compare your current production data to the validation data or some past period that you consider representative. For example, you can compare this week's data to the previous week and move the reference as you go.
This choice is specific to the use case: you need to formulate an expectation on how stable or volatile your data is and choose the reference data that adequately captures what you expect to be a "typical" distribution of the input data and model responses.
You must also choose the drift detection method and tune the alert threshold. There are different ways to compare two datasets and different degrees of change you might consider meaningful. Sometimes you care about a tiny deviation, and sometimes only about a significant shift. To tune the threshold, you can model your drift detection framework on historical data or, alternatively, start with a sensible default and adjust it as you go.
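For instance, here is a toy sketch of a moving weekly reference, assuming a DataFrame with a timestamp column; the drift function is a simple stand-in that you would swap for any of the methods discussed below, and the threshold is just an example value:

```python
import numpy as np
import pandas as pd

# Toy data: one numerical value per day over four weeks
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=28, freq="D"),
    "value": rng.normal(size=28),
})

def drift_score(reference: pd.Series, current: pd.Series) -> float:
    # Placeholder: absolute difference of means, in reference standard deviations
    return abs(current.mean() - reference.mean()) / (reference.std() + 1e-9)

THRESHOLD = 0.5  # example value; tune it on historical data

# Compare each week to the previous one and move the reference as you go
weeks = [batch for _, batch in data.groupby(pd.Grouper(key="timestamp", freq="W"))]
for reference, current in zip(weeks[:-1], weeks[1:]):
    if drift_score(reference["value"], current["value"]) > THRESHOLD:
        print("Drift alert for the week starting", current["timestamp"].min().date())
```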
Let's get to the details of "how." Say, you are working with a text classification use case and want to compare how your datasets (represented as embeddings) shift week by week. You have two embedding datasets to compare. How exactly do you measure the "difference" between them?
We'll review five possible approaches.
Euclidean distance

You can average the embeddings in two distributions and get a representative embedding for each dataset. Then, you measure the Euclidean distance between them. This way, you compare "how far" two vectors are from each other in a multi-dimensional space.
Euclidean distance is a straightforward and familiar metric: it measures the length of the line connecting the two embeddings. There are also other distance metrics that you can use, such as Cosine, Manhattan, or Chebyshev distance.
As a drift score, you will receive a number that can go from 0 (for identical averaged embeddings) to infinity. The higher the value, the farther apart the two distributions are.
This behavior is intuitive, but one possible downside is that Euclidean distance is an absolute measure. This makes setting a specific drift alert threshold harder: the definition of "far" will vary based on the use case and the embedding model used. You need to tune the threshold individually for different models you monitor.
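As an illustration, here is a minimal NumPy sketch, assuming the embeddings are stored as (n_samples, n_dimensions) arrays; the random data is a stand-in:

```python
import numpy as np

reference_embeddings = np.random.rand(1000, 384)  # stand-in data
current_embeddings = np.random.rand(1200, 384)

# Average the embeddings in each dataset to get one representative vector each
ref_mean = reference_embeddings.mean(axis=0)
cur_mean = current_embeddings.mean(axis=0)

# Euclidean distance between the two averaged embeddings: 0 for identical,
# growing without bound as the datasets move apart
drift_score = float(np.linalg.norm(ref_mean - cur_mean))
```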
Cosine distance

Cosine distance is another popular distance measure. Instead of measuring the "length," it calculates the cosine of the angle between vectors.
Cosine similarity is widely used in machine learning for tasks like search, information retrieval, or recommendation systems. To turn it into a distance, you subtract the cosine similarity from 1.
Cosine distance = 1 – Cosine similarity
If the two distributions are the same, the Cosine similarity will be 1, and the distance will be 0. The distance can take values from 0 to 2.
In our experiments, we found that the threshold might not be very intuitive to tune, since values as low as 0.001 can correspond to a change you already want to detect. Choose the threshold wisely! Another downside is that it does not work well if you apply dimensionality reduction methods like PCA: the results can become unpredictable.
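Here is the same mean-embedding comparison using cosine distance, again as a sketch with stand-in arrays:

```python
import numpy as np

reference_embeddings = np.random.rand(1000, 384)  # stand-in data, as above
current_embeddings = np.random.rand(1200, 384)

ref_mean = reference_embeddings.mean(axis=0)
cur_mean = current_embeddings.mean(axis=0)

# Cosine similarity between the averaged embeddings
cosine_similarity = np.dot(ref_mean, cur_mean) / (
    np.linalg.norm(ref_mean) * np.linalg.norm(cur_mean)
)
cosine_distance = 1 - cosine_similarity  # 0 for identical directions, up to 2
```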
Model-based drift detection

The classifier-based drift detection method follows a different idea. Instead of measuring the distance between the average embeddings, you can train a classification model that tries to identify to which distribution each embedding belongs.
If the model can confidently predict from which distribution the specific embedding comes, it is likely that the two datasets are sufficiently different.
You can use the ROC AUC score of this classifier, computed on a validation dataset, as a drift score and set a threshold accordingly. A score above 0.5 shows at least some predictive power, and a score of 1 corresponds to "absolute drift," when the model can always tell which distribution the data belongs to.
You can read more about the method in the paper "Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift."
Based on our experiments, this method is an excellent default. It works consistently for different datasets and embedding models we tested, both with and without PCA. It also has an intuitive threshold that any data scientist is familiar with.
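A minimal sketch of this domain-classifier approach with scikit-learn might look like this; the arrays are stand-ins, and the 0.55 alert threshold is only an example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

reference_embeddings = np.random.rand(1000, 384)  # stand-in data
current_embeddings = np.random.rand(1200, 384)

# Label reference rows 0 and current rows 1, then train a classifier
# to tell the two datasets apart
X = np.vstack([reference_embeddings, current_embeddings])
y = np.hstack([np.zeros(len(reference_embeddings)), np.ones(len(current_embeddings))])
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC AUC on the held-out split is the drift score:
# ~0.5 means the datasets are indistinguishable, 1.0 means "absolute drift"
drift_score = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
drift_detected = drift_score > 0.55  # example threshold
```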
Share of drifted components

The idea behind this method is to treat embeddings as structured tabular data and apply numerical drift detection methods – the same you would use to detect drift in numerical features. The individual components of each embedding are treated as "columns" in a structured dataset.
Of course, unlike numerical features, these "columns" do not have interpretable meaning. They are some coordinates of the input vector. However, you can still measure how many of these coordinates drift. If many do, there is likely a meaningful change in the data.
To apply this method, you first must compute the drift in each component. In our experiments, we used the Wasserstein (Earth-Mover) distance with a 0.1 threshold. The intuition is that with the threshold set to 0.1, you will notice changes of roughly the size of 0.1 standard deviations of a given component.
Then, you can measure the overall share of drifting components. For example, if your vector length is 400, you can set the threshold to 20%. If over 80 components drift, you will get a drift detection alert.
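Here is one way to sketch this with SciPy, normalizing the Wasserstein distance by the reference standard deviation; the exact normalization and defaults in a library implementation may differ:

```python
import numpy as np
from scipy.stats import wasserstein_distance

reference_embeddings = np.random.rand(1000, 400)  # stand-in data
current_embeddings = np.random.rand(1200, 400)

def share_of_drifted_components(ref, cur, component_threshold=0.1):
    """Flag a component as drifted if its normalized Wasserstein distance
    exceeds the threshold, then return the share of drifted components."""
    drifted = 0
    for i in range(ref.shape[1]):
        norm = np.std(ref[:, i]) or 1.0  # avoid division by zero
        if wasserstein_distance(ref[:, i], cur[:, i]) / norm > component_threshold:
            drifted += 1
    return drifted / ref.shape[1]

drift_share = share_of_drifted_components(reference_embeddings, current_embeddings)
dataset_drift = drift_share > 0.2  # alert if over 20% of the components drift
```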
The benefit of this method is that you can measure drift on a scale of 0 to 1. You can also reuse familiar techniques that you might be using to detect drift on tabular data. (There are different methods like K-L divergence or various statistical tests).
However, for some, this might be a limitation: you have a lot of parameters to set. You can tweak the underlying drift detection method, its threshold, and the share of drifting components to react to.
All in all, we believe this method has its merits: it performs consistently across different embedding models and has a reasonable computation speed.
Maximum mean discrepancy (MMD)

You can use MMD to measure the multi-dimensional distance between the means of the vectors. The goal is to distinguish between two probability distributions p and q based on the mean embeddings µp and µq of the distributions in the reproducing kernel Hilbert space F.
Formally:
MMD(p, q) = ‖µp − µq‖F
Here is another way to represent MMD, where K is a kernel function defined in that space:
MMD²(p, q) = E[K(x, x′)] + E[K(y, y′)] − 2E[K(x, y)], with x, x′ drawn from p and y, y′ drawn from q.
You can think of K as some measure of closeness. The more similar the objects, the larger this value. If the two distributions are the same, MMD should be 0. If the two distributions are different, MMD will increase.
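As an illustration, here is a sketch of a biased estimate of squared MMD with an RBF kernel in NumPy; the kernel choice and the gamma parameter are assumptions you would tune (for example, via the median heuristic):

```python
import numpy as np

reference_embeddings = np.random.rand(1000, 384)  # stand-in data
current_embeddings = np.random.rand(1200, 384)

def rbf_kernel(a, b, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2) for all pairs of rows in a and b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2 * a @ b.T
    )
    return np.exp(-gamma * sq_dists)

def mmd_squared(x, y, gamma=1.0):
    """Biased estimate of squared MMD: mean within-set similarities
    minus twice the mean cross-set similarity."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2 * rbf_kernel(x, y, gamma).mean()
    )

drift_score = mmd_squared(reference_embeddings, current_embeddings)
```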

You can read more in the paper "A Kernel Method for the Two-Sample Problem."
You can use the MMD measure as a drift score. The downside of this approach is that many are not familiar with the method, and the threshold is non-interpretable. The computation is typically longer than with other methods. We recommend using it if you have a reason to and a solid understanding of the math behind it.
Which method to choose?
To shape some intuition behind the methods, we ran a set of experiments by introducing an artificial shift to three datasets and estimating the drift results as we increased it. We also tested the computation speed.
You can find all the code and experiment details in a separate blog post.
Here are our suggestions:
- Model-based drift detection, using the ROC AUC score of a domain classifier as the drift score, is an excellent default.
- Tracking the share of drifted embedding components on a scale from 0 to 1 is a close second. Just remember to tweak the thresholds if you apply dimensionality reduction.
- If you want to measure the "size" of drift in time, a metric like Euclidean distance is a good choice. However, you need to decide how you design alerting since you will deal with absolute distance values.

It's important to keep in mind that drift detection is a heuristic. It is a proxy for possible issues. You might need to experiment with the approach: not only pick the method but also tune the threshold to your data. You must also think through the choice of the reference window and make an informed assumption about what change you consider meaningful. This will depend on your error tolerance, use case, and expectations of how well the model generalizes.
You can also separate drift detection from debugging. Once you get an alert on the possible embedding drift, the next step is to investigate what changed exactly. In this case, you must inevitably look back at the raw data.
If possible, you can even start by evaluating drift in the raw data in the first place. This way, you can get valuable insights, such as identifying the top words that help the classifier decide which distribution a text belongs to, or tracking interpretable text descriptors such as text length or the share of out-of-vocabulary (OOV) words.

Evidently open-source
We implemented all the mentioned drift detection methods in Evidently, an open-source Python library to evaluate, test, and monitor ML models.
You can run it in any Python environment. Simply pass a DataFrame, select which columns contain embeddings, and choose the drift detection method (or go with the defaults!). You can also implement these checks as part of a pipeline and use 100+ other checks for data and model quality.
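As a sketch, here is roughly what this looks like with the Evidently API from the time of writing (around version 0.4); imports and names may differ in later releases, so check the current documentation:

```python
import numpy as np
import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import EmbeddingsDriftMetric

# Stand-in data: each embedding component lives in its own column
emb_columns = [f"emb_{i}" for i in range(32)]
reference_df = pd.DataFrame(np.random.rand(500, 32), columns=emb_columns)
current_df = pd.DataFrame(np.random.rand(500, 32), columns=emb_columns)

# Tell Evidently which columns together form the embedding
column_mapping = ColumnMapping(embeddings={"text_embeddings": emb_columns})

# Default drift detection method; a specific method can also be passed
report = Report(metrics=[EmbeddingsDriftMetric("text_embeddings")])
report.run(reference_data=reference_df, current_data=current_df, column_mapping=column_mapping)
report.save_html("embedding_drift.html")  # or report.show() in a notebook
```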

You can explore the Getting Started tutorial to understand the Evidently capabilities or jump directly to a code example on embedding drift detection.
Conclusions
- Drift detection is a valuable tool for production ML model monitoring. It helps detect changes in the input data and model predictions. You can rely on it as a proxy indicator of the model quality and a way to alert you about potential changes.
- Drift detection is not limited to working with tabular data. You can also monitor for drift in ML embeddings – for example, when running NLP or LLM-based applications.
- In this article, we introduced five methods for embedding drift monitoring: Euclidean and Cosine distance, Maximum Mean Discrepancy, model-based drift detection, and tracking the share of drifted embedding components using numerical drift detection methods. All these methods are implemented in the open-source Evidently Python library.
- We recommend model-based drift detection as a sensible default approach. This is due to the ability to use the ROC AUC score as an interpretable drift detection measure that is easy to tweak and work with. This method also performs consistently across different datasets and embedding models, which makes it convenient to use in practice across different projects.
—
This article is based on the research published on the Evidently blog. Thanks to Olga Filippova for co-authoring the article.