The Guide to Recommender Metrics

Photo by Darius Cotoi on Unsplash

Think of the YouTube main page, which displays videos that you might like, or Amazon, which suggests products you could buy. These are examples of recommender systems that try to show you the things you are most likely to interact with.

Let us assume that you have built a recommender system with a method of your choice. The question is: how do you evaluate it offline, before you put it into production and let it serve recommendations on a website?

In this article, you will learn exactly that! Additionally, I will tell you why you should be careful with these metrics.


For a more thorough introduction, please refer to my other article, which also shows you how to build a recommender system from scratch using TensorFlow.

Introduction to Embedding-Based Recommender Systems

Offline Evaluation of a Recommender

Let us find a definition for a recommender first that encompasses most systems you might design or find in the wild.

For us, a recommender is an algorithm that takes at least a user as an input and outputs an ordered list of items to recommend to this user.

Image by the author.

Why at least a user? There could be more inputs, such as the time of year, which can help the model learn not to recommend chocolate Santas in summer.

As an example, a fruit recommender R that we have built can do things like R(Alice) = [apple, orange, cherry].
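In code, you can think of such a recommender as a simple function. Here is a minimal sketch of that interface; the function body is a made-up placeholder, not an actual trained model:

```python
from typing import List


def recommend(user: str, k: int = 3) -> List[str]:
    """Return an ordered list of the top-k items to recommend to a user.

    Hypothetical stand-in for a trained model: it returns a fixed
    list so that the interface R(user) -> [items] becomes clear.
    """
    return ["apple", "orange", "cherry"][:k]


print(recommend("Alice"))  # ['apple', 'orange', 'cherry']
```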

Note: The user could be something else, maybe even another article. This could be relevant if you want to build a recommender for alternatives when some article in your shop is out of stock. So, your recommender takes an article as an input and outputs its alternatives. Let us stick with classical user-item recommendations, though.

What follows now is a list of offline recommender metrics that you can use to assess the quality of your recommender system. I will show you how they are defined, and which details to pay attention to when using them.

Train-test splitting

We will assume that we have done some form of train-test splitting to get meaningful metrics. Otherwise, we will only measure the ability of our recommender system to overfit. Assume that we have data like this:

Image by the author.

Read: User A bought/watched/listened to item X (e.g. movie or song). I will go with "bought" for the rest of the article.
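To make this concrete, here is a small hypothetical version of such a table as a pandas DataFrame; the users, items, and dates are invented for illustration:

```python
import pandas as pd

# Hypothetical transaction data: one row per "user bought item" event.
data = pd.DataFrame({
    "user": ["A", "A", "B", "B", "C"],
    "item": ["apple", "cherry", "apple", "orange", "banana"],
    "date": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-01-20", "2023-03-01", "2023-03-15"]
    ),
})
print(data)
```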

There are many options to split, and you should choose one or the other depending on the application.

  1. Random split: You take all rows and randomly split them into train and test.
  2. Temporal split: You select a threshold date and put all entries before this date into train, and the rest into test.

Usually, I go for a temporal split whenever I have a date column, because it mirrors exactly how the model is meant to be used: we train a model on past data, and we want it to perform well in the future. But consider whether this makes sense in your case as well. Both options are sketched in code below.
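Here is a minimal sketch of both splits, reusing the hypothetical data frame from above; the split ratio and threshold date are arbitrary choices:

```python
# Option 1: random split, e.g. 80% train / 20% test.
train_random = data.sample(frac=0.8, random_state=42)
test_random = data.drop(train_random.index)

# Option 2: temporal split at a chosen threshold date.
threshold = pd.Timestamp("2023-02-15")
train_temporal = data[data["date"] < threshold]
test_temporal = data[data["date"] >= threshold]
```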

Alright, after we have defined some split, let us proceed to the metrics!

Regression Metrics

In the best case, you don't only have transactional data à la "user A bought item X", but also some kind of explicit feedback, such as a user rating (1–5 stars). In this case, you can let your model output not only a list of items but also their predicted ratings. Then you can take any regression metric you like, such as the mean squared error or the mean absolute error, to measure how far your predicted ratings are from the actual ones. This is nothing new, hence we leave it at that.
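If you have such ratings and predictions, any regression metric will do. A minimal sketch with scikit-learn; the rating values are made up for illustration:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual vs. predicted star ratings for some user-item pairs.
actual_ratings = [4, 2, 5, 3]
predicted_ratings = [3.8, 2.5, 4.6, 3.1]

print(mean_squared_error(actual_ratings, predicted_ratings))   # MSE
print(mean_absolute_error(actual_ratings, predicted_ratings))  # MAE
```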

Alignment Metrics for Recommender Systems From Implicit Feedback

We will assume that we do not have any explicit feedback such as a star rating, but only implicit feedback such as "user A bought item X" from now on. For more information about implicit feedback, check out at least the introduction of my other article:

Recommender Systems From Implicit Feedback Using TensorFlow Recommenders

Preparation

All the metrics that we will cover work with two ingredients:

  1. the list of items that a user bought in the test set, and
  2. the recommendation list (this is the prediction!) for the same user, produced after training the model on the training set.

The following metrics measure how well these two lists are aligned. Let us see how in detail.

Let us assume that we have only a single user Alice for now:
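Sticking with the fruit example, the two ingredients for Alice could look like this; the concrete items are invented for illustration:

```python
# Ingredient 1: what Alice actually bought in the test set.
actual_items = ["apple", "cherry", "banana"]

# Ingredient 2: what the model, trained on the training set, recommends
# for Alice (ordered: the first item is the strongest recommendation).
recommended_items = ["apple", "orange", "cherry"]
```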
