How to Build Popularity-Based Recommenders With Polars

Recommender systems are algorithms designed to provide users with recommendations based on their past behavior, preferences, and interactions. Now integral to industries such as e-commerce, entertainment, and advertising, recommender systems improve user experience, increase customer retention, and drive sales.
While various advanced recommender systems exist, today I want to show you one of the most straightforward – yet often difficult to beat – recommenders: the popularity-based recommender. It is an excellent baseline recommender that you should always try out in addition to a more advanced model, such as matrix factorization.
We will create two different flavors of popularity-based recommenders using polars in this article. Don't worry if you have not used the fast pandas-alternative Polars before; this article is a great place to learn it along the way. Let's start!
Initial Thoughts
Popularity-based recommenders work by suggesting the most frequently purchased products to customers. This vague idea can be turned into at least two concrete implementations:
- Check which articles are bought most often across all customers. Recommend these articles to each customer.
- Check which articles are bought most often per customer. Recommend these per-customer articles to their corresponding customer.
We will now show how to implement these concretely using our own custom-created dataset.
If you want to follow along with a real-life dataset, the H&M Personalized Fashion Recommendations challenge on Kaggle provides you with an excellent example. Due to copyright reasons, I will not use this lovely dataset for this article.
The Data
First, we will create our own dataset. Make sure to install polars if you haven't done so already:
pip install polars
Then, let us create random data consisting of (customer_id, article_id) pairs that you should interpret as "The customer with this ID bought the article with that ID." We will use 1,000,000 customers that can buy 50,000 products.
import numpy as np
np.random.seed(0)
N_CUSTOMERS = 1_000_000
N_PRODUCTS = 50_000
N_PURCHASES_MEAN = 100 # customers buy 100 articles on average
with open("transactions.csv", "w") as file:
    file.write("customer_id,article_id\n")  # header
    for customer_id in range(N_CUSTOMERS):
        n_purchases = np.random.poisson(lam=N_PURCHASES_MEAN)
        articles = np.random.randint(low=0, high=N_PRODUCTS, size=n_purchases)
        for article_id in articles:
            file.write(f"{customer_id},{article_id}\n")  # transaction as a row

This medium-sized dataset has about 100,000,000 rows (transactions), an amount you could well encounter in a business context.
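That row count follows directly from the generation parameters above: each of the 1,000,000 customers buys 100 articles on average, so by linearity of expectation we get about 100 million transactions.

```python
N_CUSTOMERS = 1_000_000
N_PURCHASES_MEAN = 100  # Poisson mean per customer

# expected total rows = customers × average purchases per customer
expected_rows = N_CUSTOMERS * N_PURCHASES_MEAN
print(expected_rows)  # 100000000
```

The actual count fluctuates slightly around this value because the per-customer purchase counts are random.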
The Task
We now want to build recommender systems that scan this dataset in order to recommend popular items in some sense. We will shed light on two variants of how to interpret this:
- most popular across all customers
- most popular per customer
Our recommenders should recommend ten articles for each customer.
Note: We will not assess the quality of the recommenders here. Drop me a message if you are interested in this topic, though, since it's worth having a separate article about this.
Most Popular Across All Customers
In this recommender, we don't even care who bought the articles – all the information we need is in the article_id column alone.
High-level, it works like this:
- Load the data.
- Count how often each article appears in the column article_id.
- Return the ten most frequent products as the recommendation for each customer.
Familiar Pandas Version
As a gentle start, let us check out how you could do this in pandas.
import pandas as pd
data = pd.read_csv("transactions.csv", usecols=["article_id"])
purchase_counts = data["article_id"].value_counts()
most_popular_articles = purchase_counts.head(10).index.tolist()
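To see the mechanics of this chain on a toy example (a small sketch with made-up article IDs, not the real dataset):

```python
import pandas as pd

# six toy transactions: article 3 bought three times, article 2 twice, article 1 once
articles = pd.Series([3, 1, 3, 2, 3, 2], name="article_id")

# value_counts sorts by frequency in descending order by default
top_two = articles.value_counts().head(2).index.tolist()
print(top_two)  # [3, 2]
```

The article IDs end up in the index of the `value_counts` result, which is why we call `.index.tolist()` rather than `.tolist()`.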
On my machine, this takes about 31 seconds. That may not sound like much, but the dataset is still only moderately sized; things get really ugly with larger datasets. To be fair, about 10 of those seconds are spent loading the CSV file. Using a better format, such as Parquet, would decrease the loading time.
Note: I used pandas 2.0.1, the latest version at the time of writing.
Still, to prepare a little more for the Polars version, let us rewrite the pandas version using method chaining, a technique I have grown to love.
most_popular_articles = (
    pd.read_csv("transactions.csv", usecols=["article_id"])
    .squeeze()  # turn the dataframe with one column into a series
    .value_counts()
    .head(10)
    .index
    .tolist()
)
This is lovely since you can read from top to bottom what is happening, without the need for a lot of intermediate variables that people usually struggle to name (df_raw → df_filtered → df_filtered_copy → … → df_final, anyone?). The run time is the same, however.
Faster Polars Version
Let us implement the same logic in polars using method chaining as well.
import polars as pl
most_popular_articles = (
    pl.read_csv("transactions.csv", columns=["article_id"])
    .get_column("article_id")
    .value_counts()
    .sort("counts", descending=True)  # value_counts does not sort automatically
    .head(10)
    .get_column("article_id")  # there are no indices in polars
    .to_list()
)
Things look pretty similar, except for the running time: 3 seconds instead of 31, which is impressive!
Polars is just SO much faster than pandas.
Unarguably, this is one of the main advantages of polars over pandas. Apart from that, polars also has a convenient syntax for creating complex operations that pandas does not have. We will see more of that when creating the other popularity-based recommender.
It is also important to note that pandas and polars produce the same output as expected.
Most Popular Per Customer
In contrast to our first recommender, we want to slice the dataframe per customer now and get the most popular products for each customer. This means that we need the customer_id as well as the article_id now.
We illustrate the logic using a small dataframe consisting of only ten transactions from three customers A, B, and C buying four articles 1, 2, 3, and 4. We want to get the top two articles per customer. We can achieve this using the following steps:

- We start with the original dataframe.
- We then group by customer_id and article_id and aggregate via a count.
- We then aggregate again over the customer_id and write the article_ids in a list, just as in our last recommender. The twist is that we sort this list by the count column.
That way, we end up with precisely what we want.
- A bought products 1 and 2 most frequently.
- B bought products 4 and 2 most frequently. Products 4 and 1 would have been an equally correct solution, but the internal ordering happened to break the tie between products 1 and 2 in favor of product 2.
- C only bought product 3, so that's all there is.
Step 3 of this procedure sounds especially difficult, but polars lets us handle this conveniently.
most_popular_articles_per_user = (
    pl.read_csv("transactions.csv")
    .group_by(["customer_id", "article_id"])
    .agg(pl.count())  # step 2: count purchases per (customer, article) pair
    .group_by("customer_id")
    .agg(pl.col("article_id").sort_by("count", descending=True).head(10))  # step 3
)
By the way: This version already runs for about a minute on my machine. I did not create a pandas version for this, and I'm definitely scared to do so and let it run. If you are brave, give it a try!
A Small Improvement
So far, some users might have fewer than ten recommendations, and some might even have none. An easy fix is to pad each customer's recommendations to ten articles. For example,
- using random articles, or
- using the most popular articles across all customers from our first popularity-based recommender.
We can implement the second version like this:
improved_recommendations = (
    most_popular_articles_per_user
    .with_columns([
        pl.col("article_id").fill_null([]).alias("personal_top_<=10"),
        pl.Series([most_popular_articles]).alias("global_top_10"),
    ])
    .with_columns(
        pl.col("personal_top_<=10")
        .list.concat(pl.col("global_top_10"))
        .list.head(10)
        .alias("padded_recommendations")
    )
    .select(["customer_id", "padded_recommendations"])
)
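One caveat with this padding: simply concatenating the global top ten can duplicate articles that are already in a customer's personal list. A plain-Python sketch of duplicate-free padding (with made-up example lists, not tied to the dataframes above) could look like this:

```python
def pad_recommendations(personal, global_top, k=10):
    """Pad a personal recommendation list up to k items with global
    favorites, skipping articles that are already recommended."""
    padded = list(personal)
    seen = set(personal)
    for article in global_top:
        if len(padded) >= k:
            break
        if article not in seen:
            padded.append(article)
            seen.add(article)
    return padded

print(pad_recommendations([3, 7], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
# [3, 7, 1, 2, 4, 5, 6, 8, 9, 10]
```

The same skip-duplicates logic could also be expressed with Polars list expressions, at the cost of a slightly longer chain.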
Conclusion
Popularity-based recommenders hold a significant position in the realm of recommendation systems due to their simplicity, ease of implementation, and effectiveness as an initial approach and a difficult-to-beat baseline.
In this article, we have learned how to transform the simple idea of popularity-based recommendations into code using the fabulous polars library.
The main disadvantage, especially of the personalized popularity-based recommender, is that the recommendations are not inspiring in any way. People have seen all of the recommended things before, meaning they are stuck in an extreme echo chamber.
One way to mitigate this problem to some extent is to use other approaches, such as collaborative filtering or hybrid models.
I hope that you learned something new, interesting, and valuable today. Thanks for reading!
If you have any questions, write me on LinkedIn!
And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I'm still searching for writers!