CatBoost: Gradient Tree Boosting for Recommender Systems, Classification and Regression
Introduction
In today's digital world, where information overload and wide product offerings are the norm, being able to help customers find what they need and like can be an important factor in making our company stand out and getting ahead of the competition.
Recommender systems can enhance digital experiences by facilitating the search for relevant information or products. At their core, these systems leverage data-driven algorithms to analyze user preferences, behaviors, and interactions, transforming raw data into meaningful recommendations tailored to individual tastes and preferences.
In this article, I provide a detailed explanation of how Gradient Tree Boosting works for classification, regression and recommender systems. I also introduce CatBoost, a state-of-the-art library for Gradient Tree Boosting, and explain how it handles categorical features. Finally, I explain how YetiRank (a ranking loss function) works and how to implement it with CatBoost Ranker on a book recommendation dataset.

As always, the code is available on GitHub.
Gradient Tree Boosting
The idea of boosting relies on the hypothesis that a combination of sequential weak learners can be as good as or even better than a strong learner [1]. A weak learner is an algorithm whose performance is at least slightly better than random chance and, in the case of Gradient Tree Boosting, the weak learner is a Decision Tree. In a boosting setup, each new weak learner is trained to handle the more complex observations that the previous ones could not solve, so it can focus on learning those harder patterns.
AdaBoost
The first boosting algorithm with great success for binary classification was AdaBoost [2]. The weak learner in AdaBoost is a decision tree with a single split, and it works by putting more weight on observations that are harder to classify. New weak learners are added sequentially to focus their training on those harder patterns. The final prediction is made by a majority vote of all weak learners, weighted by their individual accuracy.
AdaBoost explained for binary classification
Consider this classification problem where we want to predict if a customer will churn or not based on how many days have passed since the last visit (recency), how many times the customer came to our store in the last 30 days (frequency) and how much the customer spent in the last 30 days (monetary value). Initially, each observation has the same weight (1/4), which is 1 divided by the number of observations in our dataset.

As mentioned before, AdaBoost is composed of decision trees with a maximum depth of 1. We create a decision tree for each of the features and then calculate the Gini Index of each tree. The tree with the lowest Gini Index will be our first tree. Since, in our case, all of the features produce the same Gini Index, let's use Recency for the first tree.

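As a quick illustration of how each stump is evaluated, below is a minimal sketch of the weighted Gini Index of a binary split; the function names and the churn labels on each side of the split are made up for illustration and do not come from the example table.
def gini(labels):
    # Gini impurity of one node: 1 - sum of squared class proportions
    n = len(labels)
    if n == 0:
        return 0.0
    p_churn = sum(labels) / n
    return 1 - (p_churn ** 2 + (1 - p_churn) ** 2)

def weighted_gini(left_labels, right_labels):
    # Gini Index of a split: impurity of each child node weighted by its size
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

# e.g. a split such as "recency > 30 days", with hypothetical churn labels on each side
print(weighted_gini([1, 0], [0, 0]))  # 0.25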
The first weak learner is able to correctly classify 3 out of 4 customers, and now we can calculate the new weight of each observation. First, we calculate the importance of this weak learner, which is given by the following equation:
importance = 1/2 * ln((1 - Total Error) / Total Error)
The Total Error is the sum of the weights of all misclassified observations which, in our example, is 1/4 since we only misclassified one sample. Plugging it into the equation, the importance is ~0.55.
Now, we can calculate the new weight for our observations based on the following equations:
new weight = weight * e^(importance), for misclassified samples
new weight = weight * e^(-importance), for correctly classified samples
Following the equations, the weight of the misclassified sample becomes 0.43 while the weight of the well-classified samples becomes 0.14. Since the sum of all weights must be 1, they are normalized, which gives 0.51 for the misclassified sample and 0.16 for the well-classified samples. The new weights allow us to create buckets from which a new dataset is sampled. The weight of the misclassified sample is higher, therefore the probability of choosing it more than once is also higher, making the second weak learner focus on learning that pattern.

Finally, a new dataset is created based on random draws of numbers between 0 and 1. Suppose the 4 random numbers AdaBoost draws are 0.10, 0.70, 0.98 and 0.40; that means we will have a dataset with 1x customer 1, 2x customer 4 and 1x customer 3. The whole process is repeated until a low training error is achieved.
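To make the arithmetic concrete, here is a minimal sketch of the importance and weight-update computation, assuming customer 4 is the misclassified one (consistent with the resampled dataset above):
import numpy as np

# initial weights: 1 / number of observations
weights = np.array([0.25, 0.25, 0.25, 0.25])
misclassified = np.array([False, False, False, True])  # assumed: customer 4 is the misclassified one

# importance of the weak learner: 0.5 * ln((1 - total_error) / total_error)
total_error = weights[misclassified].sum()                   # 0.25
importance = 0.5 * np.log((1 - total_error) / total_error)   # ~0.55

# weight update: increase misclassified weights, decrease the rest, then normalize to sum to 1
new_weights = np.where(misclassified, weights * np.exp(importance), weights * np.exp(-importance))
new_weights /= new_weights.sum()
print(importance, new_weights)
# importance ~0.55, normalized weights ~[0.17, 0.17, 0.17, 0.50]
# (the 0.16/0.51 in the text come from rounding the intermediate weights first)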
From AdaBoost to Gradient Tree Boosting
AdaBoost was further developed into a numerical optimization problem called Gradient Tree Boosting [3], where the objective is to minimize the loss of the model by sequentially adding new weak learners. The loss is minimized using a gradient descent approach, which expanded the technique to support regression and multi-class classification. The loss function depends on the type of problem we want to solve: for regression we can use mean absolute error (MAE) or root mean squared error (RMSE), and for classification we can use logarithmic loss (LogLoss) or cross entropy.
Gradient Tree Boosting explained for regression
Let's imagine this regression problem where we want to predict the price of a house based on the number of bathrooms, bedrooms, parking lots, the size of the house and a binary variable that identifies if it has air conditioning or not.

Now, let's consider a Gradient Tree Boosting algorithm with 2 estimators (weak learners) with a maximum depth of 2 instead of 1 as in AdaBoost (different maximum depths and number of estimators can be used to avoid over/under-fitting).
The first thing to do is to predict the average price of our dataset which is 301k$. Then, we calculate the residuals from the difference between the average and the actual prices.

The first estimator will try to improve the error by predicting the residuals. For that, it uses area for the first split and, for the second split, air conditioning and parking lots, producing a prediction of 109k$, 399k$, 51k$ and -201k$.
The ground truth is 104k$, 49k$, -1k$ and -151k$, which means we have 5k$, 50k$, 50k$ and -50k$ of residuals to be predicted by the second estimator in order to reduce the difference between the actual and the predicted price. The second estimator uses #bedrooms for the first split and, for the second split, #bathrooms and area, producing a prediction of 5k$, 30k$, 50k$ and -30k$.
The final prediction will be a weighted sum of the predicted values: the average plus the predictions of estimators 1 and 2 weighted by the learning rate, as shown in Figure 3.

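Below is a minimal sketch of that final weighted sum, assuming house prices of 405k$, 350k$, 300k$ and 150k$ (consistent with the residuals and the 301.25k$ average used later in the article) and a learning rate of 0.1, which is an assumption since the figure does not state it:
import numpy as np

# house prices in k$ (assumed, consistent with the residuals around the ~301k$ average)
actual = np.array([405, 350, 300, 150])
baseline = actual.mean()                     # 301.25 -> the initial prediction

est1 = np.array([109, 399, 51, -201])        # residuals predicted by the first estimator
est2 = np.array([5, 30, 50, -30])            # residuals predicted by the second estimator
learning_rate = 0.1                          # assumed value; tunable in practice

# final prediction: average + learning_rate * (sum of the estimators' predictions)
final_prediction = baseline + learning_rate * (est1 + est2)
print(final_prediction)  # each estimator only contributes a fraction of its residual prediction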
CatBoost
CatBoost [4] is a gradient boosting toolkit that promises to tackle target leakage present in most of the existing implementations of gradient boosting algorithms by combining ordered boosting and an innovative way of processing categorical features.
The way CatBoost avoids leakage when encoding categorical features starts with treating the data sequentially, i.e., handling one row at a time. Instead of encoding with an overall mean, CatBoost uses a defined Prior (usually 0.05, which can be set in the hyperparameters of the model). The Prior is summed to the number of previous rows that have the same category and a positive target (OptionCount), and the result is divided by the number of previous rows with the same category (n) plus 1:
encoded value = (OptionCount + Prior) / (n + 1)
Let's use the previous classification example and add a categorical feature Gender to encode using the CatBoost logic. In our case, the first row will have the encoded value 0.05 because we have not previously seen the combination M-Churn and it is the first row with Gender = M; the same happens for the second row. For the third row, we have already seen a combination F-Churn, therefore the encoded value is (1+0.05)/(1+1) = 0.525. For the last row, since none of the previous rows with Gender = M has Churn = 1, its encoded value is (0+0.05)/(1+1) = 0.025.

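A minimal sketch of this ordered counting logic could look like the snippet below; the helper name is mine, and the churn labels of the last two rows are assumptions that do not change the encoded values:
def ordered_target_encoding(categories, targets, prior=0.05):
    # CatBoost-style encoding: only rows seen *before* the current one are used,
    # which avoids leaking the current row's target into its own encoding
    encoded = []
    for i, cat in enumerate(categories):
        previous = [t for c, t in zip(categories[:i], targets[:i]) if c == cat]
        option_count = sum(previous)   # previous rows with the same category and a positive target
        n = len(previous)              # previous rows with the same category
        encoded.append((option_count + prior) / (n + 1))
    return encoded

gender = ['M', 'F', 'F', 'M']
churn = [0, 1, 1, 0]   # churn of rows 3 and 4 is assumed; it does not affect the encodings below
print(ordered_target_encoding(gender, churn))  # [0.05, 0.05, 0.525, 0.025]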
When it comes to regression, the process is similar but, rather than counting the number of times the combination positive label-category appeared before, it sums the target values and calculates a kind of moving average for that specific category.

Let's use the previous regression example and encode the air conditioning feature. The first row where the value is 1 will be encoded as 301.25, since the Sum Target for that category is 0 and we assume that the Prior is equal to the average of the target value in our training dataset. For the second row, the same happens. However, for the third row the Sum Target is 350k and n is also 1, which means that the encoded value will be (350+301.25)/(1+1) = 325.63k. Following the same logic, the encoded value for the fourth row is 317.08k.

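The same ordered logic applies to regression by summing the target instead of counting positive labels. Below is a sketch assuming an air conditioning column of [0, 1, 1, 1] and prices of 405k$, 350k$, 300k$ and 150k$, which reproduce the encoded values above:
def ordered_mean_encoding(categories, targets, prior):
    # same ordered logic as before, but summing the target instead of counting positive labels
    encoded = []
    for i, cat in enumerate(categories):
        previous = [t for c, t in zip(categories[:i], targets[:i]) if c == cat]
        encoded.append((sum(previous) + prior) / (len(previous) + 1))
    return encoded

air_conditioning = [0, 1, 1, 1]        # assumed ordering, inferred from the encoded values in the text
prices = [405, 350, 300, 150]          # house prices in k$ (assumed)
prior = sum(prices) / len(prices)      # 301.25, the average of the target

print(ordered_mean_encoding(air_conditioning, prices, prior))
# [301.25, 301.25, 325.625, 317.08...]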
CatBoost Ranker
I have mentioned previously that the loss function for CatBoost Classifier can be LogLoss or cross entropy and for CatBoost Regressor it can be RMSE, MAE or even Quantile. However, when it comes to CatBoost Ranker, the loss function must be a ranking function like YetiRank [5].
The equation of YetiRank is the following:

where the index (i, j) goes over all pairs of documents retrieved for a query, w_ij is the weight of a given pair of documents, and x_i and x_j are the predictions of the ranking function for documents i and j of that query. This gradient boosting optimization problem tries to minimize the difference between the actual ordering of the documents and the predicted one.
The weight w_ij plays the most important role in ranking the documents. It is calculated through 100 random permutations of the scores according to the following equation:

where r_i is a random variable uniformly distributed in [0, 1]. The documents are re-ranked according to the new scores, and a weight for every pair of consecutive documents is calculated using the formula 1/R, where R is the rank of the higher ranked document of the two.
After all 100 trials are complete, each pair of documents that occurred consecutively in at least one trial will have a non-zero accumulated weight; all other pairs of documents will have a weight of 0.
Considering the example in Table 6, we have a new score for each book after the 1st permutation. Those scores changed the ranking, putting book 8 in position 1, book 23 in position 2, book 2 in position 3 and keeping book 14 in position 4. Now, we calculate the weights for permutation 1, which will be 1/1 for the pair 8–23 and 1/3 for the pair 2–14. Following the same logic for permutation 100, which kept the original ranking, the weight for the pair 2–23 will be 1/1 and 1/3 for the pair 8–14.

After all permutations, the accumulated weight of each pair is the sum of the weights from all permutations. If a pair was not placed consecutively in any permutation, then it will have 0 weight.
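Below is a simplified sketch of how these pair weights could be accumulated. It perturbs the scores with plain uniform noise instead of the exact perturbation used in the YetiRank paper, and the book ids and scores are made up:
import random
from collections import defaultdict

def accumulate_pair_weights(doc_ids, scores, n_permutations=100, seed=42):
    # For each trial: perturb the scores, re-rank the documents, and give every pair of
    # consecutive documents a weight of 1/R, where R is the rank of the higher-ranked one.
    rng = random.Random(seed)
    weights = defaultdict(float)
    for _ in range(n_permutations):
        noisy = [(s + rng.uniform(0, 1), d) for s, d in zip(scores, doc_ids)]  # simplified noise
        ranked = [d for _, d in sorted(noisy, reverse=True)]
        for rank, (top, bottom) in enumerate(zip(ranked, ranked[1:]), start=1):
            weights[(top, bottom)] += 1.0 / rank
    return dict(weights)  # pairs never placed consecutively simply stay at weight 0

# books 2, 23, 8 and 14 with made-up relevance scores
print(accumulate_pair_weights([2, 23, 8, 14], [0.9, 0.7, 0.6, 0.2]))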

Recommender Systems with CatBoost Ranker
In this section, I will implement CatBoost Ranker on a public dataset from Kaggle under the CC0: Public Domain license. This dataset contains information about the user, the book and the rating given by the user to the book. We will try to rank the books based on their relevance for each user. After that, we use nDCG to compare our ranking with the actual ranking, which is given by sorting the books based on the ratings given by the user.
The final dataset is composed of:
- User-ID – identifies a user
- Location – comma separated string with the city, state and country of the user
- Age – user's age
- ISBN – book identifier
- Book-Rating – user's rating for a specific book
- Book-Title – title of the book
- Book-Author – author of the book
- Year-Of-Publication – year when the book was published
- Publisher – editor that published the book
We start by importing the libraries:
%matplotlib inline
import pandas as pd
from catboost import CatBoostRanker, Pool
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import random
from sklearn.metrics import ndcg_score
from category_encoders.cat_boost import CatBoostEncoder
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer
Then, we load the three datasets and we merge them into a single dataset:
books_df = pd.read_csv('data/Books.csv')
users_df = pd.read_csv('data/Users.csv')
ratings_df = pd.read_csv('data/Ratings.csv')
df = pd.merge(users_df, ratings_df, on='User-ID', how='left')
df = pd.merge(df, books_df, on='ISBN', how='left')
We perform some exploratory data analysis to identify issues with the data:
- Remove observations where the user did not read the book.
df = df[df['Book-Rating'].notnull()]
- Check missing values and replace missing Book-Author and Publisher with an unknown category.
print(df.columns[df.isna().any()].tolist())
df['Book-Author'] = df['Book-Author'].fillna('unknown')
df['Publisher'] = df['Publisher'].fillna('unknown')
- Remove observations with missing information about the book.
df = df[df['Book-Title'].notnull()]
- Replace non-integer Year-Of-Publication with null values.
df['Year-Of-Publication'] = pd.to_numeric(df['Year-Of-Publication'], errors='coerce')
- Check Age, Year-Of-Publication and Book-Rating distributions to identify anomalies.
plt.rcParams["figure.figsize"] = (20,3)
sns.histplot(data=df, x='Age')
plt.title('Age Distribution')
plt.show()
sns.histplot(data=df, x='Year-Of-Publication')
plt.title('Year-Of-Publication Distribution')
plt.show()
sns.histplot(data=df, x='Book-Rating')
plt.title('Book-Rating Distribution')
plt.show()

- Clean up the data by:
- Replacing age > 100 (which seems to be an error) with missing values to be handled by CatBoost.
- Replacing Year-Of-Publication <= 0 with the median value and limiting the upper bound to 2021, since that is when the dataset was published on Kaggle.
- Removing observations with rating = 0, since those books were read by the user but not rated.
- Creating 3 new features (city, state, country) from Location. Since city is too noisy, we will not use it.
df['Age'] = np.where(df['Age'] > 100, None, df['Age'])
df['Year-Of-Publication'] = np.where(df['Year-Of-Publication'] <= 0, np.nanmedian(df['Year-Of-Publication']), df['Year-Of-Publication']).clip(0, 2021).astype(str)
df = df[df['Book-Rating'] > 0]
df['city'] = df['Location'].apply(lambda x: x.split(',')[0].strip()) # too noisy, we will not use
df['state'] = df['Location'].apply(lambda x: x.split(',')[1].strip())
df['country'] = df['Location'].apply(lambda x: x.split(',')[2].strip())
With the dataset cleaned, we split the data into train, validation and test sets by randomly selecting 70% of the users for training, 10% for validation and 20% for testing.
# create list with unique users
users = df['User-ID'].unique()
# shuffle list
random.shuffle(users)
# create list of users to train, to validate and to test
train_users = users[:int(0.7*len(users))]
val_users = users[int(0.7*len(users)):int(0.8*len(users))]
test_users = users[int(0.8*len(users)):]
# train, val and test df
train_df = df[df['User-ID'].isin(train_users)]
val_df = df[df['User-ID'].isin(val_users)]
test_df = df[df['User-ID'].isin(test_users)]
After that, we apply some pre-processing:
- We create embeddings with a multilingual encoder for the textual feature Book-Title and reduce their dimensionality using PCA, keeping 80% of the explained variance. We use a multilingual encoder because the titles are written in different languages.
# model to create embeddings
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# train, val and test embeddings
training_books = train_df.loc[:, ["ISBN", "Book-Title"]].drop_duplicates()
val_books = val_df.loc[:, ["ISBN", "Book-Title"]].drop_duplicates()
test_books = test_df.loc[:, ["ISBN", "Book-Title"]].drop_duplicates()
train_embeddings = model.encode(training_books["Book-Title"].tolist(), normalize_embeddings=True)
val_embeddings = model.encode(val_books["Book-Title"].tolist(), normalize_embeddings=True)
test_embeddings = model.encode(test_books["Book-Title"].tolist(), normalize_embeddings=True)
# reduce dimensionality with PCA
pca = PCA(n_components=0.8, random_state=42)
train_embeddings = pca.fit_transform(train_embeddings)
val_embeddings = pca.transform(val_embeddings)
test_embeddings = pca.transform(test_embeddings)
# join embeddings to dfs
train_embeddings_df = pd.DataFrame(train_embeddings)
train_embeddings_df.columns = [f"Book-Title_{i}" for i in train_embeddings_df.columns]
training_books = pd.merge(training_books.reset_index(drop=True), train_embeddings_df, left_index=True, right_index=True)
train_df = pd.merge(train_df, training_books, on=["ISBN", "Book-Title"])
val_embeddings_df = pd.DataFrame(val_embeddings)
val_embeddings_df.columns = [f"Book-Title_{i}" for i in val_embeddings_df.columns]
val_books = pd.merge(val_books.reset_index(drop=True), val_embeddings_df, left_index=True, right_index=True)
val_df = pd.merge(val_df, val_books, on=["ISBN", "Book-Title"])
test_embeddings_df = pd.DataFrame(test_embeddings)
test_embeddings_df.columns = [f"Book-Title_{i}" for i in test_embeddings_df.columns]
test_books = pd.merge(test_books.reset_index(drop=True), test_embeddings_df, left_index=True, right_index=True)
test_df = pd.merge(test_df, test_books, on=["ISBN", "Book-Title"])
- Since, in CatBoost Ranker, categorical features are encoded using one-hot encoding, we use CatBoostEncoder from category_encoders to force the regression-style target encoding explained before.
CATEGORICAL_FEATURES = ['Book-Author', 'Year-Of-Publication', 'Publisher', 'state', 'country']
TARGET = ['Book-Rating']
encoder = CatBoostEncoder()
train_df[CATEGORICAL_FEATURES] = encoder.fit_transform(train_df[CATEGORICAL_FEATURES], train_df[TARGET])
val_df[CATEGORICAL_FEATURES] = encoder.transform(val_df[CATEGORICAL_FEATURES])
test_df[CATEGORICAL_FEATURES] = encoder.transform(test_df[CATEGORICAL_FEATURES])
Now, we define the CatBoost Pool for training, validation and testing datasets with:
- the data we want to use together with the features
- the label
- the group id column that should be used to calculate the YetiRank loss, i.e., for each user, CatBoost will calculate the best ranking of books for that user.
# features
EXCLUDE_FEATURES = ['city', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L', 'User-ID', 'ISBN', 'Location', 'Book-Title', 'Book-Rating']
FEATURES = [feat for feat in train_df.columns if feat not in EXCLUDE_FEATURES]
# CatBoost Ranker requires sorting by group_id, i.e., rows with the same group_id must be placed consecutively in the dataset
train_df = train_df.sort_values(by='User-ID')
val_df = val_df.sort_values(by='User-ID')
test_df = test_df.sort_values(by='User-ID')
# training pool
train_pool = Pool(
data=train_df[FEATURES],
label=train_df[TARGET],
group_id=train_df["User-ID"].tolist(), # YetiRank applied to User-ID as query and books as documents
)
# validation pool
val_pool = Pool(
data=val_df[FEATURES],
label=val_df[TARGET],
group_id=val_df["User-ID"].tolist(), # YetiRank applied to User-ID as query and books as documents
)
# testing pool
test_pool = Pool(
data=test_df[FEATURES],
group_id=test_df["User-ID"].tolist(), # YetiRank applied to User-ID as query and books as documents
)
Finally, it is just a matter of training, predicting and calculating the nDCG score. To calculate nDCG, I chose a customer with more than one rated book and with different ratings for each book, to enable a proper ranking without ties.
# train and predict
model = CatBoostRanker(loss_function="YetiRank", verbose=100)
model.fit(train_pool, eval_set=val_pool, early_stopping_rounds=100)
test_df["score"] = model.predict(test_pool)
# score
true_relevance = np.asarray([test_df[test_df['User-ID'] == 3542][TARGET[0]].tolist()])
y_relevance = np.asarray([test_df[test_df['User-ID'] == 3542]['score'].tolist()])
ndcg_score(true_relevance, y_relevance)
The nDCG for user 3542 is 0.99553213, which is pretty good!
From the feature importance chart, we can see that state, the embeddings we created from the Book-Title and the Publisher are the features with the most impact on the ranking prediction.
feature_importance = model.get_feature_importance(data=train_pool, verbose=0)
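The plotting code is not shown above, but a possible way to visualize these importances, reusing the FEATURES list and the imports from the beginning of the article, would be:
# sort features by importance and plot them as a horizontal bar chart
importance_df = pd.DataFrame({'feature': FEATURES, 'importance': feature_importance}).sort_values('importance', ascending=False)
sns.barplot(data=importance_df, x='importance', y='feature')
plt.title('Feature Importance')
plt.show()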

Conclusion
In conclusion, the importance of recommender systems for organizations cannot be ignored. These intelligent algorithms not only enhance user experiences by providing personalized recommendations but also empower organizations to make data-driven decisions that drive growth and innovation.
In this article, we showed how Gradient Tree Boosting works for different problems and how it can be implemented for recommender systems. This algorithm not only provides powerful recommendations but can also be combined with interpretable solutions like SHAP values to understand why a given recommendation was made.
Keep in touch: LinkedIn, Medium
References
[1] Zhiyuan He, Danchen Lin, Thomas Lau, Mike Wu. Gradient Boosting Machine: A Survey. arXiv:1908.06951, 2019.
[2] Yoav Freund, Robert E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55, 119–139, 1997.
[3] Jerome H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics 29(5), 1189–1232, 2001. http://www.jstor.org/stable/2699986.
[4] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, Andrey Gulin. CatBoost: unbiased boosting with categorical features. arXiv:1706.09516, 2017.
[5] Andrey Gulin, Igor Kuralenok, Dmitry Pavlov. Winning the Transfer Learning Track of Yahoo!'s Learning to Rank Challenge with YetiRank. JMLR: Workshop and Conference Proceedings 14, 63–76, 2011.