Feature Importance Analysis with SHAP I Learned at Spotify (with the Help of the Avengers)

Author: Murphy

This article is part of a two-part series documenting what I learned during my Machine Learning Thesis at Spotify. Be sure to also check out the second article, on how I significantly optimized my model for this research.

Boosting Model Accuracy: Techniques I Learned During My Machine Learning Thesis at Spotify (+Code…

Two years ago, I conducted a fascinating research project at Spotify as part of my Master's Thesis. I learned several useful ML techniques, which I believe any Data Scientist should have in their toolkit. And today, I'm here to walk you through one of them.

During that time, I spent 6 months trying to build a prediction model and then deciphering its inner workings. My goal was to understand what made users satisfied with their music experience.

It wasn't so much about predicting whether a user was happy (or not), but rather understanding the underlying factors that contributed to their happiness (or lack thereof).

Sounds exciting, right? It was! I loved every bit of it because I learned so much about how ML can be applied in the context of music and user behavior.

(If you're interested in the applications of ML in the music industry, then I highly recommend checking out this interesting research led by Spotify's top experts. It's a must-read!)


Machine Learning & Behavioral Psychology in Tech

Image by Author (Midjourney)

In tech, research projects like mine are very common because a lot of the work revolves around delivering the best personalized experience for users/customers.

This often means delving into the human psyche, and ML can be a powerful tool for achieving the impossible – understanding humans.

When we combine ML with Psychology and Behavioral Sciences, we get closer to having a complete picture of how humans behave.

How?

We build models that try to predict how people will react.

And sometimes we try to understand why the model predicted that reaction in the first place. It's like asking the model – "Hey, why do you think users behave this way?"

The answer lies in finding which variables of the model had the most weight in predicting the outcome, and then **understanding their individual impact** on the prediction result.


In my research, I did what we call a feature importance analysis and used a powerful tool called SHAP to do that.

In this post, I will explain to you:

  1. What Shapley values, aka SHAP, are
  2. Why you need to know how to use SHAP
  3. How to use SHAP

Welcome to interpreting ML models!


The Dilemma Data Scientists Can't Escape

When you're dealing with models such as LightGBM, which is the one I was using in my research, know that you're tackling a specific type of model – a ‘black box'.

Like a villain straight out of a DC Comic, a black box model is something you should be scared of.

Why? Because deciphering one is like beating up the Joker while he's giggling in your face. Rest assured, he won't spill the beans easily.

In many industries, regulations and processes require that you explain how you reached your results.

In Tech, knowing how to explain your prediction results is important for gaining trust and for understanding the inner workings of your model. The latter was the case in my research.

Interpretability vs Complexity

When picking the right model for your project, you will have to think about many factors, including:

  • Interpretability – Can you explain how your model did the magic? How it made its decisions? Linear regression models, for example, are the definition of transparency: you can easily trace back the impact of each feature on the model's output (see the short sketch after this list). Interpretability is also often called Explainability.
  • Complexity – How sophisticated is the architecture or representation of your model? Neural networks can predict wonders, but understanding how they captured the relationships between the features will have you pulling your hair out. Complexity is also often associated with high predictive capability.
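
As a quick illustration of that transparency (a toy example of my own, not data from the thesis), the coefficients of a fitted linear regression can be read directly as feature effects:

import numpy as np
from sklearn.linear_model import LinearRegression

#Toy data: the target depends on two made-up features with known weights
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.1, size=500)

model = LinearRegression().fit(X, y)
print(model.coef_)  #≈ [3.0, -1.5]: each coefficient is that feature's direct effect on the output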

White or Black Box?

As you can see, these two do not align at all. A complex model will most likely come at the cost of interpretability. So you'll have to be careful when deciding which type of model to build, based on your goals:

Image by Author (Midjourney)
  • White Box – Models that are easily interpretable/explainable because they give you a clear image of the relationships between features. You can go behind the scenes and understand how the model made its prediction.

  • Black Box – Models that can yield accurate results but that are also highly complex. They are difficult to interpret. Like a black hole, we absolutely have no freaking idea what's going on inside. So let's just call them BBs, makes them sound less scary.

Black box models are often the ones we use in Tech to achieve powerful and accurate results. That's why knowing how to interpret them is important for your career as a Data Scientist.


Step 1 – Good Prediction Results

To evaluate what factors were driving satisfaction, I had to:

  1. Choose a metric that could be used as a proxy for describing user satisfaction.
  2. Pick the right ML model – in my case LightGBM*.
  3. Find and create as many relevant features as possible for my model.
  4. Pick an accuracy metric to evaluate the performance of my model – in my case ROC AUC score.
  5. Optimize the performance of my model to make sure it was predicting this satisfaction proxy reasonably accurately.

*LightGBM is a decision-tree-based framework that combines gradient boosting and ensemble methods to tackle complex problems. Ensemble methods improve accuracy because each tree has a different view of the data and applies a different decision-making approach. Gradient boosting builds weak decision trees sequentially: each new tree learns from where the previous trees struggled by correcting the difference between the predicted value and the true value, called the ‘residuals'. In other words, each tree is fitted on the gradients of the errors made so far.
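
To make the residual idea concrete, here is a minimal sketch of one boosting round done by hand with plain decision trees (a toy illustration of the principle, not how LightGBM is implemented internally):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

#Toy regression data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

#First weak tree fits the target directly
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)          #what the first tree got wrong

#Second weak tree fits those residuals, i.e. corrects the first tree's mistakes
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

#The ensemble prediction is the sum of the trees' contributions
y_pred = tree1.predict(X) + tree2.predict(X)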

Here's what my ML pipeline looked like

Due to the confidentiality of this research, I cannot share specific information, but I'll do my very best for it not to sound like something else you need to decipher. Let's leave that to SHAP.

In my research, I built a LightGBM classifier, aka a BB model, that output a binary outcome:

y = 1 → the user is seemingly satisfied
y = 0 → not so much

My goal was to understand why listeners were feeling how they were feeling, rather than simply figuring out how they were feeling.

At first, my ROC AUC score was around 0.5, which is essentially the worst score you can get on a binary classifier: it means the model ranks a randomly chosen satisfied user above a randomly chosen unsatisfied one only 50% of the time – no better than flipping a coin.
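
To see why 0.5 means "random", here is a quick toy check (made-up numbers, not thesis data): scoring completely random predictions against random labels lands right around 0.5.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=10_000)         #made-up ground-truth labels
random_scores = rng.uniform(0, 1, size=10_000)   #"predictions" drawn at random

print(roc_auc_score(y_true, random_scores))      #≈ 0.5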

So after spending 2 months trying to improve the prediction of my model, I was finally able to reach a satisfying result.

Again, be sure to check out how I did that in the article below!

Boosting Model Accuracy: Techniques I Learned During My Machine Learning Thesis at Spotify (+Code…

Only then could I finally start digging into how my model made its predictions, using feature importance. So let's dive right in!


Step 2 – Feature Importance

The SH-Avengers to the Rescue

So let's rewind.

In my research, I wanted to evaluate which key factors were driving user satisfaction. Since my model was a BB, the only way I could do that was through a feature importance analysis.

Picture this – the Avengers fighting together to save the world. How would you actually know which one was the most powerful in the rescue? (We all know it's Iron Man, but let's just pretend we don't!)

This is what SHapley Additive exPlanations, or SHAP values, do in the imaginary world of Machine Learning.

Photo by Mateusz Wacławek on Unsplash

What are SHAP values?

SHAP values are built on Shapley values, a concept from cooperative game theory.

They measure how each feature – the Avengers – contributed to the model's final decision – saving the world or destroying it.

In the context of feature importance, the Avengers are the individual features, and the rescue of the world is the prediction made by the model.

How does it work?

The idea is to consider all possible combinations of features and measure the change in the model's prediction when a specific feature is included or excluded.

By comparing these different combinations, Shapley values assign a value to each feature that represents its contribution to the prediction.

The key mathematical concept at play is permutation. We consider different permutations of features and calculate their marginal contributions. By considering all possible permutations and averaging the contributions, we arrive at the Shapley value for each feature.
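
For reference, this averaging over all orderings boils down to the standard Shapley value formula from game theory, where N is the set of all features, S is any subset that excludes feature i, and v(S) is the model's prediction when only the features in S are used:

$$\phi_i \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,\Bigl(v\bigl(S \cup \{i\}\bigr) - v(S)\Bigr)$$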

By doing so, we get to understand how our model is ‘saving the day' – or making its predictions!

To be more specific, SHAP values help us understand:

  1. What were the top features that influenced the prediction results? In our case, how powerful each Avenger was in saving the world. Imagine the fist of the Hulk in your face – that thing will send you flying all the way into the multiverse.

  2. _How did these different features affect the prediction results?_ Aka the impact of each Avenger's power on saving the world. I mean, see how Wanda, the Scarlet Witch, was so strong – but so strong at destroying half the world! Sometimes the hero is a villain (or just a bad hero), so it's important to find out which side they are on.

Double Interpretability

Shapley values are particularly powerful because they provide two complementary views into the interpretability of BB models:

  1. Globally. By giving a general overview of the predictive power of each feature, aggregated across all users.
  2. Locally. SHAP values can be calculated for each user to explore how features influenced the prediction result for that specific user (a minimal sketch follows below).
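
As a minimal sketch of the local view (the names clf and X_test are placeholders for a fitted tree-based model and its feature matrix, like the ones we build in the next section, and the exact plotting API can differ a bit between shap versions):

import shap

#clf: a fitted tree-based model (e.g. LightGBM); X_test: its feature matrix
explainer = shap.TreeExplainer(clf)
shap_exp = explainer(X_test)        #one row of SHAP values per user

#Local interpretability: how each feature pushed this one user's prediction up or down
shap.plots.waterfall(shap_exp[0])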

How about Joining the Avengers? Let's!

Image by Author

Keep in mind all the data used below are pure examples to preserve the confidentiality of this research.

1. Encode your variables

Make sure your variables are encoded:

  • Ordinal features, so that the model preserves the ordinal information
  • Categorical features, so that the model can interpret nominal data

So first, let's store our variables somewhere. Again, because the research is confidential, I cannot disclose the data I used, so let's use these instead:

region = ['APAC', 'EU', 'NORTHAM', 'MENA', 'AFRICA']
user_type = ['free', 'premium']

ordinal_cols = ['region', 'user_type']   #names of the ordinal columns to encode
ordinal_list = [region, user_type]       #category order for each of those columns

Then, make sure to build the function that encodes the variables:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

def var_encoding(X, cols, ordinal_list, encoding):

    #Encode ordinal variables, preserving the category order given in ordinal_list
    if encoding == 'ordinal_ordered':
        encoder = OrdinalEncoder(categories=ordinal_list)
        encoder.fit(X.loc[:, cols])
        X.loc[:, cols] = encoder.transform(X.loc[:, cols])

    #Encode categorical (nominal) variables with arbitrary integer codes
    elif encoding == 'ordinal_unordered':
        encoder = OrdinalEncoder()
        encoder.fit(X.loc[:, cols])
        X.loc[:, cols] = encoder.transform(X.loc[:, cols])

    #Otherwise one-hot encode, replacing the original columns with dummy columns
    else:
        encoder = OneHotEncoder(handle_unknown='ignore')
        encoded = encoder.fit_transform(X.loc[:, cols]).toarray()
        encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(cols), index=X.index)
        X = pd.concat([X.drop(columns=cols), encoded_df], axis=1)

    return X
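
As a quick sanity check, here's how the function behaves on a tiny made-up dataframe (toy values only):

import pandas as pd

toy = pd.DataFrame({'region': ['EU', 'APAC'], 'user_type': ['free', 'premium']})
encoded = var_encoding(toy, ordinal_cols, ordinal_list, 'ordinal_ordered')
#'region' is now 1 / 0 (its position in the ordered region list), 'user_type' is 0 / 1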

Then apply that function to your lists of variables. This means you need to create lists containing the names of your variables as strings: one for your ordinal variables, one for the categorical ones, and one for the numerical ones.

def encoding_vars(X, ordinal_cols, ordinal_list, categorical_cols=None, preprocessing_categoricals=False):

    categorical_cols = categorical_cols or []

    #Encode ordinal variables
    X = var_encoding(X, ordinal_cols, ordinal_list, 'ordinal_ordered')

    #Encode categorical variables as unordered integer codes
    if preprocessing_categoricals and categorical_cols:
        X = var_encoding(X, categorical_cols, ordinal_list, 'ordinal_unordered')

    #Else set your categorical variables as 'category' so the model can handle them natively
    else:
        for cat in categorical_cols:
            X[cat] = X[cat].astype('category')

    #Rename your variables if needed, to keep track of what the encoded values mean
    #An encoded feature such as user_type will no longer show 'free' or 'premium', but 0 or 1
    X.rename(columns={'user_type': 'free_0_premium_1'}, inplace=True)
    X.reset_index(drop=True, inplace=True)

    return X

2. Prepare your data

Split your dataframe to get your train, validation, and test sets:

  • Train Set – to train the model with the algorithm you pick, e.g. LightGBM
  • Validation Set – to hyper-tune your parameters and optimize your prediction results
  • Test Set – to make your final predictions

In my research, I used GroupShuffleSplit, which creates a user-defined number of independent train-validation splits by randomly assigning entire groups (here, users) to either the training or the validation set.

from sklearn.model_selection import GroupShuffleSplit

def split_df(df, ordinal_cols, ordinal_list, target):

    #Splitting train and test
    splitter = GroupShuffleSplit(test_size=.13, n_splits=2, random_state=7)

    #If many rows belong to the same user, split by user_id so the same user never lands in both sets
    split = splitter.split(df, groups=df['user_id'])
    train_inds, test_inds = next(split)

    train = df.iloc[train_inds]
    test = df.iloc[test_inds]

    #Splitting val and test data
    splitter2 = GroupShuffleSplit(test_size=.5, n_splits=2, random_state=7)
    split = splitter2.split(test, groups=test['user_id'])
    val_inds, test_inds = next(split)

    val = test.iloc[val_inds]
    test = test.iloc[test_inds]

    #Defining your X (predictive features) and y (target feature)
    X_train = train.drop([target], axis=1)
    y_train = train[target]

    X_val = val.drop([target], axis=1)
    y_val = val[target]

    X_test = test.drop([target], axis=1)
    y_test = test[target]

    #Encoding the variables in the sets
    X_train = encoding_vars(X_train, ordinal_cols, ordinal_list)
    X_val = encoding_vars(X_val, ordinal_cols, ordinal_list)
    X_test = encoding_vars(X_test, ordinal_cols, ordinal_list)

    return X_train, y_train, X_val, y_val, X_test, y_test

Then apply the function to your dataframe to get your train, validation, and test sets:

X_train, y_train, X_val, y_val, X_test, y_test = split_df(df, ordinal_cols, ordinal_list, target='target_feature')

3. Train your model

Here, I'm assuming you've already:

  • Cleaned and preprocessed your data
  • Hyper-tuned your parameters
  • Optimized your model

import lightgbm as lgb
from sklearn.metrics import roc_auc_score

#Build your model ('best_hyperparameters' is the dict you obtained from hyper-parameter tuning)
clf = lgb.LGBMClassifier(objective='binary', max_depth=-1, random_state=314, metric='auc',
                         n_estimators=5000, num_threads=16, verbose=-1,
                         **best_hyperparameters)

#Fit your model, tracking ROC AUC on the validation set
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric='auc')

#Evaluate on the test set (ROC AUC needs predicted probabilities, not hard labels)
roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

4. Feature Importance using SHAP

And here's the moment you've long been waiting for!

The SHAP package has many different types of visualizations, depending on whether you want to have a global interpretation (all users aggregated) or a local one (per user). In my research, I focused on the global interpretation of my results as I did not care about a specific user.

I used one specific type of chart, which I find sufficient to visualize the impact of your features – summary plots (see the code sketch further below). You will get:

  • Ranking of your features based on their predictive weights.
S1 – Image by Author
  • Directional impact of each feature, i.e. whether a feature pushes the prediction towards happiness or the opposite.
S2 – Image by Author

NB: Features colored in grey are nominal features that weren't encoded, so if you care about these, encode them too!
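
For completeness, here is a minimal sketch of how such summary plots can be produced with the shap package, using the clf and X_test from the training step (the exact call signatures may vary slightly between shap versions):

import shap

#Compute SHAP values for every user/feature pair in the test set
explainer = shap.TreeExplainer(clf)
shap_exp = explainer(X_test)

#S1-style plot: feature ranking by mean absolute SHAP value
shap.summary_plot(shap_exp.values, X_test, plot_type='bar')

#S2-style plot: directional impact of each feature on the prediction
shap.summary_plot(shap_exp.values, X_test)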
