Building and Evaluating Customer Churn Classification Models with Tidymodels

When I first learned to build models, eons ago, there were many approaches to constructing models across different packages, each with its own parameter names. Then, when I started using tidymodels, I was pleasantly surprised at how efficiently it allowed me to write my model-building code across various scenarios and engines in a consistent manner. This meant that I no longer had to remember a hundred different formats and parameters for different models, and it became much easier to compare the outputs.
In this article, I am going to take the example of a very common customer churn prediction scenario to guide you through the standardized format of building models the tidy way and comparing results.
Code
Code for everything in this article can be found at my GitHub Repo.
Let's get started with modeling
Step 0: Set up the environment
Install the required packages either by running individual commands such as install.packages("package_name"), or load everything by running the command below, which installs each package only if it is not already installed and then loads it.
# Loading packages (caret is included because confusionMatrix() is used later on)
packages <- c("tidyverse", "tidymodels", "skimr", "GGally", "caret")

package.check <- lapply(packages, FUN = function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x, dependencies = TRUE)
    library(x, character.only = TRUE)
  }
})
Step 1: The dataset
For this exercise, I use the Binary Classification of Bank Churn Synthetic Data dataset by Simarpreet Singh, available on Kaggle under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
The dataset includes a column, "Exited", that denotes whether or not the customer left; this is what I will predict. I will first load the dataset and take a glimpse of what it looks like:
bankchurn_df <- read.csv("./data/bank_churn.csv")

bankchurn_df |>
  glimpse()

Step 2: Data cleaning
I will remove the columns that contain the term 'Surname' because they won't be relevant as features for the model. I will also convert the output variable, Exited, to a factor for binary classification.
bankchurn_df_upd <- bankchurn_df |>
  select(Exited, everything()) |>
  mutate(Exited = as.factor(Exited)) |>
  select(-contains("Surname"))
Step 3: Exploratory Data Analysis
I will use the skim() command to get an initial understanding of the data before diving into specific columns. It summarizes the distribution of each column, which helps me see that not all variables are on the same scale.
bankchurn_df_upd |>
  skim()

Now, I will examine a few predictor variables to understand their relationship with the output variable using ggpairs().
bankchurn_df_upd |>
  select(Exited, CreditScore, Age, Tenure, Balance) |>
  ggpairs(mapping = aes(colour = Exited, alpha = 0.3)) +
  scale_fill_manual(values = c('darkgreen', 'red')) +
  scale_colour_manual(values = c('darkgreen', 'red'))

This package makes it quick to understand the relationships between variables. I will now check the relationships between Gender, Location, and the output variable as well.
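A sketch of that check, assuming the categorical columns in this dataset are named Gender and Location:
# Relationships between the categorical predictors and the outcome
# (assumes the columns are named Gender and Location)
bankchurn_df_upd |>
  select(Exited, Gender, Location) |>
  ggpairs(mapping = aes(colour = Exited, alpha = 0.3)) +
  scale_fill_manual(values = c('darkgreen', 'red')) +
  scale_colour_manual(values = c('darkgreen', 'red'))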

The correlation parameters highlight if there are any strong relationships among predictor variables as well.
Let's move to the modeling phase now.
Step 4: Train/Test split
As in any modeling exercise, my first step is to hold out a random test set on which to evaluate the trained models. I will use the initial_split() function for this, set a seed for reproducibility, and access the splits using the training() and testing() functions.
set.seed(123)
bc_split <- initial_split(bankchurn_df_upd, prop = 3/4, strata = "Exited")
train_data <- training(bc_split)
test_data <- testing(bc_split)
Step 5: Feature engineering
Based on the exploratory data analysis done previously, I will do some basic feature engineering, which recipe() makes very easy. It helps create a repeatable set of steps without having to write detailed code for common feature engineering tasks. In this scenario, I want to:
- Turn nominal variables to dummy variables, keeping aside the output.
- Remove any variables that may contain only a single value or have zero variance.
- Normalize the numeric predictor variables since some of them are on different scales.
bc_recipe <- recipe(Exited ~ ., data = train_data) |>
  step_dummy(all_nominal(), -all_outcomes()) |>
  step_zv(all_numeric()) |>
  step_normalize(all_numeric())

# Prep and bake the recipe to inspect the engineered training data
bc_recipe |>
  prep() |>
  bake(new_data = NULL)

Step 6: Create model specifications and workflows
I will create two model specifications and workflows, which will showcase how parsnip within tidymodels allows for a standardized approach to creating these across various models.
Logistic regression:
lr_mod <- logistic_reg() |>
  set_mode("classification") |>
  set_engine("glm")

lr_mod

I will now create a workflow that combines the recipe with the model for training.
lr_workflow <-
  workflow() |>
  add_model(lr_mod) |>
  add_recipe(bc_recipe)

lr_workflow

Random forest:
I will repeat the steps above for the random forest model. Note the similar structure for creating the model specification and workflow.
rand_forest_ranger_model <- rand_forest(
  mode = "classification", mtry = 10, trees = 500, min_n = 20) |>
  set_engine("ranger", importance = "impurity")

rand_forest_ranger_model

rf_workflow <- workflow() |>
  add_model(rand_forest_ranger_model) |>
  add_recipe(bc_recipe)

rf_workflow


Step 7: Fit the data
I will use the fit() function to fit each workflow to the training dataset.
lr_fit <-
  lr_workflow |>
  fit(data = train_data)

rf_fit <-
  rf_workflow |>
  fit(data = train_data)
Step 8: Feature importance
For linear models, the tidy() function provides easy access to the components of the previously fitted model. Once I have these, I will arrange the predictor variables by a metric of interest, in this case the p-value.
lr_fit |>
  extract_fit_parsnip() |>
  tidy() |>
  arrange(p.value)
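If you prefer odds ratios over log-odds coefficients, tidy() can exponentiate the estimates; a quick optional variant:
# Optional: exponentiate the coefficients so they read as odds ratios
lr_fit |>
  extract_fit_parsnip() |>
  tidy(exponentiate = TRUE) |>
  arrange(p.value)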

For the random forest, I will use extract_fit_parsnip() to extract the fit, and then use the ranger::importance() command to extract the features and their importance values.
extract_fit_parsnip(rf_fit)$fit |>
  ranger::importance() |>
  enframe() |>
  arrange(desc(value))

Step 9: Predicting on the test dataset
For both models, I will use the predict() function on the fitted workflows to generate predictions on the test dataset. For binary classification, this returns the predicted class (0 or 1) by default. To get probabilities instead, I set the type parameter to "prob".
# Logistic regression
class_pred_lr <- predict(lr_fit, test_data)
prob_pred_lr <- predict(lr_fit, test_data, type = "prob")
# Random forest
class_pred_rf <- predict(rf_fit, test_data)
prob_pred_rf <- predict(rf_fit, test_data, type = "prob")
Here is an example of what the class and probability outputs look like:
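If you want to inspect these outputs directly in the console, a quick peek works as well:
head(class_pred_lr)   # one column: .pred_class
head(prob_pred_lr)    # two columns: .pred_0 and .pred_1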

To access both of these within the same table, I will combine the two types of predictions into a single data frame, since both kinds of output can be relevant depending on the business scenario. I will then bind these outputs to the original test dataset, so that I can see the predicted class, the probabilities, and the original "Exited" values in the same table to compare results. Below is what the final combined table looks like.
# Logistic regression
lr_preds_combined <-
  data.frame(class_pred_lr, prob_pred_lr) |>
  select(class = .pred_class, prob_no = .pred_0, prob_yes = .pred_1) |>
  bind_cols(test_data)

# Random forest
rf_preds_combined <-
  data.frame(class_pred_rf, prob_pred_rf) |>
  select(class = .pred_class, prob_no = .pred_0, prob_yes = .pred_1) |>
  bind_cols(test_data)

Now that I have the predictions, let's move on to measuring accuracy for these models.
Step 10: Understanding performance metrics
ROC Curve
The ROC curve shows the performance of a binary classification method at different threshold values. It plots the true positive rate (TPR = TP / (TP + FN)) against the false positive rate (FPR = FP / (FP + TN)). I will get the values for the ROC curve using the roc_curve() function.
# Logistic regression
lr_roc <- lr_preds_combined |>
  roc_curve(truth = Exited, prob_no) |>
  mutate(model = "Logistic Regression")

# Random forest
rf_roc <- rf_preds_combined |>
  roc_curve(truth = Exited, prob_no) |>
  mutate(model = "Random forest")

# Combine ROC curve values
lr_roc |>
  bind_rows(rf_roc) |>
  glimpse()

I could use the autoplot() function to visualize the ROC curve, but in this case I will plot it from scratch with ggplot2 so that both models' ROC curves appear on the same plot.
lr_roc |>
  bind_rows(rf_roc) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_line() +
  geom_abline(lty = 2) +
  labs(y = "True Positive Rate",
       x = "False Positive Rate",
       title = "ROC curve") +
  theme_bw()
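To summarize each curve as a single number, yardstick's roc_auc() computes the area under the curve; a quick check on the same combined prediction tables:
# Area under the ROC curve for each model (higher is better)
lr_preds_combined |>
  roc_auc(truth = Exited, prob_no)

rf_preds_combined |>
  roc_auc(truth = Exited, prob_no)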

Confusion matrix
A confusion matrix is a table that summarizes the performance of a classification model. The caret package (a predecessor of tidymodels, not part of it) provides the confusionMatrix() function, which reports a lot of useful information in one place. I will use it to compare metrics for both models before selecting the final one.
# Logistic regression
caret::confusionMatrix(lr_preds_combined$class,
                       lr_preds_combined$Exited,
                       positive = "1")

# Random forest
caret::confusionMatrix(rf_preds_combined$class,
                       rf_preds_combined$Exited,
                       positive = "1")
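If you would rather stay entirely within tidymodels, yardstick's conf_mat() combined with summary() gives a comparable set of metrics; a minimal sketch, treating "1" (churned) as the event of interest:
# tidymodels-native alternative: confusion matrix plus derived metrics,
# with the churn class ("1") as the event level
lr_preds_combined |>
  conf_mat(truth = Exited, estimate = class) |>
  summary(event_level = "second")

rf_preds_combined |>
  conf_mat(truth = Exited, estimate = class) |>
  summary(event_level = "second")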

The random forest model performs better across most metrics, including accuracy and specificity, while logistic regression has slightly higher sensitivity. Random forest, however, can be more computationally expensive. So, depending on how much the gains in accuracy, specificity, or sensitivity matter for the specific business scenario, either model could be the better choice.
Next steps
I hope this article helped you understand the powerful, consistent approach that the family of packages within the tidymodels framework provides. While I only covered two models to showcase its value, many other models can be built using similar syntax.
Another way to make these models even more trustworthy would be to cross-validate them using the tools available in tidymodels. I will cover that in another article.
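As a preview, here is a minimal sketch of what that could look like with vfold_cv() and fit_resamples(), reusing the logistic regression workflow from above:
# A minimal cross-validation sketch: 5-fold CV on the training data
set.seed(123)
bc_folds <- vfold_cv(train_data, v = 5, strata = Exited)

lr_cv_results <- lr_workflow |>
  fit_resamples(resamples = bc_folds)

# Average performance metrics across the folds
collect_metrics(lr_cv_results)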
Code for everything in this article can be found at my GitHub Repo. If you'd like, find me on LinkedIn.
All images in this article are by the author, unless mentioned otherwise.