Mastering Customer Segmentation using Credit Card Transaction Data
Customer segmentation is the process of identifying customer segments based on historical purchasing patterns. For example, it can involve identifying repeat/loyal customers, high spending customers, customers that make one time or infrequent purchases and much more. Segments can be created using information like frequency of purchases, transaction amounts, purchase dates and more. All of these characteristics can be used to generate well defined clusters with easy to interpret characteristics.
The characteristics of these segments can provide a great deal of information and business value for companies. For example, companies can use these segments to increase revenue through targeted promotional messaging, customer retention campaigns, loyalty programs, product cross selling and more. Companies can target customer segments with tailored messaging that is relevant to each segment. Further, these segments can provide information about which channels customers are most responsive to, whether it's email, social media or external applications. Companies can also perform upselling or cross-selling using consumer segments. For example, they can offer high priced options for frequently purchased items or complementary products to previously purchased items. All of these tactics can be used to increase revenue and customer retention.
There are a wide variety of techniques used for customer segmentation. One popular technique for generating customer segments is recency frequency and monetary value (RFM) scores.
Recency, Frequency and Monetary
- Recency is the number of days between the last purchase made by a customer and a reference date, usually the current or max date available in the data.
- Frequency is the number of purchases between the date of their last purchase and the current or max date available in the data.
- Monetary is the total amount of money spent between the date of their first and last purchases.
You can use these values to construct RFM Scores which can be used to segment and identify high and low value customers. These scores can be used for a wide variety of of business use cases included custom marketing, churn analysis, price optimization and more.
Here we will see how to calculate RFM scores using a credit card transaction dataset. For our purposes we will be working with the Synthetic Credit Card Transaction data available on DataFabrica. The data contains synthetic credit card transaction amounts, credit card information, transaction IDs and more. The free tier is free to download, modify, and share under the Apache 2.0 license.
For this work, I will be writing code in Deepnote, which is a collaborative Data Science notebook that makes running reproducible experiments very easy.
Exploring the Data
To start, let's navigate to Deepnote and create a new project (you can sign-up for free if you don't already have an account).
Let's install necessary packages:
And import the packages we will be using:
Next, let's read our data into a pandas dataframe and display the first five rows of data:
Next let's filter our dataframe to only include customers that purchased Chick-fil-A:
Next we can look at the number for customers by state. To do this we need to map the merchant_state to state abbreviations, which will allow us to plot the customer counts for each state in Plotly:
Next let's map the state abbreviations and define our state_count table. We do this by performing a groupby nunique()
operation for each cardholder for each state:
Python">df['merchant_state_abbr'] = df['merchant_state'].map(state_abbreviations)
state_counts = df.groupby('merchant_state_abbr')['cardholder_name'].nunique().reset_index()
state_counts.columns = ['State', 'Customer_Count']
Next we can use the Plotly Express chloropleth
method to generate a geoplot of customer counts for each state:
fig = px.choropleth(state_counts, locations='State', locationmode='USA-states',
color='Customer_Count', scope='usa',
color_continuous_scale='Blues',
title='Number of Customers by State')
fig.show()
The full logic is:
Generating RFM Scores
Now let's define the logic for creating our RFM scores. Let's start by converting our transaction_date
to a pandas datetime object and a NOW
variable that is the maximum transaction_date
:
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
NOW = df['transaction_date'].max()
Next let's perform a groupby aggregation operation that allows us to calculate recency, frequency and monetary value.
- Recency – max overall date minus max customer date:
df.groupby('cardholder_name').agg({'transaction_date': lambda x: (NOW - x.max()).days})
- Frequency – number of transaction IDs for each customer:
df.groupby('cardholder_name').agg({transaction_id': lambda x: len(x)})
- Monetary value – sum of transaction amount for each customer:
df.groupby('cardholder_name').agg({transaction_amount': lambda x: x.sum()})
We'll also convert transaction_date
, which is transformed to recency, to an integer:
rfmTable = df.groupby('cardholder_name').agg({'transaction_date': lambda x: (NOW - x.max()).days, 'transaction_id': lambda x: len(x), 'transaction_amount': lambda x: x.sum()})
rfmTable['transaction_date'] = rfmTable['transaction_date'].astype(int)
Next, let's rename our columns appropriately.
transaction_date
becomesrecency
transaction_id
becomesfrequency
transaction_amount
becomemonetary_value
rfmTable.rename(columns={'transaction_date': 'recency',
'transaction_id': 'frequency',
'transaction_amount': 'monetary_value'}, inplace=True)
rfmTable = rfmTable.reset_index()
The full logic is:
We can look at the distribution in recency:
Frequency:
And monetary value:
Next we can calculate quartiles using the Pandas qcut
method for recency, frequency and monetary value:
rfmTable['r_quartile'] = pd.qcut(rfmTable['recency'], q=4, labels=range(1,5), duplicates='raise')
rfmTable['f_quartile'] = pd.qcut(rfmTable['frequency'], q=4, labels=range(1,5), duplicates='drop')
rfmTable['m_quartile'] = pd.qcut(rfmTable['monetary_value'], q=4, labels=range(1,5), duplicates='drop')
rfm_data = rfmTable.reset_index()
Next we can visualize recency/frequency heatmap, where each cell displays the percentage of customers with the corresponding recency and frequency values. First let's calculate the percentages:
heatmap_data = rfm_data.groupby(['r_quartile', 'f_quartile']).size().reset_index(name='Percentage')
heatmap_data['Percentage'] = heatmap_data['Percentage'] / heatmap_data['Percentage'].sum() * 100
Next let's generate our heatmap matrix:
heatmap_matrix = heatmap_data.pivot('r_quartile', 'f_quartile', 'Percentage')
Generate the map and label/title our heatmap using Seaborn and Matplotlib:
sns.set()
sns.heatmap(heatmap_matrix, annot=True, fmt=".2f", cmap="YlGnBu")
plt.title("Customer Segmentation Heatmap")
plt.xlabel("Frequency Quartile")
plt.ylabel("Recency Quartile")
plt.show()
The full logic is:
We can see the following insights from our heatmap:
- 16.21% of customers purchased recently but infrequently.
- 3.45% of customers are frequent and recent customers.
- 10% of customers purchased frequently but not for a long time.
- 5.86% of our customer haven't purchased recently and don't purchase frequently.
We are only considering the single merchant Chick-fil-A, but I encourage you to repeat this analysis for some of the other casual and fine dining merchants.
Next we can generate our RFM Scores by concatenating the quartiles in recency, frequency and monetary value:
And we can visualize the distribution in our RFM scores:
Here we see that the most common RFM score is ‘411' which corresponds to a recent customer who spends infrequently and very little.
Generating and Visualizing Customer Segments using RFM scores
Next we can generate our customer segments. This step is a bit subjective, but we define out segments as follows:
- Premium Customer: r, f, and m all ≥ 3
- Repeat Customer: f >= 3 and r or m >=3
- Top Spender: m >= 3 and f or r >=3
- At-Risk Customer: two or more of r,f and m <=2
- Inactive Customer: two or more = 1
- Other: anything else
Next we can see the distribution in segments:
Here we see that the largest customer segment is inactive customers, followed by at-risk, top spenders, repeat customers and premium customers.
And visualize the distribution as well:
We can also look at the average monetary value for each customer segment:
We see that the average monetary value is highest for premium customers repeat customers, and top spenders. Further, inactive and at-risk customer have low monetary values.
The average frequency value for each customer segment:
We see that premium customers, repeat customers, and top spenders are the most frequency while at-risk and inactive customers have low frequency.
Finally average recency value for each customer segment:
We see that the ‘other' category have the most recent purchases. This This may be a good segment for "new" customers whose purchasing patterns are inconclusive.
Using Customer Segments for Custom Marketing
You can use these customer segments to generate customized/personalized marketing messages relevant to each segment. For example, you can reward premium customers with special promotions and loyalty deals. You can also promote other products that they may be interested in purchasing based on their purchase history. For repeat/loyal customers you can develop automated email campaigns to keep their loyalty. At-risk customers are typically disengaged. We can develop campaigns based on these customer to reengage them and get them to start purchasing again. A similar campaign can be developed for inactive customers. Finally, for top spenders we can offer special promotions and deals on highly price products that they are likely to purchase again.
Using this data, we can take our analysis a step further and use items
and prices
fields to craft these customized marketing campaigns for these segments.
I encourage you to download the code from GitHub and repeat this analysis on your own for some other restaurant merchants.
Conclusions
Here we discussed how to perform customer segmentation on synthetic credit card transaction data. We started by performing some simple data exploration where we looked at the count of customers in each state for the merchant Chick-fil-A. Next we calculated the columns needed for generating RFM scores, recency, frequency and monetary value. We then showed how to generate quartiles in recency, frequency and monetary value for each customer, which we then used to construct RFM scores. We then used the RFM score to generate customer segments "Premium Customer", "Repeat Customer", "Top Spender", "At-risk Customer" and "Inactive Customer". Next we generated some visualizations that allowed us to analyze the distribution in customer segments across the data.
Using RFM scores to generate insightful customer segments is an invaluable skill for any data scientist that wants to deliver business values. Constructing interpretable segments and pulling insights from these segments can help businesses design marketing campaign strategies that can increase revenue and customer retention. Understanding the purchasing behavior of customers allows businesses to tailor promotional offers to appropriate customer bases. This article provides the fundamentals needed to get started!
The free version of the synthetic credit card data is [here](https://datafabrica.info/products/customer-segmentation-credit-card-transaction-data-dining-small). The full data set can be found here.