Digital Marketing Analysis with Python and MySQL

Introduction
Along this brief journey, we will explore a short and simple dataset containing basic website marketing metrics such as 'users', 'sessions' and 'bounces' over a five-month period.
The purpose, rather than a deep study of website performance, is to build some basic but genuinely useful knowledge for answering a number of must-have operational marketing questions.
We'll focus on two of the most widely used digital tools, following two routes that lead to the same results at the end of the day.
For each question, we will write a query in MySQL Workbench and, in parallel, the equivalent Python code with graphic and visual resources. The two environments are titled MySQL and Python respectively, with notes and explanations on both for a deeper understanding.
MySQL
-- displaying dataset (case_sql.csv)
SELECT * FROM case_sql;

You can download the SQL dataset here.
Python
# import python libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style as style
%matplotlib inline
color = sns.color_palette()
from pandas.plotting import table
from datetime import datetime
# load the data set
df = pd.read_csv("case.csv", sep=";")
# number of observations and columns
df.shape
(31507, 7)
# display rows sample
df.sample(15)

# SHAPE
# Create a function that prints the shape of the dataframe and some other basic info
# --> number of observations, features, duplicates, missing values (True, False) datatypes and its proportion.
def shape_df(df):
    print(f"Number of observations: {df.shape[0]}")
    print(f"Number of variables: {df.shape[1]}")
    print(f"Number of duplicates: {df.duplicated().sum()}")
    print(f"Are there any missing values? {df.isnull().values.any()}\n-----")
    print(f"{df.dtypes.sort_values(ascending=True)}\n-----")
    print(f"Datatypes' proportion:\n{df.dtypes.value_counts(ascending=True)}")
# calling the function
shape_df(df)
Number of observations: 31507
Number of variables: 7
Number of duplicates: 4083
Are there any missing values? False
-----
date int64
users int64
sessions int64
bounces int64
brand object
device_category object
default_channel_grouping object
dtype: object
-----
Datatypes' proportion:
object 3
int64 4
dtype: int64
# lowercase column names for easier typing
df.columns = map(str.lower, df.columns)
# remove spaces in columns name
df.columns = df.columns.str.replace(' ','_')
Convert dates into datetime type
# make string version of original column 'date', call it 'date_'
df['date_'] = df['date'].astype(str)
# create the new columns using string indexing
df['year'] = df['date_'].str[0:4]
df['month'] = df['date_'].str[4:6]
df['day'] = df['date_'].str[6:]
# concatenate 'year', 'month' and 'day'
df["date"] = df["year"] + "-" + df["month"] + "-" + df["day"]
# convert to datetime
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
# extract 'year', 'month' and 'weekday'
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.dayofweek.map({0 : "Mon", 1 : "Tue", 2 : "Wed", 3: "Thu", 4 : "Fri", 5 : "Sat", 6 : "Sun"})
# select columns to perform exploratory data analysis
cols = "date year month weekday brand device_category default_channel_grouping users sessions bounces".split()
df = df[cols].copy()
# display final dataset
df.head(10)
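As a side note, the string-slicing steps above can be collapsed into a single `pd.to_datetime` call. Here is a minimal sketch on a toy frame, assuming the raw `date` column holds integer dates in `YYYYMMDD` form as in the sample:

```python
import pandas as pd

# toy stand-in for the raw 'date' column (assumed integer YYYYMMDD values)
raw = pd.DataFrame({"date": [20190901, 20191122, 20200131]})

# parse directly: cast to string, then read the fixed format in one call
raw["date"] = pd.to_datetime(raw["date"].astype(str), format="%Y%m%d")

# derive the same calendar fields as in the step-by-step version
raw["year"] = raw["date"].dt.year
raw["month"] = raw["date"].dt.month
raw["weekday"] = raw["date"].dt.day_name().str[:3]
print(raw)
```

Parsing with an explicit `format` is also faster than letting pandas guess, which matters on larger exports.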

You can download the Python dataset here.
1. Let's look at the distribution of users by type of device
MySQL
-- Device distribution
SELECT
device_category, -- select the device_category column
ROUND(COUNT(users) / (SELECT
COUNT(users)
FROM
case_sql) * 100,
1) AS percent -- calculate the percentage of users in each category
FROM
case_sql -- select data from the case_sql table
GROUP BY 1 -- group the result by device_category
ORDER BY 1; -- order the result by device_category in ascending order

The distribution by device shows mobile and desktop nearly side by side as the most frequent access types.
Python
# device distribution
# counts the number of occurrences of each unique device category in the device_category column of the DataFrame df, including missing values (if any).
df.device_category.value_counts(dropna=False).plot(kind='pie', figsize=(8,4),
explode = (0.02, 0.02, 0.02),
autopct='%1.1f%%',
startangle=150);
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.title("Device distribution", fontsize=10, loc="right");

2. What is the above scenario on a brand perspective?
MySQL
-- Brand distribution
SELECT
brand, -- select the brand column
COUNT(users) AS users, -- count the number of users for each unique brand and alias the result as "users"
ROUND(COUNT(users) / (SELECT
COUNT(users)
FROM
case_sql) * 100,
2) AS percent -- calculate the percentage of users for each brand out of the total number of users in the case_sql table and alias the result as "percent"
FROM
case_sql -- select data from the case_sql table
GROUP BY 1; -- group the result by the first column (brand)

At brand level, Brand 2 takes the larger share of visits with 56.28%, against 43.72% for Brand 1.
Python
# Brand distribution
absolut = df["brand"].value_counts().to_frame()
# Pie chart
absolut.plot(kind='pie', subplots=True, autopct='%1.2f%%',
explode= (0.05, 0.05), startangle=20,
legend=False, fontsize=12, figsize=(8,4))
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.title("Brand's distribution", fontsize=10, loc="right");
display(absolut) # Table

3. On which day of the week did we see most users arrive on the Brand 1 website?
MySQL
SELECT
date, -- select the date column
DAYNAME(date) AS day_name, -- calculate the day name corresponding to each date
SUM(users) AS users -- sum the number of users for each unique date where the brand is 'Brand 1'
FROM
case_sql -- select data from the case_sql table
WHERE
brand = 'Brand 1' -- filter rows where the brand is 'Brand 1'
GROUP BY 1 -- group the result by date
ORDER BY 3 DESC -- order the result by users in descending order
LIMIT 1; -- select only the first row of the result (the row with the highest number of users)

Of the 298,412 users recorded between Sep 2019 and Jan 2020, the day most users arrived on the Brand 1 website was 2019-11-22, a Friday, with a total of 885 visits.
Python
# filter users that arrived at 'Brand 1' only, assign it 'brand_1'
brand_1 = df[df["brand"] == "Brand 1"].copy()
''' sum total users that came from all channel groupings for the same date,
assign it 'brandgroup', no matter the type of device '''
brandgroup = brand_1.groupby(["date","weekday"])[["users"]].sum()
# filter the date by maximum users, assign it 'users'
users = brandgroup[brandgroup["users"] == brandgroup.users.max()].copy()
# reset index
users.reset_index(inplace=True)
# results
print(f"""Date: {users.date} \n\nTotal users: {users.users} \n\nDay of week: {users.weekday}""")
Date: 0 2019-11-22
Name: date, dtype: datetime64[ns]
Total users: 0 885
Name: users, dtype: int64
Day of week: 0 Fri
Name: weekday, dtype: object
# calling the variable
users
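The same answer can also be reached more directly with `idxmax`. A sketch on a toy stand-in for `brand_1` (the values are made up; only the two column names from the article are assumed):

```python
import pandas as pd

# toy stand-in for the 'brand_1' slice: several rows per date (one per channel/device)
brand_1 = pd.DataFrame({
    "date": pd.to_datetime(["2019-11-21", "2019-11-22", "2019-11-22", "2019-11-23"]),
    "users": [300, 500, 385, 410],
})

# total users per date, then the index label where that total peaks
daily = brand_1.groupby("date")["users"].sum()
busiest_day = daily.idxmax()
print(busiest_day, daily.max())
```

`idxmax` returns the index label of the maximum, so no boolean filtering or index resetting is needed.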

3.1 How many users landed on Brand 2 on that same day?
MySQL
SELECT
DATE(date) AS date, -- Select the date from the 'date' column and convert it to a date data type
DAYNAME(date) AS dayofweek, -- Select the day of the week from the 'date' column
SUM(CASE
WHEN brand = 'Brand 1' THEN users -- Sum the 'users' column for Brand 1
ELSE NULL
END) AS b1_users,
SUM(CASE
WHEN brand = 'Brand 2' THEN users -- Sum the 'users' column for Brand 2
ELSE NULL
END) AS b2_users
FROM
case_sql -- From the 'case_sql' table
GROUP BY 1, 2 -- Group the results by the first and second columns (date and dayofweek)
ORDER BY 3 DESC -- Order the results by b1_users in descending order
LIMIT 1; -- Limit the results to only the highest total number of Brand 1 users

Actually, both brands saw the highest number of visits on the same day.
Python
# filter users that arrived at 'Brand 2', assign it 'brand_2'
brand_2 = df[df["brand"] == "Brand 2"].copy()
# rename the 'users' column from the previous (above) Python code
brandgroup.rename(columns={'users': 'brand1_users'}, inplace=True)
# include a new column with the filtered users from 'brand_2'
brandgroup["brand2_users"] = brand_2.groupby(["date","weekday"])["users"].sum()
# filter the new column (brand2_users) by maximum users
users2 = brandgroup[brandgroup["brand2_users"] == brandgroup.brand2_users.max()].copy()

4. Of all the Channel Groupings, which has contributed the highest number of users?
MySQL
SELECT
default_channel_grouping AS channels,
SUM(users) AS total_users,
ROUND(SUM(users) / (SELECT
SUM(users)
FROM
case_sql) * 100,
1) AS percent -- calculate the percentage of users for each channel
FROM
case_sql
GROUP BY 1
ORDER BY 2 DESC;

Organic Search is by far the channel that generates the most users (almost 141,000), representing nearly half of total visits across both websites, followed by Paid Search and Direct. Display takes 4th position and Social the 6th, contributing 6,722 users.
Python
# sum users by all channel groups and plot bar chart
ax = (df.groupby("default_channel_grouping")["users"].sum()
        .sort_values(ascending=True)
        .plot(kind="bar", figsize=(9,6), fontsize=12, linewidth=2,
              color=sns.color_palette("rocket"), grid=False, table=False))
# show data labels
for p in ax.patches:
    ax.annotate("%.0f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 7), textcoords='offset points')
# params
plt.xlabel("Channel groups", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Best channel group (highest number of users)", fontsize=10, loc="right");

4.1 Of all the Channel Groupings, which has contributed the highest number of users breakdown by brand?
MySQL
SELECT
default_channel_grouping AS channels,
SUM(CASE -- sum users by brand and map to new columns
WHEN brand = 'Brand 1' THEN users -- if brand = 'Brand 1', sum users and store in 'Brand_1' column
ELSE NULL -- if not 'brand 1', set value to null
END) AS Brand_1, -- create column for Brand 1 users
SUM(CASE
WHEN brand = 'Brand 2' THEN users
ELSE NULL
END) AS Brand_2
FROM
case_sql
GROUP BY 1 -- group by channel
ORDER BY 3 DESC; -- order by Brand 2 users in descending order

Python
# create pivot_table
# sum all users for each brand by channels
type_pivot = df.pivot_table(
columns="brand",
index="default_channel_grouping",
values="users", aggfunc=sum)
display(type_pivot)
#Display pivot_table with a bar chart
type_pivot.sort_values(by=["Brand 2"], ascending=True).plot(kind="bar", figsize=(12,8) ,fontsize = 15)
plt.xlabel("Channel groups", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Channel groups by brand (highest number of users)", fontsize=10, loc="right");


Organic Search contributed 105,062 users for Brand 2 and 35,911 users for Brand 1. With the exception of 'Other', where Brand 1 is superior, Brand 2 delivers the higher number of users to the website across all channels.
5. Amongst all channels, which brand has contributed with a percentage of at least 5% of paid sessions during 2019?
MySQL
SELECT
brand,
default_channel_grouping AS channels,
ROUND(SUM(sessions) / (SELECT
SUM(sessions)
FROM
case_sql) * 100,
1) AS percent
FROM
case_sql
WHERE
default_channel_grouping IN ('Paid Search' , 'Paid Social', 'Display', 'Other Advertising') -- include only rows with these values
AND date < '2020-01-01' -- only date before '2020-01-01' will be included.
GROUP BY 1 , 2
HAVING percent > 5 -- filters the groups to only include values greater than 5%.
ORDER BY 1 , 3 DESC;

Python
# groupby dataframe by selected cols
df = df.groupby(["date","brand","default_channel_grouping"])["sessions"].sum().to_frame().copy()
# calculate percentages (new column)
df["percent"] = (df["sessions"] / df["sessions"].sum() * 100).round(2)
# reset index
df = df.reset_index().copy()
# display a 5 rows sample
df.sample(5)

# filter paid channels
paid = df[df['default_channel_grouping'].isin(['Display','Paid Search','Paid Social','Other Advertising'])].copy()
# filter year 2019
paid = paid[paid['date'] < '2020-01-01']
# groupby channels by brand
paid = paid.groupby(["brand","default_channel_grouping"])[["sessions","percent"]].sum()
# filter sessions higher than 5%
paid[paid["percent"] >5]

6. How many visits did both brands receive by type of device?
MySQL
SELECT
brand,
SUM(CASE
WHEN device_category = 'Desktop' THEN users
ELSE NULL
END) AS desktop,
SUM(CASE
WHEN device_category = 'Mobile' THEN users
ELSE NULL
END) AS mobile,
SUM(CASE
WHEN device_category = 'Tablet' THEN users
ELSE NULL
END) AS tablet
FROM
case_sql
GROUP BY 1
ORDER BY 1;

Python
# pivot_table
type_pivot = df.pivot_table(
columns="device_category",
index="brand",
values="users", aggfunc=sum)
display(type_pivot)
# display pivot_table (chart)
ax = type_pivot.sort_values(by=["brand"], ascending=False).plot(kind="bar", figsize=(12,8) ,fontsize = 15);
# adding data labels
for p in ax.patches:
    ax.annotate("%.0f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("Brands", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Brand by type of device", fontsize=10, loc="right")
plt.legend(["Desktop","Mobile","Tablet"]);

Mobile is the preferred type of device at Brand 2 whereas desktop is the most used device at Brand 1.
6.1 What is the average usage of each device type by channel?
MySQL
SELECT
default_channel_grouping,
AVG(CASE
WHEN device_category = 'Desktop' THEN users
ELSE NULL
END) AS desktop,
AVG(CASE
WHEN device_category = 'Mobile' THEN users
ELSE NULL
END) AS mobile,
AVG(CASE
WHEN device_category = 'Tablet' THEN users
ELSE NULL
END) AS tablet
FROM
case_sql
GROUP BY 1
ORDER BY 1;

Python
# pivot_table
type_pivot = df.pivot_table(
columns="device_category",
index="default_channel_grouping",
values="users", aggfunc=np.mean)
display(type_pivot)

# display pivot_table
type_pivot.sort_values(by=["default_channel_grouping"], ascending=False).plot(kind="bar", figsize=(12,8) ,fontsize = 15);
plt.xlabel("Date", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Average use of device types by channel grouping", fontsize=10, loc="right")
plt.legend(["Desktop","Mobile","Tablet"]);

On average, desktop is used more often for Referral, Direct and Other. For the remaining channels, vertically oriented (mobile-first) content should always be taken into consideration.
7. How to assess the bounce rate of the Channel Groupings?
The Bounce Rate is calculated as the total number of bounces divided by the total number of sessions.
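As a quick numeric check of that definition (with illustrative numbers), 1,200 bounces over 2,400 sessions give a 50% bounce rate:

```python
# bounce rate = bounces / sessions * 100 (illustrative numbers)
sessions = 2400
bounces = 1200
bounce_rate = round(bounces / sessions * 100, 2)
print(bounce_rate)  # 50.0
```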
MySQL
SELECT
default_channel_grouping,
SUM(sessions) AS sessions,
SUM(bounces) AS bounces,
ROUND(SUM(bounces) / SUM(sessions) * 100, 2) AS bounces_r
FROM
case_sql
GROUP BY 1
ORDER BY 4 DESC;

Average Bounce Rate: 54.93% (avg_bounces_r)
SELECT
SUM(sessions) AS sessions,
SUM(bounces) AS bounces,
ROUND(SUM(bounces) / SUM(sessions) * 100, 2) AS bounces_r,
AVG(ROUND(bounces/sessions*100, 2)) AS avg_bounces_r
FROM
case_sql;
Python
# group individual channels by sum of users
dfbounce = df.groupby("default_channel_grouping")["users"].sum().to_frame()
# group individual channels by sum of sessions
dfbounce["sessions"] = df.groupby("default_channel_grouping")["sessions"].sum()
# group individual channels by sum of bounces
dfbounce["bounces"] = df.groupby("default_channel_grouping")["bounces"].sum()
# calculus of bounce rate for each individual channel
dfbounce["bounces_r"] = dfbounce.apply(lambda x: 0.0 if x["sessions"] == 0.0 else (x["bounces"] / x["sessions"])*100, axis=1).round(2)
dff = dfbounce.copy()
dfbounce.drop(["users"],axis=1,inplace=True)
# sort values by rate
dfbounce.sort_values(by="bounces_r", ascending=False)
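The three separate `groupby` calls above can also be collapsed into a single pass. A sketch on a toy frame with made-up values (only the article's column names are assumed):

```python
import pandas as pd

# toy frame with the article's column names and made-up values
df_toy = pd.DataFrame({
    "default_channel_grouping": ["Display", "Display", "Referral"],
    "sessions": [1000, 500, 800],
    "bounces": [700, 350, 200],
})

# one pass: sum sessions and bounces per channel, then derive the rate
agg = df_toy.groupby("default_channel_grouping")[["sessions", "bounces"]].sum()
agg["bounces_r"] = (agg["bounces"] / agg["sessions"] * 100).round(2)
print(agg.sort_values("bounces_r", ascending=False))
```

Summing both columns in one `groupby` avoids repeated scans of the frame and keeps the intermediate columns aligned on the same index.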

# display bar chart with the bounce rate for each channel
ax = (dfbounce["bounces_r"].sort_values(ascending=True)
        .plot(kind="bar", figsize=(9,6), fontsize=12, linewidth=2,
              color=sns.color_palette("rocket"), grid=False, table=False))
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 7), textcoords='offset points')
# horizontal reference line at the mean bounce rate across channels
plt.axhline(dfbounce["bounces_r"].mean(), linewidth=1, color="r")
plt.xlabel("channel groups", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Bounce rate by channelGrouping", fontsize=10, loc="right");

As expected, Display has the highest bounce rate of all channels, followed by Direct and Paid Social. Organic Search is in line with the average. Below the threshold, with the lowest Bounce Rates, we find Referral, Native and Other Advertising.
7.1 Does the Bounce Rate on the site appear to be improving or deteriorating over time?
MySQL
SELECT
YEAR(date) AS year, -- extract year
MONTH(date) AS month, -- extract month
DATE_FORMAT(date, '%b') AS month_, -- format the date column to display month name
ROUND(SUM(bounces) / SUM(sessions) * 100, 2) AS bounces_r -- calculate bounce rate
FROM
case_sql
GROUP BY 1 , 2 , 3
ORDER BY 1 , 2 , 3;

Python
df_date = df.groupby("date")[['sessions','bounces']].sum()
''' create function to assess the bounce rate, assign it as 'bounce_r'
Return 0 if session's value is 0, else divide the bounces by sessions
for each date and multiply it by 100 to get the percentage '''
def div(bounces, sessions):
    return lambda row: 0.0 if row[sessions] == 0.0 else float(row[bounces] / row[sessions]) * 100
# create column 'bounce_r' with the function results
df_date["bounce_r"] = (df_date.apply(div('bounces', 'sessions'), axis=1)).round(1)
# drop unnecessary columns
df_date.drop(["sessions","bounces"], axis=1, inplace=True)
# sum all bounces over time and plot chart
ax = df_date.plot(kind="line", figsize=(14,6), fontsize=12, linewidth=2)
plt.xlabel("Date", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Rate", fontsize=10)
plt.title("Evolution of the bounce rate over time", fontsize=10, loc="right");

# smooth the line by resampling the daily series to monthly means
resampled = df_date["bounce_r"].resample("M").mean()
plt.figure(figsize = (12,6))
ax = sns.lineplot(data = resampled)
plt.title("Evolution of the bounce rate over time (smooth)", fontsize=10, loc="right")
plt.xlabel("Date", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Rate", fontsize=10);

The Bounce Rate on site is improving over time.
7.2 Bounce Rate breakdown by Channel and Brand
Python
# filter by brand
b1 = df[df["brand"] == "Brand 1"]
b2 = df[df["brand"] == "Brand 2"]
# ** brand 1 **
# group individual channels by sum of sessions for brand 1
dfbrand = b1.groupby("default_channel_grouping")["sessions"].sum().to_frame()
dfbrand.rename(columns={"sessions":"sessions1"}, inplace=True)
# group individual channels by sum of bounces for brand 1
dfbrand["bounces1"] = b1.groupby("default_channel_grouping")["bounces"].sum()
# calculus of bounce rate for each individual channel for brand 1
dfbrand["1bounces_r"] = dfbrand.apply(lambda x: 0.0 if x["sessions1"] == 0.0 else (x["bounces1"] / x["sessions1"]*100), axis=1).round(2)
# ** brand 2 **
# group individual channels by sum of bounces for brand 2
dfbrand["sessions2"] = b2.groupby("default_channel_grouping")["sessions"].sum()
# group individual channels by sum of bounces for brand 2
dfbrand["bounces2"] = b2.groupby("default_channel_grouping")["bounces"].sum()
# calculus of bounce rate for each individual channel for brand 2
dfbrand["2bounces_r"] = dfbrand.apply(lambda x: 0.0 if x["sessions2"] == 0.0 else (x["bounces2"] / x["sessions2"]*100), axis=1).round(2)
# sort values by rate
dfbrand.sort_values(by="1bounces_r", ascending=False)

# clean dataframe
dfchannels = dfbrand.copy()
dfbrand_chart = dfbrand.copy()
dfbrand_chart.drop(["sessions1","sessions2","bounces1","bounces2"], axis=1, inplace=True)
# display bar chart with the average bounce rate for each channel
ax = dfbrand_chart.plot(kind="bar", figsize=(13,6), fontsize=12, linewidth=2, color=sns.color_palette("BrBG"), grid=False, table=False)
for p in ax.patches:
    ax.annotate("%.1f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("channel groups", fontsize=10)
plt.xticks(rotation=90, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Bounce rate by channelGrouping and by brand", fontsize=10, loc="right")
plt.legend(["Brand 1","Brand 2"]);

Brand 1 has a lower bounce rate than would be expected for the Display channel, which usually drives a high Bounce Rate; it is worth understanding its content strategy in detail and trying to adapt it to Brand 2.
Brand 2 shows a higher bounce rate than would be acceptable in the Referral channel.
8. Proportions between inbound and paid media
MySQL
SELECT
brand,
CASE
WHEN
default_channel_grouping IN ('Paid Search',
'Paid Social',
'Display',
'Other Advertising')
THEN
'Paid'
WHEN
default_channel_grouping IN ('Direct',
'Native',
'Organic Search',
'Referral',
'Social',
'Email',
'(Other)')
THEN
'Organic'
ELSE NULL
END AS media,
ROUND(SUM(bounces) / SUM(sessions) * 100, 2) AS bounce_r
FROM
case_sql
GROUP BY brand , media
ORDER BY 1;

Python
# create dictionary
media_dict = {
'Display': 'paid',
'Paid Search': 'paid',
'Paid Social': 'paid',
'Other Advertising': 'paid',
'Direct': 'organic',
'Native': 'organic',
'Organic Search': 'organic',
'Referral': 'organic',
'Social': 'organic',
'Email': 'organic',
'(Other)': 'organic'
}
# mapping the dict into a new column
df['media'] = df['default_channel_grouping'].map(media_dict)
# define cols position in dataframe
cols = ['brand','media','sessions','bounces']
# reindex columns order
df = df.reindex(columns = cols)
# groupby dataframe by selected cols
df = df.groupby(["brand","media"])[["sessions","bounces"]].sum()
# bounce rate by channel
df["bounces_r"] = df.apply(lambda x: 0.0 if x["sessions"] == 0.0 else (x["bounces"] / x["sessions"])*100, axis=1).round(2)

Conclusion
As promised, we went through a step-by-step approach to a simple digital marketing analysis, working side by side with MySQL Workbench and Python.
Both tools have their own specificities and demands, but the reasoning is very similar, leaving aside their graphical capabilities and limitations.
Feel free to download the datasets and practice some of the technical details covered here, implementing new code as you ask further analysis questions.
Explore other projects you might also like:
SQL Digital Marketing Analysis
15 Business Questions about Mobile Marketing Campaigns: ROAS (Return On Ad Spend)
✅ Thanks for reading!