Using Decision Trees for Exploratory Data Analysis
Introduction
The Decision Tree (DT) is the most intuitive Machine Learning algorithm.
There, I said it. This is just my opinion. But I am sure this is also a common feeling in the Data Science field.
Decision Tree (…) machine learning method that also makes complex decisions from sets of simple choices. [Brett Lantz, Machine Learning with R]
Widely used in operations research and data science, the DT owes its success to the fact that it mimics the human decision process. It works like a flow chart in which each node makes a simple binary decision over a given variable, and that continues until we reach the final decision.
A simple example: buying a T-shirt. If I want to buy a shirt, I may have in mind a few variables like price, brand, size, and color. So I start my decision process from a budget:
- If the price is over $30, I won't buy it. Otherwise, I'll consider it.
- Once I find something under $30, I want it to be from a brand I like. If it is, I carry on in the decision process.
- Now, does it fit me, my size? If so, we keep going.
- Finally, if the shirt (under $30, brand X, size S) is black, I will take it; otherwise, I can either keep looking or finish my decision process with "I won't buy it". The sketch below shows this same flow as code.
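To make that flow concrete, here is a minimal sketch of the same decision process written as nested conditionals (the thresholds, brand, and size are just the ones from the example above):
# Toy version of the T-shirt decision process
def buy_tshirt(price, brand, size, color):
    if price > 30:           # budget check comes first
        return "won't buy"
    if brand != "X":         # only a brand I like
        return "keep looking"
    if size != "S":          # it has to fit me
        return "keep looking"
    if color == "black":     # final preference
        return "buy it"
    return "keep looking"

print(buy_tshirt(price=25, brand="X", size="S", color="black"))  # buy it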

This process is so logical and simple that it can be applied to all kinds of data. The downside of this algorithm is that it is very sensitive to changes in the data, especially in small datasets. It can easily learn the small variances of the data and overfit your machine learning model.
This quality of DTs can be a threat to predictions but is exactly what we want to take advantage of during our Exploratory Data Analysis process.
In this post, we will learn how to use the power of DT to extract better insights from our data. Let's move on.
What is EDA?
Exploratory Data Analysis, or EDA for short, is the phase in a Data Science project where we take the dataset and explore its variables, trying to learn as much as possible about what most influences the target variable.
In this phase, the data scientist wants to understand the data: how it is distributed, whether there are errors or incomplete records, and how each explanatory variable affects the target variable, extracting the first insights along the way.
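In pandas, that usually starts with a handful of quick checks. A minimal sketch (df stands for whatever dataset you are exploring):
# First-pass checks on any pandas DataFrame
df.info()         # column types and non-null counts
df.describe()     # summary statistics of the numeric columns
df.isna().sum()   # missing values per column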
Using Decision Trees in the Process
Because a DT captures even the smallest variances in the data, it helps us understand the relationships between variables. And since we are just exploring the data here, we don't have to be very careful with data splits or algorithm fine-tuning. We just run a DT and read the splits it finds.
Let's see how to do that.
The dataset
The dataset to be used in this exercise is the Student Performance, from the UCI Repository, created by Paulo Cortez. This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
# Importing libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import plot_tree
from ucimlrepo import fetch_ucirepo
sns.set_style()
# Fetch the Student Performance dataset from the UCI repository
student_performance = fetch_ucirepo(id=320)
# Data (as pandas dataframes)
X = student_performance.data.features
y = student_performance.data.targets
# Gather X and y in a single frame for visualizations
df = pd.concat([X, y], axis=1)
df.head(3)

We intend to determine which variables in this data have the most influence over the final grade G3.
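Before picking variables by hand, one quick and rough way to shortlist candidates is a DT's feature_importances_ attribute. This is just a sketch under a couple of assumptions: it uses only the numeric columns (scikit-learn trees need numeric inputs) and drops the grade columns so the tree can't lean on them:
# Rank numeric features by how much a fully grown DT uses them to predict G3
num_cols = df.select_dtypes('number').columns.difference(['G1', 'G2', 'G3'])
dt_rank = DecisionTreeRegressor(random_state=42).fit(df[num_cols], df.G3)
importances = pd.Series(dt_rank.feature_importances_, index=num_cols)
print(importances.sort_values(ascending=False).head(10))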
Exploring with a Regression DT
Now let's build a DT to check the influence of failures, absences, and studytime over G3.
# Columns to explore
cols = ['failures', 'absences', 'studytime']
# Split X & Y
X = df[cols]
y = df.G3
# Fit Decision Tree
dt = DecisionTreeRegressor().fit(X,y)
# Plot DT
plt.figure(figsize=(20,10))
plot_tree(dt, filled=True, feature_names=X.columns, max_depth=3, fontsize=8);
Here is the resulting Decision Tree.

Now we have a good visualization for understanding the relationships between the variables we listed. Here are the insights we can get from this tree:
- The left branch means "Yes" and the right branch means "No" to the condition in the first line of each box.
- Students with fewer failures (< 0.5, or zero, we should say) have higher grades. Just observe that the value in each box on the left is higher than in the one on the right.
- Among the students with no failures, those with studytime > 2.5 get higher grades; the value is almost one point higher.
- Students with no failures, studytime < 1.5, and fewer than 22 absences have higher final grades than those with little study time and more absences. We can quickly confirm these reads with pandas, as sketched below.
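If you want to double-check what the tree is showing, plain pandas group means tell the same story. A quick sketch:
# Sanity-check the tree's splits with simple group averages
print(df.groupby(df.failures.eq(0))['G3'].mean())   # zero failures vs. at least one
no_fail = df[df.failures.eq(0)]
print(no_fail.groupby(no_fail.studytime > 2.5)['G3'].mean())   # study time cut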
Free Time and Go Out
If we want to explore which students have higher grades based on the amount of free time and how frequently they go out, here's the code.
# Columns to explore
cols = ['freetime', 'goout']
# Split X & Y
X = df[cols]
y = df.G3
# Fit Decision Tree
dt = DecisionTreeRegressor().fit(X,y)
# Plot DT
plt.figure(figsize=(20,10))
plot_tree(dt, filled=True, feature_names=X.columns, max_depth=3, fontsize=10);

The variables goout and freetime are scaled from 1 = Very Low to 5 = Very High. Notice that those who don't go out frequently (< 1.5) and don't have free time (< 1.5) have grades as low as those who go out a lot (> 4.5) and have a fair amount of free time. The best grades come from students who strike a balance: going out > 1.5 and free time in the 1.5 to 2.5 range.
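The same reading can be cross-checked with a simple pivot table of average grades. A quick sketch:
# Mean final grade for each goout x freetime combination
print(df.pivot_table(values='G3', index='goout', columns='freetime', aggfunc='mean').round(1))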
Exploring with a Classification DT
The same exercise can be done with a Classification Tree algorithm. The logic and code are the same, but now the value shown in each node is the predicted class instead of a number. Let's see a simple example using another dataset: Taxis, from the Seaborn package (BSD license), which contains a set of taxi rides in New York City.
If we want to explore the relationship between the total amount of the run and the payment method, here's the code.
# Load the dataset
df = sns.load_dataset('taxis').dropna()
# Columns to explore
cols = ['total']
# Split X & Y
X = df[cols]
y = df['payment']
# Fit Decision Tree
dt = DecisionTreeClassifier().fit(X,y)
#Plot Tree
plt.figure(figsize=(21,10))
plot_tree(dt, filled=True, feature_names=X.columns, max_depth=3,
          fontsize=10, class_names=dt.classes_);  # classes_ keeps the labels in the model's own order

Just eyeballing the resulting tree, we can see that lower total amounts are much more likely to be paid in cash. Totals under $9.32 are generally paid in cash.
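As before, we can verify that reading directly with pandas. A quick sketch using the tree's first cut:
# Payment mix below and above the $9.32 cut found by the tree
print(df.groupby(df.total < 9.32)['payment'].value_counts(normalize=True).round(2))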
Cool, isn't it?
Before You Go
In this tutorial, we learned a quick way to use Decision Trees to explore the relationship between variables in our dataset.
This algorithm can quickly capture patterns that are not easily found at first. We can use the power of decision trees to find those cuts of the data, enabling us to extract great insights from it.
And a quick note about the code: in the plot_tree() function, you can set how many levels to display with the max_depth argument. You can also set that hyperparameter on the DT instance from sklearn; it's up to you. The advantage of using it in plot_tree is that you can quickly test many different depths without needing to re-train the model.
plot_tree(dt, filled=True, feature_names=X.columns, max_depth=3);
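If you prefer to cap the tree itself, the equivalent on the estimator side looks like this; note that this changes the fitted model, so the splits can differ from those of a fully grown tree:
# Capping depth at training time instead of only at plotting time
dt = DecisionTreeRegressor(max_depth=3).fit(X, y)
plot_tree(dt, filled=True, feature_names=X.columns);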
If you liked this content, follow me for more.
Find me on LinkedIn. Let's connect.
References
UCI Machine Learning Repository
GitHub – mwaskom/seaborn-data: Data repository for seaborn examples
A good reference I'd like to mention: I learned this technique from a great Brazilian data scientist, Teo Calvo. He runs an awesome free program with daily livestreams on his channel, Teo Me Why. If you speak Portuguese, learn more about his work.