10 Decision Trees are Better Than 1

Author: Murphy

This is the 2nd article in a series on decision trees. In the last post, we introduced decision trees and discussed how to grow them using data. While they are an intuitive machine learning approach, decision trees are prone to overfitting. Here we discuss a solution to this overfitting problem via decision tree ensembles.

Key points:

  • Decision Tree ensembles combine several decision trees into a single estimator
  • Tree ensembles are less prone to overfitting than a single decision tree

Decision trees

In the previous article of this series, I reviewed decision trees and how we can use them to make predictions. However, for many real-world problems, a single decision tree is often prone to bias and overfitting.

We saw this in our example from the last blog, where even after a little hyperparameter tuning, our decision tree was still wrong 35% of the time.

Decision Trees: Introduction & Intuition

A solution to this poor performance problem is to use an ensemble of decision trees rather than just one. The key benefit of tree ensembles is they generally have better performance than a single decision tree.

Qualitative visual comparison of single decision tree vs tree ensembles. Image by author.

2 Types of Tree Ensembles

While there are many ways we could combine a set of decision trees to improve performance, two popular methods are bagging and boosting. These are the seeds behind popular techniques such as Random Forest, AdaBoost, and XGBoost.

1) Bagging (Random Forest)

Bagging is short for "Bootstrap AGGregation." It works by training a set of decision trees independently, each on a dataset created by randomly sampling your training dataset with replacement.

Put another way, a set of training datasets, {T}, is derived from an initial training dataset, T₀, by randomly sampling T₀ with replacement. Then, each dataset in {T} is used to grow a decision tree, forming a set of decision trees. This set forms a decision tree ensemble whose predictions can be aggregated (via mean, median, etc.) to obtain better prediction performance.

Visual overview of bagging process. Image by author.

Intuitively, this works because it avoids your model over-optimizing to any single dataset, which helps avoid the overfitting problem and improves model performance.
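To make the procedure concrete, here is a minimal sketch of the bagging idea using sklearn's DecisionTreeClassifier and NumPy. The function name and parameters (bagging_predict, n_trees) are illustrative, and the inputs are assumed to be NumPy arrays with 0/1 labels.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_new, n_trees=25, random_state=0):
    """Illustrative bagging sketch (X, y assumed to be NumPy arrays, y in {0, 1})."""
    rng = np.random.default_rng(random_state)
    n = len(X_train)
    trees = []
    for _ in range(n_trees):
        # draw a bootstrap sample: n records sampled with replacement
        idx = rng.integers(0, n, size=n)
        trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

    # aggregate the ensemble's predictions via majority vote
    votes = np.stack([tree.predict(X_new) for tree in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)

In practice, you would reach for sklearn's built-in BaggingClassifier (or Random Forest, below) rather than rolling your own.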

Random Forest – One of the most popular bagging-based techniques out there is Random Forest [1]. While I won't get into all the details of Random Forest here, there are 2 details of the technique worth mentioning.

First, beyond the random sampling with replacement, each split within each tree considers only a random subset of the available features, which helps reduce the correlation between trees in the forest and ultimately reduces generalization error.

Second, the algorithm computes internal ("out-of-bag") error estimates for each tree, which can be used to define a feature importance ranking based on error rate, as opposed to information gain (which is used in other methods). The point may seem subtle, but this is arguably a more desirable way to define importance, since the error is what we ultimately care about, while information gain is only indirectly related to it.
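Both details are visible in sklearn (a minimal sketch; the hyperparameter values below are arbitrary). max_features controls the random feature subset considered at each split, oob_score=True keeps the out-of-bag estimate, and permutation_importance gives an error-based importance ranking (sklearn's built-in feature_importances_ attribute is impurity-based, not error-based).

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)

# max_features: size of the random feature subset considered at each split
# oob_score=True: keep the internal out-of-bag error estimate
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)

print(rf.oob_score_)  # out-of-bag accuracy estimate (no separate test set needed)

# an error-based importance ranking
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)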

2) Boosting (AdaBoost, Gradient Boosting, XGBoost)

Boosting leverages the idea of learning from your mistakes. It works by iteratively training decision trees via an error-based data re-weighting scheme.

Walking through this, we start with a single training dataset, say T₀, and use it to grow a single decision tree. Afterward, the training dataset is re-weighted so that records with incorrectly predicted targets receive more weight. This re-weighted dataset, say T₁, is then used to train another decision tree.

Re-weighting and training are repeated until you have a set of decision trees. In a similar manner to bagging, predictions from this tree ensemble can be aggregated (e.g. weighted average) to improve model performance.

Visual overview of boosting process. Image by author.

Notice, rather than randomly sampling the training data (like with bagging), the training data are non-randomly "re-sampled" based on errors. An intuitive way to think about this is that the outliers and edge cases might be missed in the early decision trees, but as the boosting process continues, these atypical records get more weight in subsequent training sets, which forces the later decision trees to label them correctly.
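Here is a deliberately simplified sketch of that re-weighting loop using sklearn decision stumps. The doubling rule below is illustrative only (it is not the actual AdaBoost update), and the inputs are assumed to be NumPy arrays with 0/1 labels.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_sketch(X, y, n_rounds=10):
    """Illustrative boosting loop: the re-weighting rule here is a stand-in."""
    w = np.ones(len(y)) / len(y)   # start with uniform record weights
    trees = []
    for _ in range(n_rounds):
        tree = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = tree.predict(X) != y
        w[wrong] *= 2.0            # up-weight the records this tree got wrong
        w /= w.sum()               # re-normalize so the weights sum to 1
        trees.append(tree)
    return trees

The real AdaBoost update derives each round's weight change from the tree's weighted error rate [2]; the point here is just the shape of the loop.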

AdaBoost – The 1st practical boosting-based technique was AdaBoost (short for Adaptive Boosting). This uses a boosting strategy following the intuitive description given above, with a particular re-weighting rule [2]. Later, the boosting strategy used in AdaBoost was placed into a more general framework known as gradient boosting.

Boosting vs Gradient Boosting – In a sense, boosting and gradient boosting are one and the same. Boosting commonly refers to a strategy similar to AdaBoost, where decision trees are iteratively trained from a single dataset using an error-based data re-weighting rule.

Gradient boosting is, in a sense, a generalization of this. It places boosting in a larger mathematical framework. One notable detail is gradient boosting does not generate subsequent trees through re-weighting the training data but rather through a gradient-based model update rule [3].

Nevertheless, AdaBoost can be recast into this general gradient boosting framework, where the model update rule is the gradient of the exponential loss function [4].
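This connection is visible directly in sklearn: GradientBoostingClassifier accepts loss="exponential" (binary classification only), which recovers an AdaBoost-like algorithm, while the default log loss gives the usual gradient-boosted trees. A minimal sketch:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

# exponential loss -> AdaBoost-like behavior; the default (log) loss -> standard gradient boosting
gb_as_ada = GradientBoostingClassifier(loss="exponential", random_state=0).fit(X, y)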

XGBoost – A scalable implementation of the gradient boosting idea is XGBoost. It works via approximations that make gradient boosting more computationally efficient. Additional (technical) details can be found in the original paper by Chen and Guestrin [5].
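XGBoost lives in its own package (pip install xgboost) rather than in sklearn, but it exposes an sklearn-style interface. A minimal sketch, with arbitrary hyperparameter values:

from xgboost import XGBClassifier  # requires: pip install xgboost
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# sklearn-compatible API: fit, predict, score all work as usual
xgb = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, random_state=0)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))   # accuracy on the held-out data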

3 Benefits of Tree Ensembles

The key feature of tree ensembles that makes them advantageous is the power of population estimates, that is, estimates based on a collection of models rather than just one.

Going from point estimates to population estimates. Image by author.

As I mentioned earlier, the main benefit of this is improving model performance and curbing overfitting. However, there are a couple of other benefits of tree ensembles.

1) No more overfitting (kind of)

While overfitting is always a concern when developing data-driven models, tree ensembles are much more resilient to overfitting than single decision trees. This result comes more from experiment and practice than from rigorous theory. Moreover, it seems to stand in stark contrast to the traditional thinking that more "complex" models are more prone to overfitting.

There are many ideas out there trying to explain "why" ensembles work so well. A flavor of this is given in the discussion by Schapire [2]. For now, however, this seems to be an open question.

2) More robust feature importance rankings

"Importance" is another quantity that benefits from using ensembles. Since each ensemble member is trained in a slight way, this results in a more robust quantification of information gain (or out-of-bag error [1] or Friedman's influence quantity [3]).

3) Forget about point-estimates

Although the final output of a tree ensemble is a single prediction (e.g. true or false), under the hood, several estimates go into this final output. Put another way, each prediction made by a tree ensemble corresponds to a distribution of estimates, not a point estimate.

The upside of this is we can quantify things like model confidence or uncertainty, which can serve as an additional window into model performance and reliability.
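For example, with a Random Forest you can look at the individual tree votes behind a single prediction (a sketch; sklearn's predict_proba aggregates the per-tree probabilities under the hood):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# the votes of the individual trees for one record: a distribution, not a point estimate
votes = np.array([tree.predict(X[:1])[0] for tree in rf.estimators_])
print(votes.mean())             # fraction of trees voting for class 1 (benign)
print(rf.predict_proba(X[:1]))  # the aggregated class probabilities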

Example Code: Breast Cancer Prediction Using Tree Ensembles

Now we turn to a concrete example of using tree ensembles for breast cancer detection. In this example, we compare the performance of 4 different techniques: a single decision tree, Random Forest, AdaBoost, and gradient-boosted trees.

We use the sklearn Python library for all the ML stuff, and the dataset is originally from the UCI Machine Learning repository [6,7]. The code for this example is freely available in the GitHub repository.

We first import our Python libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer # toy dataset (sorry)

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

Here we use a toy dataset readily available via sklearn. We can import the ready-to-use data with a single line of code.

df = load_breast_cancer(as_frame = True)['frame']

It's a good practice to plot histograms for each variable in the dataset. This is easily done using the hist() method in pandas.

plt.rcParams.update({'font.size': 16})

df.hist(figsize=(24,16))
plt.savefig('visuals/raw_histograms.png',facecolor='white',bbox_inches="tight")

Histograms from each variable in our dataset. Image by author.

Notice we have many more variables (31 total) than in our previous example. Additionally, the dataset is once again imbalanced since we have more benign (target=1) than malignant (target=0) tumors. Unlike the previous example, however, here we won't do anything about the imbalance and will observe how the 4 different models hold up.
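If you want to see the imbalance explicitly, a one-line check on the dataframe does it:

# counts of benign (1) vs malignant (0) tumors
print(df['target'].value_counts())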

Next, we define the training and testing datasets. Here we use every available predictor variable and an 80–20 train-test split.

# define predictor and target variable names
X_var_names = df.columns[:df.shape[1]-1]
y_var_name = df.columns[df.shape[1]-1]

# create predictor and target arrays
X = df[X_var_names]
y = df[y_var_name]

# create train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Now, we can train our models. Since I refuse to copy-paste code, here I do some "automatic code generation" to train each model and execute it using the exec() command. Here's what that looks like.

clf_list = []
model_name_list = ['DecisionTreeClassifier', 'RandomForestClassifier', 'AdaBoostClassifier', 'GradientBoostingClassifier']

# train model
for model_name in model_name_list:

    # dynamically write command to train each classifier in model_name_list
    cmd = 'clf = ' + model_name + '(random_state=0)'
    print(cmd)    # print command
    exec(cmd)     # execute command

    # fit data to model
    clf = clf.fit(X_train, y_train)

    # append classifier to list
    clf_list.append(clf)

# here's what gets printed
# >>> clf = DecisionTreeClassifier(random_state=0)
# >>> clf = RandomForestClassifier(random_state=0)
# >>> clf = AdaBoostClassifier(random_state=0)
# >>> clf = GradientBoostingClassifier(random_state=0)
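If you would rather avoid exec(), an equivalent (and arguably more idiomatic) sketch is to instantiate the classifiers directly and loop over the objects:

# same four models, no string manipulation or exec() needed
clf_list = [clf.fit(X_train, y_train) for clf in (
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
    AdaBoostClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
)]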

Finally, I again use automatic code generation to compute each model's precision, recall, and f1-score on the training and testing datasets, then organize the results in a pandas dataframe.

performance_dict = {}

dataset_name_list = ['train', 'test']
metric_name_list = ['precision', 'recall', 'f1']

# evaluate metrics for each model
for i in range(len(model_name_list)):

    clf = clf_list[i]

    prec_val_list = []
    col_name_list = []

    # evaluate metrics for each dataset
    for dataset_name in dataset_name_list:

        cmd = 'y_pred_' + dataset_name + ' = clf.predict(X_' + dataset_name + ')'
        print(cmd)
        exec(cmd)

        # evaluate each metric for given model and dataset      
        for metric_name in metric_name_list:

            cmd = 'prec_val_list.append(' + metric_name + '_score(y_' + dataset_name + ', y_pred_' + dataset_name + '))'
            print(cmd)
            exec(cmd)

            col_name_list.append(metric_name + '_' + dataset_name)
        print('')

    performance_dict[model_name_list[i]] = prec_val_list

df_performance = np.round(pd.DataFrame.from_dict(performance_dict, orient='index', columns=col_name_list),3)

# here's what gets printed (for just one model)
# >>> y_pred_train = clf.predict(X_train)
# >>> prec_val_list.append(precision_score(y_train, y_pred_train))
# >>> prec_val_list.append(recall_score(y_train, y_pred_train))
# >>> prec_val_list.append(f1_score(y_train, y_pred_train))

# >>> y_pred_test = clf.predict(X_test)
# >>> prec_val_list.append(precision_score(y_test, y_pred_test))
# >>> prec_val_list.append(recall_score(y_test, y_pred_test))
# >>> prec_val_list.append(f1_score(y_test, y_pred_test))

# ... and so on for each model

The final results are summarized in the df_performance table built above. We can see all models perform "perfectly" on the training dataset. However, the real test comes from the testing data.

As a rule of thumb, the gap between training and testing performance is indicative of overfitting. Since every model scores (nearly) perfectly on the training data here, the further a model's testing metrics fall from 1, the more it is overfitting.
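If you want that gap as a number, a quick sketch using the df_performance dataframe built above:

# train-test gap per metric and model (larger values = more overfitting)
gap = pd.DataFrame({m: df_performance[m + '_train'] - df_performance[m + '_test']
                    for m in ['precision', 'recall', 'f1']})
print(gap)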

With this view, the lone decision tree classifier suffers the most from overfitting (as expected). Random Forest and Gradient Boosting perform best, with AdaBoost following closely behind.

Get the code from this example here

Tags: Data Science Decision Tree Machine Learning Python Tree Ensemble
