Ensemble Learning with Scikit-Learn: A Friendly Introduction


Fundamental learning algorithms such as logistic regression or linear regression are often too simple to achieve adequate results on a machine learning problem. While a possible solution is to use neural networks, they require a vast amount of training data, which is rarely available. Ensemble learning techniques can boost the performance of simple models even with a limited amount of data.

Imagine asking a person to guess how many jellybeans there are inside a big jar. One person's answer is unlikely to be a precise estimate of the correct number. If instead we ask a thousand people the same question, the average answer will likely be close to the actual number. This phenomenon is called the _wisdom of the crowd_ [1]. When dealing with complex estimation tasks, the crowd can be considerably more precise than an individual.

Ensemble learning algorithms take advantage of this simple principle by aggregating the predictions of a group of models, such as regressors or classifiers. For a classification task, the ensemble model can simply pick the most common class among the predictions of the low-level classifiers. For a regression task, the ensemble can instead use the mean or the median of all the predictions.

Image by the author.
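As a minimal sketch of the majority-vote idea, here is how such an aggregation could look with scikit-learn's VotingClassifier, combining three simple classifiers; the Iris dataset is used purely as placeholder data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hard voting: the ensemble predicts the class chosen by the majority of its members.
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
voting_clf.fit(X_train, y_train)
print("Test accuracy:", voting_clf.score(X_test, y_test))
```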

By aggregating a large number of weak learners, i.e. classifiers or regressors that are only slightly better than random guessing, we can achieve remarkable results. Consider a binary classification task: by aggregating 1,000 independent classifiers, each with an individual accuracy of 51%, we can create an ensemble reaching an accuracy of about 75% [2].
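A quick back-of-the-envelope check of that figure, under the idealized assumption that the classifiers' errors are perfectly independent (in practice they are correlated, so the real gain is smaller):

```python
from scipy.stats import binom

# Probability that a strict majority of n independent classifiers is correct,
# when each one is individually correct with probability p.
n, p = 1000, 0.51
majority_accuracy = binom.sf(n // 2, n, p)  # P(more than 500 correct votes)
print(f"{majority_accuracy:.2f}")  # ~0.73, in the same ballpark as the figure from [2]
```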

This is the reason why ensemble algorithms are often the winning solutions in many machine-learning competitions!

There exist several techniques to build an ensemble learning algorithm; the principal ones are bagging, boosting, and stacking. In the following sections, I briefly describe each of these principles and present the machine learning algorithms that implement them.

Bagging

The first technique to form an ensemble algorithm is called bagging, which is the abbreviation of bootstrap aggregating. The core idea is to provide each weak learner with a slightly different training set. This is done by randomly sampling the original training set.

If the sampling is done with replacement, the technique is called bagging, otherwise, if the sampling is done without replacement, the technique is properly called pasting.

The key idea of bagging (and pasting) is to create weak learners that are as diverse as possible, and this is done by training them on randomly generated training sets. As the sampling is performed with replacement, bagging introduces slightly more diversity than pasting, and for this reason it is typically preferred.

Let's see how bagging works through an example. Suppose the original training set is made of 10 examples, and we want to build an ensemble of 3 different weak learners, each trained on a subset of the original training set of size 5. The following picture illustrates how the training set could be divided:

Image by the author.

Pasting instead doesn't allow the repetition of the same training example in a model's training subset:

Image by the author.
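Both schemes can be reproduced with scikit-learn's BaggingClassifier: as a hedged sketch, the only difference between bagging and pasting is the bootstrap flag (the Iris dataset is used here only as placeholder data):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: each tree sees a random sample drawn WITH replacement.
bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=0.5,
    bootstrap=True, random_state=42,
)

# Pasting: same idea, but the samples are drawn WITHOUT replacement.
pasting_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100, max_samples=0.5,
    bootstrap=False, random_state=42,
)

for name, clf in [("bagging", bagging_clf), ("pasting", pasting_clf)]:
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```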

Random Forests

The most common bagging model is the Random Forest: an ensemble of decision trees, each one trained on a slightly different training set.

For each node of each decision tree in a Random Forest, the algorithm randomly selects the subset of features among which to search for the best possible split. In other words, the algorithm doesn't look for the best possible split among all the features; it only searches among a random subset of the available features. This is where the "random" in the name comes from.

The following snippet shows how to create and fit a Random Forest Classifier on some training data.
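Here, as a minimal sketch, the Iris dataset stands in for the training data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small example dataset and keep a held-out test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Create and fit a Random Forest of 500 trees.
rnd_clf = RandomForestClassifier(n_estimators=500, max_depth=4, random_state=42)
rnd_clf.fit(X_train, y_train)

print("Test accuracy:", rnd_clf.score(X_test, y_test))
```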

Some hyperparameters are extremely important, and I suggest tuning them through a grid search: n_estimators defines the number of weak learners, max_features is the number of features considered at each split, and max_depth is the maximum depth of the trees. These hyperparameters play a fundamental role in regularization: increasing n_estimators and min_samples_leaf, or reducing max_features and max_depth, helps to keep the individual trees simple and diverse, and consequently to avoid overfitting.
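A hedged sketch of such a grid search with scikit-learn's GridSearchCV; the parameter grid below is purely illustrative, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Illustrative grid over the regularization-related hyperparameters discussed above.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", 0.5],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```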

For all the other hyperparameters, check the comprehensive documentation.

Extra Trees

To increase the randomness of the trees even further, we can use the Extremely Randomized Trees algorithm, or Extra Trees for short. It works similarly to Random Forests, but at each node of each decision tree the algorithm searches for the best possible split among a few randomly drawn thresholds for each candidate feature. Random Forest, instead, searches for the best possible threshold over the whole value range of each feature.

Extra Trees has one additional random component with respect to Random Forest. For this reason, it trades more bias for a lower variance [2].

The second advantage of Extra Trees is that its training is faster than Random Forest's, as it doesn't have to look for the best possible split over the entire value domain of the selected features: it only considers a few random candidate thresholds.

The implementation of an Extra Trees regressor or classifier with Scikit-Learn is as simple as Random Forest's implementation:
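Again, as a minimal sketch, the Iris dataset serves only as placeholder data; ExtraTreesRegressor works the same way for regression:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same interface as RandomForestClassifier, with extra randomness in the splits.
extra_clf = ExtraTreesClassifier(n_estimators=500, max_depth=4, random_state=42)
extra_clf.fit(X_train, y_train)

print("Test accuracy:", extra_clf.score(X_test, y_test))
```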

The parameters are the same as for the Random Forest model.

It is generally very difficult to predict whether Random Forest or Extra Trees will work better for a given task. For this reason, I suggest training both of them and comparing their performance.

Feature Importance

A nice aspect of both Random Forest and Extra Trees is that they provide a measure of the ability of each feature to reduce the impurity of the dataset: this is called feature importance. In other words, a feature with a high importance provides better insights for the classification or regression problem than a feature with a low importance.

After training a model, we can access this measure through the feature_importances_ attribute.
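As a minimal sketch on the Iris dataset, which is the data the scores discussed below refer to:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# One importance score per feature; the scores sum to 1.
for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(f"{name}: {score:.3f}")
```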

Image by the author.

Once the model is trained, Scikit-Learn computes the feature importances automatically. The scores lie in the [0, 1] range and sum to 1, with higher scores assigned to the more important features.

In the example above, the features 'petal length' and 'petal width' receive a substantially higher score than the others. While 'sepal length' has moderate importance for our model, the feature 'sepal width' appears to be useless for this classification task.

Boosting

Moving on to another noteworthy ensemble technique, we now dive into boosting, a potent counterpart to bagging in the world of machine learning.

Boosting, short for hypothesis boosting, follows a different philosophy: instead of creating diverse learners through random sampling, boosting focuses on enhancing the performance of individual weak learners by training predictors sequentially. Each predictor aims at correcting its predecessor. It's like a coach putting extra practice into a player's weak spots to make them an all-around star.
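To make the "correct the predecessor" idea concrete, here is a hand-rolled sketch of residual fitting with two small regression trees on synthetic data; this only illustrates the intuition behind gradient boosting, not how a library implements it:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1D regression data, used only for illustration.
rng = np.random.RandomState(42)
X = rng.rand(200, 1)
y = 3 * X[:, 0] ** 2 + 0.05 * rng.randn(200)

# The first weak learner fits the raw targets.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

# The second weak learner fits the residual errors left by the first one.
residuals = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals)

# The ensemble prediction is the sum of the sequential predictors.
X_new = np.array([[0.4]])
print(tree1.predict(X_new) + tree2.predict(X_new))  # close to 3 * 0.4**2 = 0.48
```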

Boosting, with its emphasis on gradual, targeted refinement, often outperforms bagging in scenarios where predictive accuracy is the top priority.

We'll now look at the most famous and successful boosting technique: XGBoost.

XGBoost

XGBoost, short for Extreme Gradient Boosting, is an ensemble learning method based on gradient boosting, designed to systematically improve the predictive performance of a sequence of weak learners.

In XGBoost, each new weak learner is trained to correct the errors left by the current ensemble: roughly speaking, it fits the gradients (residuals) of the loss function, so training concentrates on the instances that are still predicted poorly. This iterative process continues, with the model adapting its focus to address the most challenging data points.

XGBoost is a favored choice among data scientists and machine learning practitioners for its ability to extract valuable insights and deliver superior predictive accuracy. It is also largely employed in ML competitions due to its speed and scalability.

Here we can see how to apply XGBoost to real data:
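As a minimal sketch, the breast cancer dataset from scikit-learn stands in for "real data" here, and the xgboost package is assumed to be installed; the exact accuracy depends on the dataset and split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# A binary classification dataset, used here purely as an example.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Default hyperparameters; tuning would follow the same grid-search pattern as above.
xgb_clf = XGBClassifier(random_state=42)
xgb_clf.fit(X_train, y_train)

y_pred = xgb_clf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```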

Without any hyperparameter tuning, the model is able to achieve around 98% accuracy.

Conclusion

As we conclude this introduction to ensemble algorithms in machine learning, it's important to recognize that we have only laid the foundation for deeper and more specialized investigations. While we've touched upon some fundamental concepts, the field extends far beyond these introductory insights.

I recommend digging into the resources and references attached to this article. These sources offer detailed insights into advanced ensemble methodologies, algorithm optimizations, and practical implementation tips.


If you liked this story, consider following me to be notified of my upcoming projects and articles!

Here are some of my past projects:

Euro Trip Optimization: Genetic Algorithms and Google Maps API Solve the Traveling Salesman Problem

Building a Convolutional Neural Network from Scratch using Numpy

References

[1] The Wisdom of the Crowd – University of Cambridge, Faculty of Mathematics

[2] Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition – Aurélien Géron

[3] Python Data Science Handbook: Essential Tools for Working with Data – Jake VanderPlas

[4] XGBoost documentation

[5] Ensemble Methods: Foundations and Algorithms – Zhi-Hua Zhou
