Effectively Optimize Your Regression Model with Bayesian Hyperparameter Tuning

Gradient boosting techniques such as XGBoost, CatBoost, and LightBoost have gained much popularity in recent years for both classification and regression tasks. An important part of the process is tuning the hyperparameters to get the best model performance. The key is to optimize the hyperparameter search space while finding a model that generalizes to new, unseen data. In this blog, I will demonstrate 1. how to learn a boosted decision tree regression model with optimized hyperparameters using Bayesian optimization, 2. how to select a model that can generalize (and is not overtrained), and 3. how to interpret and visually explain the optimized hyperparameter space together with the model performance. The HGBoost library is ideal for this task; among other things, it performs a double-loop cross-validation to protect against overtraining.
A brief introduction.
Gradient boosting algorithms such as Extreme Gradient Boosting (XGBoost), Light Gradient Boosting (LightBoost), and CatBoost are powerful ensemble machine learning algorithms for predictive modeling (classification and regression tasks) that can be applied to tabular, continuous, and mixed data sets [1, 2, 3]. Here I will focus on the regression task. In the following sections, we will train a boosted decision tree model using double-loop cross-validation. We will carefully split the data set, set up the search space, and perform Bayesian optimization using the Hyperopt library. After training the model, we can interpret the results in more depth by creating insightful plots.
If you need more background or are not entirely familiar with these concepts, I recommend reading this blog:
A Guide to Find the Best Boosting Model using Bayesian Hyperparameter Tuning but without…
If you need a hands-on guide for the classification task, I recommend reading this blog:
Hands-on Guide for Hyperparameter Tuning with Bayesian Optimization for Classification Models.
Before we go to the hands-on example, I will first briefly discuss the HGBoost library [4].
The Steps Towards a Hyperoptimized Regression Model.
There are multiple steps required to train a regression model with optimized hyperparameters and to prevent creating an overtrained model. The first three steps form the basis and serve as input to the HGBoost model. Let's go through the steps one by one.
Training a model with optimized hyperparameters is time-consuming and more complex than training with default parameters. To prevent adopting an overtrained model, sanity checks are required.
- Import and initialize the HGBoost library.
- Import data set, and pre-processing.
- Decide and select the most appropriate evaluation metric.
The hgboost library takes care of all the following steps:
- Split the data into a train set, test set, and validation set.
- Create a double-loop cross-validation model where the inner loop is to optimize the hyperparameters and an outer loop for validation and scoring the models.
- Select the best-performing model.
- Evaluate the best-performing model on the independent validation set.
- Learn the final model on the entire dataset.
- Create insightful plots for model and search space evaluation.
The final step is the interpretation of the results. Let's go through each of the steps in the next section.
Step 1. Import and initialize.
We can install HGBoost using pip with the command:
Python"># Installation
pip install hgboost
After the installation, we can import and initialize a model for a regression task. The input parameters can be changed accordingly, or the defaults (code section 1) can be used. I will set max_eval=1000 evaluations. The next step is to read (or import) the data set.
# Import the library
from hgboost import hgboost
# Initialize library.
hgb = hgboost(
max_eval=1000, # Search space is based on the number of evaluations.
threshold=0.5, # Classification threshold. In case of a two-class model, this is 0.5.
cv=5, # k-folds cross-validation.
test_size=0.2, # Percentage split for the test set.
val_size=0.2, # Percentage split for the validation set.
top_cv_evals=10, # Number of top best-performing models that are evaluated.
is_unbalance=True, # Control the balance of positive and negative weights, useful for unbalanced classes.
random_state=None, # Fix the random state to create reproducible results.
n_jobs=-1, # The number of CPU jobs to run in parallel. -1 means using all processors.
gpu=False, # Compute using GPU in case of True.
verbose=3, # Print progress to screen.
)
Step 2. Reading and preprocessing the data set.
For demonstration, let's use the data science salary data set [6], which can be imported using the function import_example(data='ds_salaries'). This data set contains 4134 samples and 11 features. I will set salary_in_usd as the target value. For preprocessing, I will use the internal .preprocessing() function (code section 2), which utilizes the df2onehot library [5] that in turn encodes the categorical values into one-hot. Note that continuous values are not encoded but remain untouched. At this point, it is strongly recommended to do the Exploratory Data Analysis (EDA) and make sanity checks.
The largest model performance gains typically follow from pre-processing, and feature engineering.
After the preprocessing step, there are 4134 rows x 198 columns. Be aware that the preprocessing steps in this example are based on the assumption that we are going to train an XGBoost model. Different models (XGBoost, LightBoost, CatBoost) require different preprocessing steps. As an example, one-hot encoding is required for XGBoost, whereas other models can process non-numeric factors. Read this blog for more information about the (dis)advantages and preprocessing steps.
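Below is a minimal sketch of what this import and preprocessing step can look like in code, based on the function names mentioned above (import_example and .preprocessing()). The exact signatures and the way the target column is separated are assumptions that may differ per hgboost version.
# Sketch of the import and preprocessing step (exact signatures may differ per hgboost version).
from hgboost import hgboost
# Initialize (see code section 1 for all parameters).
hgb = hgboost(max_eval=1000, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10)
# Import the data science salary example data set.
df = hgb.import_example(data='ds_salaries')
# The target value we aim to predict.
y = df['salary_in_usd'].values
# One-hot encode the categorical features; continuous features remain untouched.
X = hgb.preprocessing(df.drop(columns=['salary_in_usd']))
print(X.shape)  # roughly (4134, 198) after encoding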
Step 3. Set the hyperparameter search space.
The search space for hyperparameter optimization is pre-defined in HGBoost and differs slightly between the models XGBoost, LightBoost, and CatBoost (code section 3 below).
#########################################
# You do not need to execute these lines!
# They are the pre-defined search spaces inside hgboost.
#########################################
# Imports needed if you do want to run them yourself:
from hyperopt import hp
import numpy as np

# XGBoost
xgb_reg_params = {
'learning_rate': hp.quniform('learning_rate', 0.05, 0.31, 0.05),
'max_depth': hp.choice('max_depth', np.arange(5, 30, 1, dtype=int)),
'min_child_weight': hp.choice('min_child_weight', np.arange(1, 10, 1, dtype=int)),
'gamma': hp.choice('gamma', [0, 0.25, 0.5, 1.0]),
'reg_lambda': hp.choice('reg_lambda', [0.1, 1.0, 5.0, 10.0, 50.0, 100.0]),
'subsample': hp.uniform('subsample', 0.5, 1),
'n_estimators': hp.choice('n_estimators', range(20, 205, 5)),
}
# LightBoost
lgb_reg_params = {
'learning_rate': hp.quniform('learning_rate', 0.05, 0.31, 0.05),
'max_depth': hp.choice('max_depth', np.arange(5, 30, 1, dtype=int)),
'min_child_weight': hp.choice('min_child_weight', np.arange(1, 8, 1, dtype=int)),
'subsample': hp.uniform('subsample', 0.8, 1),
'n_estimators': hp.choice('n_estimators', range(20, 205, 5)),
}
# CatBoost
ctb_reg_params = {
'learning_rate': hp.quniform('learning_rate', 0.05, 0.31, 0.05),
'max_depth': hp.choice('max_depth', np.arange(2, 16, 1, dtype=int)),
'colsample_bylevel': hp.choice('colsample_bylevel', np.arange(0.3, 0.8, 0.1)),
'n_estimators': hp.choice('n_estimators', range(20, 205, 5)),
}
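HGBoost runs the Bayesian optimization internally, so you never need to call Hyperopt yourself. For intuition only, the snippet below is a minimal, self-contained sketch of how Hyperopt's TPE algorithm would search a reduced version of such a space for an XGBoost regressor. The toy data, objective function, and train/test split are illustrative assumptions, not hgboost internals.
# Minimal Hyperopt sketch (illustrative only; hgboost does this for you).
import numpy as np
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK, space_eval
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# Toy data standing in for the preprocessed salary data set.
X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Reduced search space in the same style as the pre-defined spaces above.
space = {
    'learning_rate': hp.quniform('learning_rate', 0.05, 0.31, 0.05),
    'max_depth': hp.choice('max_depth', np.arange(5, 30, 1, dtype=int)),
    'subsample': hp.uniform('subsample', 0.5, 1),
    'n_estimators': hp.choice('n_estimators', list(range(20, 205, 5))),
}

def objective(params):
    # Train one model for the sampled parameters and return the loss to minimize.
    model = XGBRegressor(random_state=1, **params)
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    return {'loss': mae, 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
# fmin returns indices for hp.choice parameters; space_eval maps them back to values.
print(space_eval(space, best))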
Step 4. Train/Test/Validation Sets and Evaluation Metrics.
For supervised machine learning tasks, it is important to split the data to avoid overfitting when learning the model. Overfitting is when the model fits (or learns) the data too well and then fails to predict (new) unseen data reliably.
Without splitting the data carefully, you risk overfitting the model parameters and fitting (or learning) the data too well. The model can then easily fail to correctly predict new (unseen) data samples.
In the training process, the data set is divided into a training set, a testing set, and an independent validation set. The validation set remains untouched during the entire training process and is used only once to evaluate the final model performance. Splitting the data set is done in percentages, such as test_size=0.2 and val_size=0.2. The model evaluation metric can be set with the eval_metric parameter. The default evaluation metric is the Root Mean Squared Error (RMSE), but other flavors such as the Mean Squared Error (MSE) or Mean Absolute Error (MAE) can also be used.
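To make these metrics concrete, below is a small, self-contained illustration (using scikit-learn, not part of HGBoost) of how MSE, RMSE, and MAE are computed for the same set of predictions. Note that MAE stays in the unit of the target value (USD in our case), which is why it is used later on for its interpretability.
# Illustration of the evaluation metrics (not part of hgboost).
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([60000, 80000, 120000, 150000])  # true salaries in USD
y_pred = np.array([65000, 70000, 130000, 140000])  # hypothetical model predictions in USD

mse = mean_squared_error(y_true, y_pred)   # Mean Squared Error: 81250000.0
rmse = np.sqrt(mse)                        # Root Mean Squared Error: ~9014 USD
mae = mean_absolute_error(y_true, y_pred)  # Mean Absolute Error: 8750 USD
print(mse, rmse, mae)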
Step 5. Fit, Optimize Hyperparameters, and Select The Best Model.
At this point, we have preprocessed the data for XGBoost and decided which evaluation metric to use. We can now start fitting a model and optimizing the hyperparameters. HGBoost sets up an inner loop for optimizing the hyperparameters using Bayesian optimization, and an outer loop to test how well the best-performing models generalize using k-fold cross-validation. The search space depends on the available hyperparameters. The evaluation metric is set to MAE because of its good interpretability. In other words, an MAE of 10000 with salary as the target value means that, on average, the prediction deviates 10000 USD from the true value.
##########################################
# Train model
##########################################
results = hgb.xgboost_reg(X, y, eval_metric='mae') # XGBoost
# results = hgb.lightboost_reg(X, y, eval_metric='mae') # LightBoost
# results = hgb.catboost_reg(X, y, eval_metric='mae') # CatBoost
# [hgboost] >Start hgboost regression.
# [hgboost] >Collecting xgb_reg parameters.
# [hgboost] >method: xgb_reg
# [hgboost] >eval_metric: mae
# [hgboost] >greater_is_better: False
# [hgboost] >*********************************************************************************
# [hgboost] >Total dataset: (4134, 198)
# [hgboost] >Validation set: (827, 198)
# [hgboost] >Test-set: (827, 198)
# [hgboost] >Train-set: (2480, 198)
# [hgboost] >*********************************************************************************
# [hgboost] >Searching across hyperparameter space for best performing parameters using maximum nr. evaluations: 1000
# 100%|██████████| 1000/1000 [04:14<00:00, 1.02s/trial, best loss: 35223.92493340009]
# [hgboost]> Collecting the hyperparameters from the [1000] trials.
# [hgboost] >[mae]: 35220 Best performing model across 1000 iterations using Bayesian Optimization with Hyperopt.
# [hgboost] >*********************************************************************************
# [hgboost] >5-fold cross validation for the top 10 scoring models, Total nr. tests: 50
# [hgboost] >[mae] (average): 35860 Best 5-fold CV model using optimized hyperparameters.
# [hgboost] >*********************************************************************************
# [hgboost] >Evaluate best [xgb_reg] model on validation dataset (827 samples, 20%)
# [hgboost] >[mae]: 35270 using optimized hyperparameters on validation set.
# [hgboost] >[mae]: 35540 using default (not optimized) parameters on validation set.
# [hgboost] >*********************************************************************************
# [hgboost] >Retrain [xgb_reg] on the entire dataset with the optimal hyperparameters.
# [hgboost] >Fin!
By running the above code section, we iterated across the search space and created 1000 different models (max_eval=1000) whose performances are evaluated. Next, the models are ranked on their initial model performance. We then evaluate the robustness of the top K best-performing models (top_cv_evals=10) using the 5-fold cross-validation scheme (cv=5). In this manner, we aim to prevent selecting an overtrained model. In this example, the best-scoring model across the 1000 iterations scored MAE=35220, while the average MAE based on the 5-fold cross-validation was MAE=35860. The MAE on the independent validation set is 35270 using the best model from the 5-fold CV. Although we do not select the best-scoring model, we do prevent taking a model that is likely overfitted.
The best model is not necessarily the one with the best model performance but the one that can generalize on (new) unseen data, providing accurate predictions.
We now have a model that is ready to make predictions on new data. But before making predictions, let's first try to understand what happened in all the steps above.
Step 6. Interpretation of the results.
At this point, we have a trained model and briefly looked at the regression results based on the cross-validation and the independent validation set. All tested hyperparameters for the different models are returned and can be examined in more depth (code section below). See also Figure 1, where the summary of all model results (results['summary']) is shown as a data frame. In addition, we can also examine the results by making plots.
# Results are stored in the object and can be found in:
print(hgb.results)
# Results are also returned by the model:
print(results.keys())
# ['params', 'summary', 'trials', 'model', 'val_results', 'comparison_results']
###########################################################################################
# The params contains the parameters to create the best performing model.
print(results['params'])
# {'gamma': 1.0,
# 'learning_rate': 0.1,
# 'max_depth': 25,
# 'min_child_weight': 1,
# 'n_estimators': 55,
# 'reg_lambda': 1.0,
# 'subsample': 0.5522011244420446}
###########################################################################################
# The summary contains the model evaluations.
print(results['summary'])
# gamma gpu_id learning_rate ... best_cv loss_validation default_params
# 0 0.5 0 0.1 ... 0.0 NaN False
# 1 1.0 0 0.3 ... 0.0 NaN False
# 2 0 0 0.1 ... 0.0 NaN False
# 3 1.0 0 0.05 ... 0.0 NaN False
# 4 0.25 0 0.3 ... 0.0 NaN False
# .. ... ... ... ... ... ... ...
# 246 1.0 0 0.15 ... 0.0 NaN False
# 247 0.25 0 0.1 ... 0.0 NaN False
# 248 1.0 0 0.05 ... 0.0 NaN False
# 249 0.5 0 0.2 ... 0.0 NaN False
# 250 None NaN None ... NaN 35539.674538 True
# [251 rows x 20 columns]
###########################################################################################
# The trials contains the object from hyperopt in case you want to further investigate.
print(results['trials'])
#
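HGBoost already retrains the final model on the entire data set for you, but if you want to re-create or inspect the tuned estimator outside of the library, the optimized parameters in results['params'] can, for an XGBoost model, be passed straight to the xgboost API. A small sketch, assuming the parameter names map one-to-one onto xgboost keyword arguments (as they do in the dictionary above):
# Sketch: rebuild the tuned estimator from the optimized parameters.
from xgboost import XGBRegressor

params = results['params']      # e.g. {'gamma': 1.0, 'learning_rate': 0.1, ...}
model = XGBRegressor(**params)  # keys correspond to xgboost keyword arguments
model.fit(X, y)                 # refit on the (preprocessed) data
print(model.get_params())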

Creating plots is important to examine the model performance in more depth and to investigate how the hyperparameters are tuned during the Bayesian optimization process.
Gaining intuition about the model results is important to know whether the model parameters are chosen reliably.
We can create the following plots using the built-in functionalities of HGBoost:
- Plots to investigate the hyperparameter space.
- Plot to summarize all evaluated models.
- Plot the performances for the cross-validations.
- Plot the results of the independent validation set.
- Plot the decision tree to understand how features are used.
# Plot the hyperparameter tuning.
hgb.plot_params()
# Plot the summary of all evaluated models.
hgb.plot()
# Plot results on the validation set.
hgb.plot_validation()
# Plot results on the k-fold cross-validation.
hgb.plot_cv()
# Plot the best performing tree.
hgb.treeplot()
Interpretation of the Hyperparameter Tuning.
Let's start by investigating how the hyperparameters are tuned during the Bayesian optimization process. With the function .plot_params() we can create insightful plots as depicted in Figure 2. This figure contains multiple histograms (or kernel density plots), where each subplot shows a single parameter that is optimized during the 1000 model iterations. The small bars at the bottom of each histogram depict the 1000 evaluations, whereas the black dashed vertical lines depict the specific parameter values used across the top 10 best-performing models. The green dashed line depicts the best-performing model without the cross-validation approach, and the red dashed line depicts the best-performing model with cross-validation.
Let's have a look at Figure 2A. In the bottom-left corner of the figure is the parameter subsample, for which the values range from 0.5 up to 1. There is a clear peak around 0.95, which indicates that the Bayesian optimization explored this region more intensively. Our best-performing model seems to use subsample=0.956 (red dashed line). But there is more to look at. When we now look at Figure 2B, we also see the subsample parameter, for which each dot is one of the 1000 models. The horizontal axis shows the iterations and the vertical axis the optimized values. For this parameter, there is a clear trend during the iterations of the optimization process. It first explored the lower-bound regions and then moved towards the upper-bound regions because models apparently scored better when the subsample value was increased.
In this manner, all the hyperparameters can be interpreted in relation to the different models.

Interpretation of the model performance across all evaluated models.
With the plot function .plot() we gain insight into the performance (MAE in this case) of the 1000 models (Figure 3). The green dashed line depicts the best-performing model without the cross-validation approach, and the red dashed line depicts the best-performing model with cross-validation. Our selected model is the one with the red dashed line. In general, we can see a trend that model performance improves during the iterations as the MAE scores get lower. This indicates that the Bayesian optimization works very well. Furthermore, when we look at the top 10 scoring models (marked with red squares), their performance is slightly worse during the k-fold cross-validation. In other words, the best model across the 1000 iterations is depicted with the green dashed line, but in the CV it does not perform best. Therefore, a model from the top 10 that generalizes better is selected (red dashed line), i.e., the one with the lowest average MAE across the k-fold CV. In this manner, we aim to select the best-performing model that can also generalize.

Interpretation of the model performance on the independent validation set.
To make sure the model can generalize to unseen data, we use the independent validation set. Figure 4 depicts the regression line, and the predictions do not show strong outliers or a consistent over- or underestimation. An underestimation is seen for the low salaries, where the model predicted lower salaries than the real ones.

To investigate in more depth how our final model generalizes, we can plot the results for the 5-fold cross-validation using the function .plot_cv(). This creates the regression plots for the different folds, as shown in Figure 5. Here we can see that the slope of the model is more or less similar across the different folds. Thus, our final model does not show irregularities in the k-fold CV.

Decision Tree plot of the best model.
With the decision tree plot (Figure 6), we can get a better understanding of how the model makes its decisions. It may also give some intuition about whether such a model can generalize to other data sets. Note that the best tree is returned by default (num_trees=0), but many trees are created, and these can be returned by specifying the input parameter, e.g., .treeplot(num_trees=1). In addition, we can also plot the best-performing features, as depicted in Figure 7. The features Remote, work year, and experience level are the top 3 most important features in the prediction of the salary.
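As a quick reference, the calls look roughly as follows; only the num_trees argument described above is used here.
# Plot the best-performing tree (default) or a specific tree from the ensemble.
hgb.treeplot()             # best tree (num_trees=0)
hgb.treeplot(num_trees=1)  # another individual tree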


Make Predictions on new data
After having the final trained model, we can now use it to make predictions on new data. Suppose that X is new data that has been preprocessed in the same way as the training data; we can then use the .predict(X) function to make predictions. This function returns the predicted values, together with the probabilities in case of a classification model (code section 5).
# Make new predictions using the model. Suppose the X is new and unseen data.
y_pred, y_proba = hgb.predict(X)
Save and load model
Saving and loading models can come in handy. To accomplish this, there are two functions: .save() and .load().
# Save model
status = hgb.save(filepath='hgboost_model.pkl', overwrite=True)
# [pypickle] Pickle file saved: [hgboost_model.pkl]
# [hgboost] >Saving.. True
# Load model
from hgboost import hgboost # Import library when using a fresh start
hgb = hgboost() # Initialize hgboost
results = hgb.load(filepath='hgboost_model.pkl') # Load the pickle file with model parameters and trained model.
# [pypickle] Pickle file loaded: [hgboost_model.pkl]
# [hgboost] >Loading succesful!
# Make predictions again
y_pred, y_proba = hgb.predict(X)
Wrapping up.
I demonstrated how to train a regression model with optimized hyperparameters by splitting the data set into a train, test, and independent validation set. Within the train-test set, there is an inner loop for optimizing the hyperparameters using Bayesian optimization (with Hyperopt) and an outer loop to score how well the top-performing models generalize based on k-fold cross-validation. In this manner, we aim to select the model that generalizes with the best accuracy. Note that in our example, the best-scoring model has (more or less) a similar score to a model with default parameters. This strengthens my point that an important part before training any model is the typical modeling workflow: start with the Exploratory Data Analysis (EDA), then iteratively do the cleaning, feature engineering, and feature selection. The largest performance gains typically follow from these steps.
The HGBoost library also supports learning classification models, multi-class models, and even an ensemble of boosting tree models. For all tasks, the same double-loop cross-validation scheme is applied to make sure the best-performing and most robust model is selected. Furthermore, HGBoost utilizes the Hyperopt library [7, 8] for the Bayesian optimization algorithm, which is one of the most popular libraries for hyperparameter optimization.
Be safe. Stay frosty.
Cheers, E.
If you find this article helpful, you are welcome to follow me because I write more about model training and optimizations. If you are thinking of taking a Medium membership, you can support my work a bit by using my referral link. It is the same price as a coffee but allows you to read unlimited articles monthly.
References
- Nan Zhu et al, XGBoost: Implementing the Winningest Kaggle Algorithm in Spark and Flink.
- Welcome to LightGBM's documentation!
- CatBoost is a high-performance open source library for gradient boosting on decision trees.
- E. Taskesen, 2020, Hyperoptimized Gradient Boosting.
- E. Taskesen, 2019, df2onehot: Convert unstructured DataFrames into structured dataframes.
- Kaggle, Machine Learning from Disaster.
- James Bergstra et al, 2013, Hyperopt: A Python Library for Optimizing the Hyperparameters of Machine Learning Algorithms.
- Kris Wright, 2017, Parameter Tuning with Hyperopt.