How to Do Cross-Validation Effectively


Cross-validation is a critical step in building robust Machine Learning models. But it is often not applied to its full potential.

In this article, we'll explore two important practices to get the most out of cross-validation: re-training and nesting.

Let's get started!


What is cross-validation?

Cross-validation is a technique for evaluating the performance of a model.

This process usually involves testing several techniques, or optimizing the hyperparameters of a particular method. In either case, your goal is to check which alternative works best for the input data.

The idea is to select the approach that maximizes performance. This is the model that will be deployed into production. On top of that, you also want a reliable estimate of that model's performance.
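As a quick illustration, here's how you could estimate a model's performance with 5-fold cross-validation in scikit-learn. The data set and the linear model below are placeholders, not a recommendation:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# dummy data set for illustration
X, y = make_regression(n_samples=100)

# estimate performance with 5-fold cross-validation
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores.mean())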

Re-training After Cross-Validation

After cross-validation, you should re-train the best model using all available data. Image by author.

Suppose you do cross-validation to select a model. You test many alternatives using 5-fold cross-validation. Then, a linear regression comes out on top.

What should you do next?

Should you re-train the linear regression using all available data? Or should you use the models trained during cross-validation?

This part creates some confusion among data scientists, not only beginners but also more seasoned professionals.

After cross-validation, you should re-train the best approach using all available data. Here's a quote from the classic book The Elements of Statistical Learning [1] (brackets mine):

Our final chosen model [after cross-validation] is f(x), which we then fit to all the data.

But this idea is not universally accepted.

Some practitioners keep the models trained during cross-validation. Following the example above, you'd keep the 5 linear regression models. Then, at deployment time, you'd average their outputs to make each prediction.
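For illustration only, that approach could look something like the sketch below, which uses scikit-learn's cross_validate with return_estimator=True to keep the fold models and average their predictions:

import numpy as np

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# dummy data set (placeholder)
X, y = make_regression(n_samples=100)

# keep the model trained in each of the 5 folds
cv_results = cross_validate(LinearRegression(), X, y, cv=5, return_estimator=True)
fold_models = cv_results['estimator']

# at deployment time, average the predictions of the 5 fold models
preds = np.mean([m.predict(X) for m in fold_models], axis=0)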

That's not how cross-validation works.

There are two problems with this approach:

  • It uses less data for training;
  • It increases costs, because you have to maintain several models.

Less data

By not re-training, you're not using all available instances for creating a model.

This can lead to a sub-optimal model unless you have tons of data. A model trained on all available instances is likely to generalize better.

Re-training is especially important in time series, because the most recent observations are used for testing. Without re-training on them, the model may miss newly emerged patterns.
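Here's a minimal sketch of that idea, using scikit-learn's TimeSeriesSplit on a dummy series. The data and the single lag feature are placeholders for illustration:

import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# dummy time series framed as a supervised problem (placeholder data)
rng = np.random.default_rng(1)
series = rng.normal(size=200).cumsum()
X = series[:-1].reshape(-1, 1)  # previous value as the single feature
y = series[1:]                  # next value as the target

# evaluation: each test fold contains observations more recent than its training fold
scores = cross_val_score(LinearRegression(), X, y, cv=TimeSeriesSplit(n_splits=5))

# deployment: re-train on the full series so recent patterns are not missed
final_model = LinearRegression().fit(X, y)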

Increased costs

One can argue that combining the 5 models trained during cross-validation leads to better performance.

Yet, it's important to understand the implications. You're no longer using a simple, interpretable linear regression.

Your model is an ensemble whose individual models are trained by random subsampling. Random subsampling is a way of introducing diversity in ensembles. Ensembles often perform better than single models. But, they also lead to extra costs and lower transparency.

What if you just keep one, instead of combining all models?

That would solve the problem of increased costs. Yet, it's not clear which version of the model you should choose.

There are two reasons to skip re-training: the data set is very large, or re-training is too costly. These two issues are often linked.

Re-training – Practical example

Here's an example of how you can re-train the best model after cross-validation:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# creating a dummy dataset
X, y = make_regression(n_samples=100)

# 5-fold cross-validation
cv = KFold(n_splits=5)

# optimizing the number of trees of a RF
model = RandomForestRegressor()
param_search = {'n_estimators': [10, 50, 100]}

# applying cross-validation with a grid-search
# and re-training the best model afterwards
gs = GridSearchCV(estimator=model, cv=cv, refit=True, param_grid=param_search)
gs.fit(X, y)

The goal is to optimize the number of trees in a Random Forest. This is done with the GridSearchCV class from scikit-learn. With the parameter refit=True (the default), the best model is automatically re-trained on all the data after cross-validation.

You can do this explicitly by getting the best parameters from GridSearchCV to initialize a new model:

best_model = RandomForestRegressor(**gs.best_params_)
best_model.fit(X, y)
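Alternatively, since refit=True, the re-trained model is already stored in the GridSearchCV object:

# the model re-fitted on all the data after the grid-search
best_model = gs.best_estimator_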

Getting Reliable Performance Estimates

When developing a model, you want to achieve three things:

  1. Select a model among many alternatives;
  2. Train the selected model and deploy it;
  3. Get a reliable estimate of the performance of the selected model.

Cross-validation and re-training cover the first two points, but not the third.

Why is that?

Cross-validation is often repeated several times before selecting a final model. You test different transformations and hyperparameters. So, you end up adjusting your method until you're happy with the result.

This can lead to overfitting because the details of the validation sets can leak into the model. Thus, the performance estimate you get from cross-validation can be too optimistic. You can read more about this in the article in reference [2].

This is one of the reasons why Kaggle competitions have two leaderboards, one public and one private. This helps prevent competitors from overfitting to the test set.

So, how do you solve this problem?

You should add an extra evaluation step. After cross-validation, you evaluate the selected model on a held-out test set. The full workflow looks like this:

  1. Split the available data into training and testing sets;
  2. Apply cross-validation with the training set to select a model;
  3. Re-train the chosen model using the training data and evaluate it on the test set. This provides you with an unbiased performance estimate;
  4. Re-train the chosen model using all available data and deploy it.

Here's a visual description of this process:

Applying cross-validation on the training data. After cross-validation, the chosen model is re-trained and evaluated on the test set. Finally, the chosen model is re-trained with all available data and deployed. Image by author.

Practical example

Here's a practical example of the complete process using scikit-learn. You can check the comments for more context.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import (GridSearchCV,
                                     KFold,
                                     train_test_split)
# creating a dummy data set
X, y = make_regression(n_samples=100)

# train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=False, test_size=0.2)

# cv procedure
cv = KFold(n_splits=5)

# defining the search space
# a simple optimization of the number of trees of a RF
model = RandomForestRegressor()
param_search = {'n_estimators': [10, 50, 100]}

# applying CV with a gridsearch on the training data
gs = GridSearchCV(estimator=model, 
                  cv=cv, 
                  param_grid=param_search)
## the proper way of doing model selection
gs.fit(X_train, y_train)

# re-training the best approach for testing
chosen_model = RandomForestRegressor(**gs.best_params_)
chosen_model.fit(X_train, y_train)

# inference on test set and evaluation
preds = chosen_model.predict(X_test)
## unbiased performance estimate on test set
estimated_performance = r2_score(y_test, preds)

# final model for deployment
final_model = RandomForestRegressor(**gs.best_params_)
final_model.fit(X, y)

Nested cross-validation

The above is a simplified version of what's called nested cross-validation.

In nested cross-validation, you carry out a full internal cross-validation process in each fold of an external cross-validation process. The goal of the internal process is to select the best model. Then, the external process provides unbiased performance estimates for this model.

Nested cross-validation becomes computationally expensive quite quickly, so it's only practical on small data sets.
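For reference, here's a minimal sketch of nested cross-validation with scikit-learn, reusing the same dummy data set and search space as before. The grid-search is the internal process; cross_val_score runs the external one:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# dummy data set (placeholder)
X, y = make_regression(n_samples=100)

# internal process: model selection with a grid-search
inner_cv = KFold(n_splits=5)
gs = GridSearchCV(estimator=RandomForestRegressor(),
                  param_grid={'n_estimators': [10, 50, 100]},
                  cv=inner_cv)

# external process: unbiased performance estimates of the selected approach
outer_cv = KFold(n_splits=5)
nested_scores = cross_val_score(gs, X, y, cv=outer_cv)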

Most practitioners settle for the simplified process from the previous section. If you have a large data set, you can also replace the cross-validation procedure with a single split. This way, you get three partitions: training, validation, and testing, as sketched below.
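With a large data set, that single-split variant can be as simple as two calls to train_test_split (the split sizes below are arbitrary):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# dummy data set (placeholder)
X, y = make_regression(n_samples=100)

# first split off the test set, then carve a validation set out of the remaining data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)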


Key Takeaways

Nesting and re-training are two essential aspects of cross-validation.

The performance estimates of the model selected by cross-validation can be too optimistic. So, you should make a three-way split to get reliable estimates. A three-way split is a simplified form of nested cross-validation.

After selecting a model, or estimating its performance, you should re-train it with all available data. This way, the model is likely to perform better on new observations.

Thanks for reading, and see you in the next story!

References

[1] Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Vol. 2. New York: Springer, 2009.

[2] Cawley, Gavin C., and Nicola LC Talbot. "On over-fitting in model selection and subsequent selection bias in performance evaluation." The Journal of Machine Learning Research 11 (2010): 2079–2107.
