Reduce Bias in Time Series Cross Validation with Blocked Split

Author: Murphy


In my last post, I introduced cross validation for time series data with an expanding window approach, where the training set grows with each fold while the validation set stays the same size.

This is a great way to get started with cross validating time series data. It introduces the idea that you should never randomly split your dataset, and that your validation set should always come after your training set.

But there's more we need to take into account.

The expanding window approach gradually increases the size of the training data. Because of this, every iteration except the first contains all of the training data from the previous iteration.

Since the training set keeps growing, there's a possibility of the model overfitting to the training dataset's patterns and reporting great performance. But once you try to predict on a final, holdout test set, the performance doesn't quite match what you previously saw.
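To see the overlap for yourself, here's a quick sketch using scikit-learn's TimeSeriesSplit (the expanding window splitter) on a 12-sample toy array. Notice how each fold's training set contains all of the previous fold's training data:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)

# Expanding window: the training indices accumulate from fold to fold
tscv = TimeSeriesSplit(n_splits=3)
for i, (train_index, test_index) in enumerate(tscv.split(X)):
    print(f"Fold {i + 1}: train={train_index}, test={test_index}")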

Blocked time series split introduces a solution: it still maintains the temporal order of the data, but the training and validation blocks of different folds never overlap.

Blocked Time Series Split. Image by author

This is especially useful because if you are cross validating, you should already know the training set size you'll be using. For example, if you know you'll be using one month of historical hourly data to predict the next 24 hours, you want your train/test splits in CV to mimic this process: training on March to predict the first 24 hours of April, then training on April (minus its first 24 hours) to predict the first 24 hours of May, and so on until you reach your desired number of folds.
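To make that layout concrete, here's a back-of-the-envelope sketch of where the fold boundaries would land (assuming 30-day months of hourly data, so the numbers are purely illustrative):

hours_per_day = 24
train_size = 30 * hours_per_day   # e.g. all of March
test_size = 1 * hours_per_day     # e.g. the first 24 hours of April
block_size = train_size + test_size

# Each fold occupies its own non-overlapping block of the series
for fold in range(3):
    start = fold * block_size
    print(f"Fold {fold + 1}: train hours [{start}, {start + train_size}), "
          f"test hours [{start + train_size}, {start + block_size})")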

This way you can get a more accurate idea of how well the model will actually perform in production.

Unfortunately, scikit-learn doesn't ship a ready-made class like TimeSeriesSplit for blocked splitting. You have to make it yourself. Luckily, that's all you have to do. As long as your BlockedTimeSeriesSplit class follows the interface of scikit-learn's other splitting classes (e.g. KFold, TimeSeriesSplit), with a split() method that yields train/test indices and a get_n_splits() method, you can use it for cross validation with scikit-learn's cross_validate.

Here's how to create the class and implement its methods:

import numpy as np

# Adapted from: https://hub.packtpub.com/cross-validation-strategies-for-time-series-forecasting-tutorial/
class BlockedTimeSeriesSplit:
    def __init__(self, n_splits):
        # Number of non-overlapping blocks to split the data into
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        # Required by scikit-learn's cross_validate
        return self.n_splits

    def split(self, X, y=None, groups=None):
        # Divide the data into n_splits equally sized, non-overlapping
        # blocks, then split each block 80/20 into train and validation
        n_samples = len(X)
        k_fold_size = n_samples // self.n_splits
        indices = np.arange(n_samples)

        # Optional gap between the train and validation indices of a block;
        # increase it to guard against leakage from autocorrelated samples
        margin = 0
        for i in range(self.n_splits):
            start = i * k_fold_size
            stop = start + k_fold_size
            # First 80% of the block is train, the remainder is validation
            mid = int(0.8 * (stop - start)) + start
            yield indices[start:mid], indices[mid + margin:stop]

Now that you have a fully defined splitting class, instantiate an object from it.

btscv = BlockedTimeSeriesSplit(n_splits=3)

Using a very simple dataset, let's see how this splitting strategy divides the data into train and validation sets during cross-validation.

# Create a sample dataset X and y
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]])
y = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24])

# Print out the train and test (validation) sets for each fold
for i, (train_index, test_index) in enumerate(btscv.split(X)):
    print(f"Fold {i+1}:")
    print(f"  train:{X[train_index]}")
    print(f"  test:{X[test_index]}")

The three resulting folds look like this:

Fold 1: train: [[1] [2] [3]]  test: [[4]]
Fold 2: train: [[5] [6] [7]]  test: [[8]]
Fold 3: train: [[9] [10] [11]]  test: [[12]]
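And because the class follows the scikit-learn splitter interface, it plugs straight into cross_validate. Here's a minimal sketch using a plain LinearRegression as a stand-in model (any estimator and scoring metric would work):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# LinearRegression is just a stand-in estimator for illustration
model = LinearRegression()
results = cross_validate(model, X, y, cv=btscv, scoring="neg_mean_absolute_error")
print(results["test_score"])  # one validation score per fold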

Blocked time series split is a great solution if you have enough data to afford it. If not, the standard time series split is still a good option and a vast improvement over a random split. Ultimately, you should try both and see which one better predicts the model's performance on your final test set.

Sources

Packt Editorial Staff. "Cross-Validation strategies for Time Series forecasting [Tutorial]." Packt Hub, 6 May 2019, https://hub.packtpub.com/cross-validation-strategies-for-time-series-forecasting-tutorial/.

