Putting Your Forecasting Model to the Test: A Guide to Backtesting

Evaluating time series models is not a simple task. In fact, it is quite easy to make serious errors while evaluating forecasting models. While these errors may not break the code or prevent us from obtaining some output numbers, they can significantly distort the resulting performance estimates.
In this article, we will demonstrate how to properly evaluate time series models.
Why are standard machine learning validation methods not suitable for time series?
The simplest way to evaluate the performance of a machine learning model is to split the dataset into two subsets: training and test sets. To further improve the robustness of our performance estimate, we may want to split our dataset multiple times. This procedure is called cross-validation.
The following diagram represents one of the most popular types of cross-validation – the k-fold approach. In the case of 5-fold validation, we first divide the dataset into 5 chunks. Then, we train the model using 4 of these chunks and evaluate its performance on the 5th chunk. This process is repeated 4 more times, each time holding out a different chunk for evaluation.
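To make the issue concrete, here is a minimal sketch (using scikit-learn's KFold on toy, hypothetical data) showing that in all but the last fold, part of the training data lies chronologically after the evaluation chunk:

```python
import numpy as np
from sklearn.model_selection import KFold

y = np.arange(20)  # 20 observations ordered in time

kf = KFold(n_splits=5, shuffle=False)
for fold, (train_idx, test_idx) in enumerate(kf.split(y)):
    # count how many training observations come after the test chunk
    n_future = np.sum(train_idx > test_idx[-1])
    print(
        f"fold {fold}: test covers t={test_idx[0]}..{test_idx[-1]}, "
        f"training uses {n_future} observations from the future"
    )
```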

Based on the diagram, you can probably already spot the problem with using this approach for forecasting. In all but one of the folds, we would train the model using data that chronologically comes after the evaluation set. This leads to data leakage, which we should absolutely avoid: the model can learn patterns from the future that have not yet revealed themselves in the past, which results in overly optimistic performance estimates.
K-fold cross-validation, along with many other approaches, operates under the assumption that the observations are independent. The temporal dependencies in time series data clearly do not align with this assumption, which makes most of the validation approaches popular in regression or classification unusable. That's why we must use validation methods that are tailored to time series data.
Note: Bergmeir et al. (2018) show that, in the case of a purely autoregressive model, standard K-fold CV can be used as long as the considered models have uncorrelated errors. You can read more about it in the paper listed in the References section.
What is backtesting?
Backtesting (also known as hindcasting or time series cross-validation) is a set of validation approaches designed to meet the specific requirements of time series. Similar to cross-validation, the goal of backtesting is to obtain a reliable estimate of a model's performance after being deployed. We can also use these approaches for hyperparameter tuning and feature selection.
The idea of backtesting is to replicate a realistic scenario. The training data should correspond to the data available for training a model at the moment of making a prediction. The validation set should reflect the data we would encounter after deploying that model.
Below we present a diagram of an approach called walk-forward validation (or expanding window validation), which follows the characteristics we have just described. At each subsequent time point, we have a bit more data to train our model, and correspondingly, our test set advances by the same time interval. This type of validation preserves the temporal order of the time series.
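If you do not want to hand-roll the splitting logic, scikit-learn's TimeSeriesSplit implements exactly this expanding-window scheme. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(20)  # 20 observations ordered in time

# expanding training window; the test chunk always comes after the training data
tscv = TimeSeriesSplit(n_splits=4, test_size=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(y)):
    print(
        f"fold {fold}: train t=0..{train_idx[-1]}, "
        f"test t={test_idx[0]}..{test_idx[-1]}"
    )
```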

Walk-forward validation is the simplest approach to backtesting. We could consider some of its modifications, which might better suit our use case:
- We have assumed an expanding window. However, we might want to train our model using only the latest subset of our time series. In that case, we should use a rolling window of a fixed size instead (see the sketch after this list, which also illustrates the purging gap described below).
- We can follow various strategies when it comes to refitting our model. In the simplest case, we refit the model in each iteration of the backtest. Alternatively, we could fit the model only in the first iteration and then create predictions using an already fitted model (with potentially updated features). Or we could refit it every X iterations. Once again, we should select a solution that closely mirrors the real-life use case of the model.
- We could introduce a gap between the training and validation sets, as the initial part of the validation set might be highly correlated with the final part of the training set. By creating a gap (for example, by removing the training observations close to the validation set), we enhance the independence between the two sets. This process is also known as purging.
- Another complexity might arise when working with multiple time series, such as sales of different products. Since time series in our dataset are likely correlated with each other (at least to some extent), we might want to keep each series in specific folds to prevent information leakage. For more information on that, please refer to the links mentioned in the References section.
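Two of these modifications, the rolling window and the purging gap, are also available out of the box in scikit-learn's TimeSeriesSplit. A minimal sketch on the same toy data as before:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(20)  # 20 observations ordered in time

# rolling window of at most 8 observations, plus a 2-step gap
# between the training and validation sets (purging)
tscv = TimeSeriesSplit(n_splits=4, test_size=4, max_train_size=8, gap=2)
for fold, (train_idx, test_idx) in enumerate(tscv.split(y)):
    print(
        f"fold {fold}: train t={train_idx[0]}..{train_idx[-1]}, "
        f"test t={test_idx[0]}..{test_idx[-1]}"
    )
```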
In the next part of the article, we will demonstrate how to create a custom backtesting class using the simplest case of walk-forward validation. I highly encourage you to experiment and try coding the other possibilities yourself. Alternatively, you can always use the backtesting capabilities of popular Python libraries dedicated to Time Series Forecasting.

Hands-on example
First, we import the required libraries. As we want to create a custom backtesting class, we will rely only on the standard data science libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
Generating data
To keep things simple, we will generate a daily time series spanning 4 years.
Later on, we will test a few models, including two ML models. To prepare for that, we will create monthly dummies as a way of accounting for seasonality. Please refer to this article for more details on that approach.
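The data-generation snippet is not reproduced verbatim here; below is a minimal sketch of what it might look like (the trend, seasonality, and noise levels, as well as the FEATURES name, are illustrative assumptions):

```python
# 4 years of daily data: trend + yearly seasonality + noise
np.random.seed(42)
index = pd.date_range(start="2020-01-01", end="2023-12-31", freq="D")
trend = np.linspace(100, 150, num=len(index))
seasonality = 20 * np.sin(2 * np.pi * index.dayofyear / 365.25)
noise = np.random.normal(scale=5, size=len(index))

df = pd.DataFrame({"y": trend + seasonality + noise}, index=index)

# monthly dummies as a simple way of encoding seasonality for the ML models
month_dummies = pd.get_dummies(df.index.month, prefix="month", drop_first=True)
month_dummies.index = df.index
df = df.join(month_dummies)
FEATURES = month_dummies.columns.tolist()
```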
Defining the backtester class
The custom backtester class is quite lengthy, so we will first take a look at the code here and then analyze it piece by piece.
Let's start with the inputs:
- pred_func – a function that generates the forecasts given the training data and the required features. We decided to use this approach because we want to keep our class flexible and allow users to plug in whichever ML libraries/frameworks they want.
- start_date – the start date of the backtest.
- end_date – the end date of the backtest.
- backtest_freq – how frequently we should train the model and create predictions. For example, by providing "7D" we will create a new set of predictions every week, starting from the start date.
- data_freq – the frequency of the data. We use this to create the predictions for the correct dates.
- forecast_horizon – combined with data_freq, we use this to make sure that we create forecasts for the desired horizon.
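The full class is available in the linked repository; here is a minimal sketch of its constructor, assuming the class is called Backtester (the names and defaults are illustrative, not the exact original code):

```python
class Backtester:
    def __init__(self, pred_func, start_date, end_date,
                 backtest_freq="7D", data_freq="D",
                 forecast_horizon=7, verbose=False):
        self.pred_func = pred_func            # callable that produces the forecasts
        self.start_date = pd.Timestamp(start_date)
        self.end_date = pd.Timestamp(end_date)
        self.backtest_freq = backtest_freq    # how often we retrain and predict
        self.data_freq = data_freq            # frequency of the underlying series
        self.forecast_horizon = forecast_horizon
        self.verbose = verbose
        self.backtest_df = None               # populated by run_backtest
```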
In the run_backtest method, we iterate through each forecast date within the backtest. For each date, we separate the training set (containing all the information available at the time of making the forecast) from the validation set (determined by the forecast horizon). Subsequently, we generate forecasts and store the predicted values together with the actual values. In the final step, we combine all the individual DataFrames into a single DataFrame that contains all the predictions made throughout the backtest.
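Continuing the hypothetical Backtester sketch from above, run_backtest might look roughly like this (the signature, in particular target_col and feature_cols, is an assumption):

```python
# extend the Backtester sketch from above with the run_backtest method
# (a notebook-style trick so the walkthrough can stay in separate snippets)
class Backtester(Backtester):
    def run_backtest(self, df, target_col="y", feature_cols=None):
        results = []
        forecast_dates = pd.date_range(self.start_date, self.end_date,
                                       freq=self.backtest_freq)

        for forecast_date in forecast_dates:
            # training set: everything available before the forecast date
            train = df.loc[df.index < forecast_date]
            # validation set: the next `forecast_horizon` periods
            valid_index = pd.date_range(forecast_date,
                                        periods=self.forecast_horizon,
                                        freq=self.data_freq)
            valid = df.loc[df.index.isin(valid_index)]

            if self.verbose:
                print(f"train: {train.index.min().date()} - {train.index.max().date()} "
                      f"({len(train)} obs) | "
                      f"valid: {valid.index.min().date()} - {valid.index.max().date()} "
                      f"({len(valid)} obs)")

            preds = self.pred_func(train, valid, target_col, feature_cols)

            # store the predictions together with the actuals for this iteration
            fold_df = valid[[target_col]].rename(columns={target_col: "actual"})
            fold_df["prediction"] = preds
            fold_df["forecast_date"] = forecast_date
            fold_df["horizon"] = np.arange(1, len(fold_df) + 1)
            results.append(fold_df)

        # combine the per-iteration DataFrames into a single one
        self.backtest_df = pd.concat(results)
        return self.backtest_df
```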
In the evaluate_backtest method, we use the previously generated DataFrame to compute various evaluation metrics. The metrics can be specified by providing a dictionary that maps each metric's name to the function used for its calculation. We then calculate each of the requested metrics separately for each forecast horizon, as well as for the backtest as a whole.
The first step of the evaluate_backtest method is to check whether the DataFrame with the backtest results is available (it becomes available after we have used the run_backtest method). If it is not available, we first need to actually run the backtest.
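Again as part of the hypothetical Backtester sketch, evaluate_backtest might look like this (here the sketch simply raises an error when run_backtest has not been called yet; the original class may handle this step differently):

```python
# extend the Backtester sketch with the evaluate_backtest method
class Backtester(Backtester):
    def evaluate_backtest(self, metrics):
        # metrics: dict mapping a name to a scoring function,
        # e.g. {"MAE": mean_absolute_error, "MSE": mean_squared_error}
        if self.backtest_df is None:
            raise ValueError("No backtest results found - run run_backtest first.")

        scores = {}
        for name, metric_func in metrics.items():
            # score per forecast horizon...
            per_horizon = (
                self.backtest_df
                .groupby("horizon")[["actual", "prediction"]]
                .apply(lambda g: metric_func(g["actual"], g["prediction"]))
            )
            # ...and over the whole backtest
            overall = metric_func(self.backtest_df["actual"],
                                  self.backtest_df["prediction"])
            scores[name] = {"per_horizon": per_horizon, "overall": overall}
        return scores
```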
Running backtests
Now it is time to run the backtests. We will compare the performance of four "models":
- The naive forecast: In this approach, the forecast is equal to the last known value at the moment of making the forecast.
- The mean forecast: This forecast is equal to the mean of the training set.
- A linear regression model with month dummies as features.
- A random forest model with month dummies as features.
The first two models will serve as simple benchmarks, while the latter two aim to actually learn something. However, these are by no means good models. We use them only to illustrate the backtesting capabilities of our class.
In the following snippet, we define the functions used for obtaining the predictions. As mentioned earlier, we chose this approach to maintain flexibility and the ability to wrap any type of ML model into a function that returns predictions for the expected horizon.
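The original snippet is not reproduced here; below is a sketch of how such prediction functions could look, assuming the (train, valid, target_col, feature_cols) signature used by the Backtester sketch above:

```python
def naive_forecast(train, valid, target_col, feature_cols=None):
    # repeat the last known value over the whole forecast horizon
    return np.repeat(train[target_col].iloc[-1], len(valid))


def mean_forecast(train, valid, target_col, feature_cols=None):
    # forecast equal to the mean of the training set
    return np.repeat(train[target_col].mean(), len(valid))


def lin_reg_forecast(train, valid, target_col, feature_cols):
    # linear regression with the month dummies as features
    model = LinearRegression()
    model.fit(train[feature_cols], train[target_col])
    return model.predict(valid[feature_cols])


def random_forest_forecast(train, valid, target_col, feature_cols):
    # random forest with the month dummies as features
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(train[feature_cols], train[target_col])
    return model.predict(valid[feature_cols])
```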
In the following snippet, we define constants used to run the backtests. We also define a dictionary which we will populate with scores.
For metrics, we chose to focus on Mean Absolute Error (MAE) and Mean Squared Error (MSE). We chose those two to show that we can use multiple metrics in this approach. For the actual comparison of the models, we will focus on MAE.
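A sketch of what these constants and containers might look like (the concrete dates and frequencies are illustrative assumptions):

```python
# backtest configuration
START_DATE = "2023-01-01"
END_DATE = "2023-12-01"
BACKTEST_FREQ = "7D"     # create a new set of forecasts every week
DATA_FREQ = "D"          # daily data
FORECAST_HORIZON = 7     # forecast one week ahead

# metrics passed to evaluate_backtest
METRICS = {"MAE": mean_absolute_error, "MSE": mean_squared_error}

# container for the MAE scores of the individual approaches
all_scores = {}
```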
In the following snippet, we backtest the naive model. Because we set verbose to True, we can also inspect each iteration of the backtest: we print the ranges of the training and validation sets, together with the number of observations in each.
Additionally, we store the MAE scores in a dictionary, indicating the approach from which the scores come. We do this so that we can combine them later for evaluation.
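Sticking with the hypothetical Backtester API sketched above, backtesting the naive forecast could look like this:

```python
naive_backtester = Backtester(
    pred_func=naive_forecast,
    start_date=START_DATE,
    end_date=END_DATE,
    backtest_freq=BACKTEST_FREQ,
    data_freq=DATA_FREQ,
    forecast_horizon=FORECAST_HORIZON,
    verbose=True,  # print the train/valid ranges of each iteration
)
naive_backtester.run_backtest(df, target_col="y")
naive_scores = naive_backtester.evaluate_backtest(METRICS)

# keep the overall MAE for the final comparison
all_scores["naive"] = naive_scores["MAE"]["overall"]
```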
Similarly, we perform backtests for each of the three remaining forecasting functions. For brevity, we do not include all of the code here, as it is quite similar. We only provide the example of the linear model, as we had to add the list of features as an additional argument to the run_backtest method.
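Under the same assumptions as before, the linear regression backtest only differs by passing the list of features:

```python
lr_backtester = Backtester(
    pred_func=lin_reg_forecast,
    start_date=START_DATE,
    end_date=END_DATE,
    backtest_freq=BACKTEST_FREQ,
    data_freq=DATA_FREQ,
    forecast_horizon=FORECAST_HORIZON,
)
# the ML models additionally need the list of features (the month dummies)
lr_backtester.run_backtest(df, target_col="y", feature_cols=FEATURES)
lr_scores = lr_backtester.evaluate_backtest(METRICS)

all_scores["linear_regression"] = lr_scores["MAE"]["overall"]
```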
Backtest results
By combining the results of the backtests, we can see that the ML models outperformed the benchmarks in terms of MAE.
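With the score dictionary populated for all four approaches (using the sketches above), the comparison can be as simple as:

```python
# lower MAE is better
comparison = pd.Series(all_scores, name="MAE").sort_values()
print(comparison)
```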
As a potential extension of the backtesting class, it would be nice to plot some of the predictions against the actuals to further evaluate the quality of the forecasts.
Wrapping up
- Due to temporal dependencies present in time series, traditional validation approaches such as k-fold cross-validation cannot be used.
- Backtesting (or time series cross-validation) consists of validation approaches designed to meet the specific requirements of time series.
- We can use backtesting to obtain a reliable estimate of a model's performance after being deployed. Additionally, we can also use these approaches for hyperparameter tuning and feature selection.
You can find the code used in this post here. As always, any constructive feedback is more than welcome. You can reach out to me on LinkedIn, Twitter or in the comments.
You might also be interested in one of the following:
The Comprehensive Guide to Moving Averages in Time Series Analysis
A Comprehensive Guide on Interaction Terms in Time Series Forecasting
References
- Bergmeir, C., Costantini, M., & Benítez, J. M. (2014). On the usefulness of cross-validation for directional forecast evaluation. Computational Statistics & Data Analysis, 76, 132–143.
- Bergmeir, C., Hyndman, R. J., & Koo, B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120, 70–83.
- Racine, J. (2000). Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics, 99(1), 39–61.
- https://www.kaggle.com/code/jorijnsmit/found-the-holy-grail-grouptimeseriessplit
- https://stackoverflow.com/questions/51963713/cross-validation-for-grouped-time-series-panel-data
- https://datascience.stackexchange.com/questions/77684/time-series-grouped-cross-validation
All images, unless noted otherwise, are by the author.