Stacking Time Series Models to Improve Accuracy

Research around powerful time series models is robust. There are many options available, far beyond classical techniques like ARIMA. Lately, recurrent neural networks, and LSTMs in particular, have drawn much of researchers' attention. Going by the download counts listed on PePy, Prophet is probably the model most widely used by time series forecasters, thanks to its low barrier to entry. Which of these options is best for your series?
Maybe the answer is that you should try all of them and combine their various strengths. A common technique to do this is known as stacking. The popular machine learning library, scikit-learn, offers a StackingRegressor that can be employed for time series tasks. I have previously demonstrated how to use it for just such a case.
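As a quick refresher, a scikit-learn stack on lag features looks something like this (a minimal sketch; the estimators, the 24-lag setup, and the series_values placeholder are illustrative choices, not the exact setup from that earlier demonstration):
import pandas as pd
from sklearn.ensemble import StackingRegressor, GradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor

def make_lag_frame(y, n_lags=24):
    # turn a series into a supervised frame of lagged values predicting the next step
    df = pd.DataFrame({'y': y})
    for lag in range(1, n_lags + 1):
        df[f'lag_{lag}'] = df['y'].shift(lag)
    return df.dropna()

lagged = make_lag_frame(series_values)  # series_values is a placeholder for your own data
stack = StackingRegressor(
    estimators=[('enet', ElasticNet()), ('knn', KNeighborsRegressor())],
    final_estimator=GradientBoostingRegressor(),
)
stack.fit(lagged.drop(columns='y'), lagged['y'])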
There is a limitation to the StackingRegressor, however: it only accepts models that follow the scikit-learn API. A model like ARIMA, which has no scikit-learn implementation, or a deep network built in tensorflow, therefore cannot be placed in the stack. There is a solution. In this post, I will show how to extend stacking methods for time series using the scalecast package and a Jupyter notebook. The dataset used is available open access on GitHub. The following requirements are needed:
pip install --upgrade scalecast
conda install tensorflow
conda install shap
conda install -c conda-forge cmdstanpy
pip install prophet
Dataset Overview
The dataset is hourly and split into a train set (700 observations) and a test set (48 observations). My notebook uses the H1 series, but modifying it to use any of the hourly series from the M4 dataset is straightforward. This is how to read the data and store it in a Forecaster object:
import pandas as pd
import numpy as np
from scalecast.Forecaster import Forecaster
from scalecast.util import metrics
import matplotlib.pyplot as plt
import seaborn as sns
def read_data(idx = 'H1', cis = True, metrics = ['smape']):
    info = pd.read_csv(
        'M4-info.csv',
        index_col=0,
        parse_dates=['StartingDate'],
        dayfirst=True,
    )
    train = pd.read_csv(
        'Hourly-train.csv',
        index_col=0,
    ).loc[idx]
    test = pd.read_csv(
        'Hourly-test.csv',
        index_col=0,
    ).loc[idx]
    y = train.values
    sd = info.loc[idx,'StartingDate']
    fcst_horizon = info.loc[idx,'Horizon']
    cd = pd.date_range(
        start = sd,
        freq = 'H',
        periods = len(y),
    )
    f = Forecaster(
        y = y, # observed values
        current_dates = cd, # current dates
        future_dates = fcst_horizon, # forecast length
        test_length = fcst_horizon, # test-set length
        cis = cis, # whether to evaluate intervals for each model
        metrics = metrics, # what metrics to evaluate
    )
    return f, test.values
f, test_set = read_data()
f # display the Forecaster object
Here is a plot of the series:
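The plotting code is not shown above; one simple way to produce such a plot from the values stored on the Forecaster object is:
plt.figure(figsize=(12,6))
plt.plot(f.current_dates, f.y)  # observed values against their dates
plt.title('H1 - hourly observations')
plt.show()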

Applied Models
Before we start stacking models, we need to generate predictions from them. I chose to use a naive model as a benchmark against ARIMA, LSTM, and Prophet models. In the sections that follow, I explain why I made each selection, though other choices could be just as interesting, or even more so.
Naive Benchmark
To benchmark all models, we can call a seasonal naive estimator, which propagates the last 24 observed values of an hourly series forward. In scalecast, this is easy to do.
f.set_estimator('naive')
f.manual_forecast(seasonal=True)
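Under the hood, the seasonal naive forecast simply repeats the last observed day; a rough numpy equivalent (an illustration, not scalecast's internal implementation) would be:
last_day = np.asarray(f.y)[-24:]  # the most recent 24 hourly observations
manual_seasonal_naive = np.tile(last_day, 2)[:48]  # repeat across the 48-hour horizon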
ARIMA
Autoregressive Integrated Moving Average is a popular and simple time series technique that uses a series' lags and errors to make predictions about its future in a linear fashion. Through an exploratory data analysis (which you can see in the linked notebook), I determined the series was not stationary and that it was highly seasonal. I ultimately chose to apply a seasonal ARIMA model of order (5,1,4) x (1,1,1,24).
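Those conclusions came from standard checks such as the Augmented Dickey-Fuller test; a sketch of that kind of check (not necessarily the exact code from the linked notebook) looks like this:
from statsmodels.tsa.stattools import adfuller

stat, pval = adfuller(f.y)[:2]
print(f'ADF statistic: {stat:.2f}, p-value: {pval:.3f}')  # a high p-value suggests non-stationarity
Here is the seasonal ARIMA fit in scalecast: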
f.set_estimator('arima')
f.manual_forecast(
    order = (5,1,4),
    seasonal_order = (1,1,1,24),
    call_me = 'manual_arima',
)
LSTM
If ARIMA is on the simpler side of time series models, LSTM is one of the more advanced methods. It is a deep learning technique with many parameters, including memory cells and gating mechanisms that capture both long- and short-term patterns in sequential data, which theoretically makes it an ideal choice for time series. Setting this model up directly in tensorflow is cumbersome, but it is not too bad in scalecast (see this article). I applied two LSTM models: one with a tanh activation function and one with ReLU.
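For contrast, here is roughly what a comparable stack of LSTM layers looks like when written against the Keras API directly (a sketch only; the input shape, loss, and one-step output are assumptions, and scalecast handles all of this wiring for us):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(100, activation='tanh', return_sequences=True, input_shape=(48, 1)),  # 48 lags in, 1 feature
    LSTM(100, activation='tanh', return_sequences=True),
    LSTM(100, activation='tanh'),
    Dense(1),  # one-step-ahead output
])
model.compile(optimizer='Adam', loss='mse')
The scalecast versions go through the 'rnn' estimator: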
f.set_estimator('rnn')
f.manual_forecast(
    lags = 48,
    layers_struct=[
        ('LSTM',{'units':100,'activation':'tanh'}),
        ('LSTM',{'units':100,'activation':'tanh'}),
        ('LSTM',{'units':100,'activation':'tanh'}),
    ],
    optimizer = 'Adam',
    epochs = 15,
    plot_loss = True,
    validation_split=0.2,
    call_me = 'rnn_tanh_activation',
)
f.manual_forecast(
    lags = 48,
    layers_struct=[
        ('LSTM',{'units':100,'activation':'relu'}),
        ('LSTM',{'units':100,'activation':'relu'}),
        ('LSTM',{'units':100,'activation':'relu'}),
    ],
    optimizer = 'Adam',
    epochs = 15,
    plot_loss = True,
    validation_split=0.2,
    call_me = 'rnn_relu_activation',
)
Prophet
Despite its extreme popularity, the Prophet model has been maligned recently. Critics claim its accuracy is underwhelming, mainly because it extrapolates trends unrealistically and does not model local patterns autoregressively. However, it does some things uniquely: it automatically applies holiday effects, and it models several types of seasonality, all with minimal specification required from the user. I like the idea of using it as a signal, even if it is not ideal for generating point predictions.
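For context, fitting Prophet directly looks roughly like this (a hedged sketch; the holiday country and seasonality flags are illustrative choices, and scalecast takes care of this wiring when we call manual_forecast):
from prophet import Prophet

pm = Prophet(daily_seasonality=True, weekly_seasonality=True)
pm.add_country_holidays(country_name='US')  # country chosen for illustration only
prophet_df = pd.DataFrame({'ds': np.asarray(f.current_dates), 'y': np.asarray(f.y)})
pm.fit(prophet_df)
In scalecast, the model is called with two lines: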
f.set_estimator('prophet')
f.manual_forecast()
Compare Results
Now that we have predictions generated for each model, let's see how they performed on the validation set, which is the last 48 observations of our training set (this is still separate from the test set mentioned earlier).
results = f.export(determine_best_by='TestSetSMAPE')
ms = results['model_summaries']
ms[
    [
        'ModelNickname',
        'TestSetLength',
        'TestSetSMAPE',
        'InSampleSMAPE',
    ]
]

The good news here is that each model outperformed the naive method. The ARIMA model performed best, with a percentage error of 4.7%, followed by Prophet. Let's see all forecasts plotted against the validation set:
f.plot(order_by="TestSetSMAPE",ci=True)
plt.show()

All of these models performed reasonably on this series, without large deviations between any of them. Let's stack them!
Stacking Models
Every stacking model needs a final estimator, which filters the other models' estimates into a single new set of predictions. We will stack our previously evaluated models with a boosted tree estimator, CatBoost, a powerful procedure that will hopefully draw out the best signal from each of the already-applied models.
f.add_signals(
    f.history.keys(), # add signals from all previously evaluated models
)
f.add_ar_terms(48)
f.set_estimator('catboost')
The above code adds the predictions from each of the evaluated models into the Forecaster object. It calls these predictions "signals," and they are treated the same as any other covariate stored in the object. We also add the series' last 48 lags as additional regressors that the CatBoost model can use to make its forecast. Let's now call three CatBoost models: one that uses all available signals and series lags, one that uses only the signals, and one that uses only the lags.
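Before fitting them, a quick optional check confirms what was added; the regressor names stored on the object can be printed directly (the signal columns appear alongside the AR terms added above):
print(f.get_regressor_names())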
f.manual_forecast(
    Xvars='all',
    call_me='catboost_all_reg',
    verbose = False,
)
f.manual_forecast(
    Xvars=[x for x in f.get_regressor_names() if x.startswith('AR')],
    call_me = 'catboost_lags_only',
    verbose = False,
)
f.manual_forecast(
    Xvars=[x for x in f.get_regressor_names() if not x.startswith('AR')],
    call_me = 'catboost_signals_only',
    verbose = False,
)
Let's tap into that test set we kept separate from the Forecaster object at the beginning of the analysis and compare the results of all applied models. This time, we will look at two measures: SMAPE and Mean Absolute Scaled Error (MASE). These are the two metrics that were used in the actual M4 competition.
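For reference, the two metrics can be written out by hand roughly as follows (a sketch of the standard definitions; the numbers reported below come from scalecast.util.metrics):
def smape_manual(actuals, forecast):
    # symmetric MAPE, returned as a proportion (multiply by 100 for a percentage)
    actuals, forecast = np.asarray(actuals, dtype=float), np.asarray(forecast, dtype=float)
    return np.mean(2 * np.abs(forecast - actuals) / (np.abs(actuals) + np.abs(forecast)))

def mase_manual(actuals, forecast, insample, m=24):
    # absolute error scaled by the in-sample seasonal naive error
    actuals, forecast, insample = (np.asarray(x, dtype=float) for x in (actuals, forecast, insample))
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))
    return np.mean(np.abs(forecast - actuals)) / scale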
test_results = pd.DataFrame(index = f.history.keys(), columns = ['smape','mase'])
for k, v in f.history.items():
    test_results.loc[k,['smape','mase']] = [
        metrics.smape(test_set,v['Forecast']),
        metrics.mase(test_set,v['Forecast'],m=24,obs=f.y),
    ]
test_results.sort_values('smape')

By combining signals from a diverse class of models, we have generated two estimators that outperform the others: the CatBoost model trained with all signals and the 48 series lags, followed by the CatBoost model that used only the signals. Both scored an error of about 2.8% out of sample. We can plot these two models against the actuals from the test set.
fig, ax = plt.subplots(figsize=(12,6))
f.plot(
    models = ['catboost_all_reg','catboost_signals_only'],
    ci=True,
    ax = ax,
)
sns.lineplot(
    x = f.future_dates,
    y = test_set,
    ax = ax,
    label = 'held out actuals',
    color = 'darkblue',
    alpha = .75,
)
plt.show()

Which Signals Were Most Important?
To round out the analysis, we can use Shapley values to determine which signals were most important in this stack. Shapley values are considered a state-of-the-art way to measure each input's contribution to a given machine learning model's predictions; higher scores mean the inputs were more important to that particular model.
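scalecast produces these scores for us (shap was one of the requirements installed at the beginning); computed directly with the shap package, the calculation would look roughly like this (fitted_model and X_train are stand-ins, not objects created above):
import shap

explainer = shap.TreeExplainer(fitted_model)   # TreeExplainer works with tree models like CatBoost
shap_values = explainer.shap_values(X_train)   # one score per feature per observation
importance = np.abs(shap_values).mean(axis=0)  # mean magnitude as a feature ranking
In scalecast, the scores for any evaluated model can be exported with one line: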
f.export_feature_importance('catboost_all_reg')

The above screenshot only shows the first few most important predictors, but we can see that the ARIMA signal was the most important, followed by the series' first lag and then the Prophet signal. The RNN models also scored higher than many of the included lags. This can be a good starting place if we want to train a more parsimonious model in the future.
Conclusion
In this post, I demonstrated the power of stacking models in a time-series context and how using diverse model classes led to higher accuracy on the explored series. This is all made easy with the scalecast package. If you found the analysis interesting, please give that package a star on GitHub. Thank you for following along!