Deep Learning for Forecasting: Preprocessing and Training

This article is a follow-up to a previous one, where we learned how to transform a time series for deep learning.
We continue to explore deep neural networks for forecasting. In this post, we'll:
- Learn how to train a global forecasting model using deep learning, including basic preprocessing steps;
- Explore keras callbacks to drive the training process of a neural network.
Deep Learning for Forecasting
Deep neural networks tackle forecasting problems using auto-regression. Auto-regression is a modeling technique that involves using past observations to predict future ones.
Deep neural networks can be designed in different ways, such as recurrent or convolutional architectures. Recurrent neural networks are often preferred for time series data. Among other reasons, this type of network excels at modeling long-term dependencies. This feature can have a strong impact on forecasting performance.
Here's how to define a specific kind of recurrent neural network called LSTM (Long Short-Term Memory). The comments provide a brief description of each model element.
from keras.models import Sequential
from keras.layers import (Dense,
                          LSTM,
                          TimeDistributed,
                          RepeatVector,
                          Dropout)
# Number of variables in the time series.
# 1 means the time series is univariate
N_FEATURES = 1
# Number of lags in the auto-regressive model
N_LAGS = 24
# Number of future steps to be predicted
HORIZON = 12
# 'Sequential' instance is used to create a linear stack of layers
# ... each layer feeds into the next one.
model = Sequential()
# Adding an LSTM layer with 32 units and relu activation
model.add(LSTM(32, activation='relu', input_shape=(N_LAGS, N_FEATURES)))
# Using dropout to avoid overfitting
model.add(Dropout(.2))
# Repeating the encoded vector HORIZON times to match the length of the output sequence
model.add(RepeatVector(HORIZON))
# Another LSTM layer, this time with 16 units
# Also returning the output of each time step (return_sequences=True)
model.add(LSTM(16, activation='relu', return_sequences=True))
# Using dropout again with 0.2 dropout rate
model.add(Dropout(.2))
# Adding a standard fully connected neural network layer
# And distributing the layer to each time step
model.add(TimeDistributed(Dense(N_FEATURES)))
# Compiling the model using ADAM and setting the objective to minimize MSE
model.compile(optimizer='adam', loss='mse')
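After compiling, you can print a summary of the network to double-check the layer shapes and parameter counts:
# printing the layers and the number of parameters of the model
model.summary()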
Previously, we learned how to transform a single time series to train this model. But sometimes, you have several time series available.
How do you handle such cases?
Using many time series for deep learning
The rise of global methods
Forecasting models are usually created with the historical data of a time series. Such models can be referred to as local to that time series. By contrast, global methods pool the historical data of many time series to build a model.
The interest in global models surged when a method called ES-RNN won the M4 competition – a forecasting contest featuring 100,000 different time series.
When and why to use a global model
Global models can provide considerable value in forecasting problems involving many time series, for example in retail, where the goal is to predict the sales of many products.
Another motivation for using this kind of approach is to have more data. Machine Learning algorithms are likely to perform better with larger training sets. This is especially so with methods with lots of parameters, such as deep neural networks. These are known to be data-hungry.
Global forecasting models do not assume that the underlying time series are dependent; that is, they do not assume that the lags of one series can be used to forecast the future values of another.
Rather, these techniques exploit information from many time series to estimate the parameters of the model. When forecasting the future of a given series, the main input to the model is the recent past lags of that same series.
Hands-On
In the rest of this article, we'll explore how to train a deep neural network using many time series.
Data
We'll use a data set about the power consumption in 8 regions across the USA:

The goal is to forecast power consumption in the following days. This problem is relevant for power systems operators. Accurate predictions help balance the supply and demand of energy.
We can read the data as follows:
import pandas as pd

# https://github.com/vcerqueira/blog/tree/main/data
data = pd.read_csv('data/daily_energy_demand.csv',
                   parse_dates=['Datetime'],
                   index_col='Datetime')
print(data.head())

Preprocessing steps
When training a deep neural network with multiple time series, you need to apply some preprocessing steps. Here, we'll explore the following two:
- Mean-scaling
- Log transformation
The available set of time series can have different scales. Thus, it's important to normalize each series into a common value range. For global forecasting models, this is usually done by dividing each observation by the mean value of the respective series.
from sklearn.model_selection import train_test_split
# leaving last 20% of observations for testing
train, test = train_test_split(data, test_size=0.2, shuffle=False)
# computing the average of each series in the training set
mean_by_series = train.mean()
# mean-scaling: dividing each series by its mean value
train_scaled = train / mean_by_series
test_scaled = test / mean_by_series
After mean-scaling, the log transformation can also be helpful.
In a previous article, we explored how taking the log of a time series is a useful transformation to handle heteroskedasticity. The log transformation also helps avoid the saturation areas of the neural network. Saturation occurs when the neural network becomes insensitive to different inputs. This hampers the learning process and leads to a poor model.
import numpy as np

class LogTransformation:

    @staticmethod
    def transform(x):
        # signed log transformation: log(|x| + 1), keeping the original sign
        xt = np.sign(x) * np.log(np.abs(x) + 1)
        return xt

    @staticmethod
    def inverse_transform(xt):
        # reverting the signed log transformation
        x = np.sign(xt) * (np.exp(np.abs(xt)) - 1)
        return x

# log transformation
train_scaled_log = LogTransformation.transform(train_scaled)
test_scaled_log = LogTransformation.transform(test_scaled)
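As a quick sanity check, applying the inverse transformation to the transformed data should recover the mean-scaled series (up to floating-point error):
# the transformation is invertible, so this should print True
print(np.allclose(LogTransformation.inverse_transform(train_scaled_log), train_scaled))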
Auto-regression
After pre-processing each time series, we need to transform them from sequences into a set of observations. For a single time series, you can check the previous article to learn the details of this process.
For several time series, the idea is similar. We create a set of observations for each series individually. Then, these are concatenated into a single data set.
Here's how you can do this:
# src module here: https://github.com/vcerqueira/blog/tree/main/src
from src.tde import time_delay_embedding

N_FEATURES = 1  # time series is univariate
N_LAGS = 3  # number of lags
HORIZON = 2  # forecasting horizon

# transforming time series for supervised learning
train_by_series, test_by_series = {}, {}
# iterating over each time series
for col in data:
    train_series = train_scaled_log[col]
    test_series = test_scaled_log[col]

    train_series.name = 'Series'
    test_series.name = 'Series'

    # creating observations using a sliding window method
    train_df = time_delay_embedding(train_series, n_lags=N_LAGS, horizon=HORIZON)
    test_df = time_delay_embedding(test_series, n_lags=N_LAGS, horizon=HORIZON)

    train_by_series[col] = train_df
    test_by_series[col] = test_df
After that, you combine the data of each time series by a row-wise concatenation:
train_df = pd.concat(train_by_series, axis=0)
print(train_df)
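The next steps also rely on two reshaping helpers, from_matrix_to_3d and from_3d_to_matrix, which convert between the 2-d lag matrix and the 3-d array of shape (samples, time steps, features) that keras expects. They are not defined in this post; here's a minimal sketch of what they might look like for a univariate series (an assumed implementation based on how they are used below, not the original one):
import numpy as np
import pandas as pd

def from_matrix_to_3d(df: pd.DataFrame) -> np.ndarray:
    # assumed helper: reshaping (n_samples, n_steps) into (n_samples, n_steps, 1)
    return df.values.reshape(df.shape[0], df.shape[1], 1)

def from_3d_to_matrix(arr: np.ndarray, columns: pd.Index) -> pd.DataFrame:
    # assumed helper: reshaping (n_samples, n_steps, 1) back into a 2-d matrix
    return pd.DataFrame(arr.reshape(arr.shape[0], -1), columns=columns)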

Finally, we split the target variables from the explanatory ones as described before:
# defining target (Y) and explanatory variables (X)
# lag columns contain '(t-' or '(t)', horizon columns contain '(t+'
predictor_variables = train_df.columns.str.contains(r'\(t\-|\(t\)')
target_variables = train_df.columns.str.contains(r'\(t\+')
X_tr = train_df.iloc[:, predictor_variables]
Y_tr = train_df.iloc[:, target_variables]
# transforming the data from matrix into a 3-d format for deep learning
X_tr_3d = from_matrix_to_3d(X_tr)
Y_tr_3d = from_matrix_to_3d(Y_tr)
# defining the neural network
model = Sequential()
model.add(LSTM(32, activation='relu', input_shape=(N_LAGS, N_FEATURES)))
model.add(Dropout(.2))
model.add(RepeatVector(HORIZON))
model.add(LSTM(16, activation='relu', return_sequences=True))
model.add(Dropout(.2))
model.add(TimeDistributed(Dense(N_FEATURES)))
model.compile(optimizer='adam', loss='mse')
# splitting training into a development and validation set
X_train, X_valid, Y_train, Y_valid = \
    train_test_split(X_tr_3d, Y_tr_3d, test_size=.2, shuffle=False)
# training the neural network
model.fit(X_train, Y_train, validation_data=(X_valid, Y_valid), epochs=100)
Using Callbacks for Training a Deep Neural Network

Deep neural networks are iterative methods. They go over the training dataset several times in cycles called epochs.
In the above example, we ran 100 epochs. But it's not clear how many epochs you should run to train a network. Too few epochs can lead to underfitting; too many can lead to overfitting.
A way to handle this problem is by monitoring the performance of the neural network after each epoch. Each time the model improves performance, you save it before continuing the training process. Then, after the training is over, you get the best model that was saved.
In keras, you can use callbacks to handle this process for you. A callback is a function that performs some action during the training process. You can check the keras documentation for a complete list of the available callbacks, or learn how to write your own.
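As an illustration, a custom callback only needs to subclass keras' Callback class and implement the hooks it cares about. The toy example below (not part of the original code) just reports the validation loss after each epoch:
from keras.callbacks import Callback

class ValidationLossLogger(Callback):
    # toy custom callback: prints the validation loss at the end of each epoch
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        print(f'Epoch {epoch}: val_loss={logs.get("val_loss")}')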
The callback used to save the model during training is called ModelCheckpoint:
from keras.callbacks import ModelCheckpoint

model_checkpoint = ModelCheckpoint(
    filepath='best_model_weights.h5',
    save_weights_only=True,
    monitor='val_loss',
    mode='min',
    save_best_only=True)
model = Sequential()
model.add(LSTM(32, activation='relu', input_shape=(N_LAGS, N_FEATURES)))
model.add(Dropout(.2))
model.add(RepeatVector(HORIZON))
model.add(LSTM(16, activation='relu', return_sequences=True))
model.add(Dropout(.2))
model.add(TimeDistributed(Dense(N_FEATURES)))
model.compile(optimizer='adam', loss='mse')
history = model.fit(X_train, Y_train,
                    epochs=300,
                    validation_data=(X_valid, Y_valid),
                    callbacks=[model_checkpoint])
Another interesting callback you can use for training is EarlyStopping. It can be used to stop training when performance has stopped improving.
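For example, an EarlyStopping callback along the following lines would halt training once the validation loss stops improving for a number of consecutive epochs (the patience value below is an arbitrary choice), and it can be combined with the checkpoint callback:
from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss',
                               patience=15,
                               restore_best_weights=True)

# passing both callbacks to the training process
history = model.fit(X_train, Y_train,
                    epochs=300,
                    validation_data=(X_valid, Y_valid),
                    callbacks=[model_checkpoint, early_stopping])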
Making predictions
After training, we can retrieve the best model and make predictions on the test set.
# The best model weights are loaded into the model.
model.load_weights('best_model_weights.h5')
# Inference on DAYTON region
test_dayton = test_by_series['DAYTON']
# splitting target variables from explanatory ones
X_ts = test_dayton.iloc[:, predictor_variables]
Y_ts = test_dayton.iloc[:, target_variables]
X_ts_3d = from_matrix_to_3d(X_ts)
# predicting on normalized data
preds = model.predict_on_batch(X_ts_3d)
preds_df = from_3d_to_matrix(preds, Y_ts.columns)
# reverting log transformation
preds_df = LogTransformation.inverse_transform(preds_df)
# reverting mean scaling
preds_df *= mean_by_series['DAYTON']
Key Take-Aways
- Many forecasting problems involve several time series, for example in the retail domain. In such cases, global methods are often a better way to build a model;
- Preprocessing the time series is important before training a neural network. Mean scaling helps bring all series into a common value range. Taking the log stabilizes the variance and avoids saturation areas;
- Use callbacks to drive the training process of neural networks.
Thank you for reading, and see you in the next story!
References
[1] Hourly Energy Consumption time series (License: CC0: Public Domain)
[2] Hewamalage, Hansika, Christoph Bergmeir, and Kasun Bandara. "Recurrent neural networks for Time Series Forecasting: Current status and future directions." International Journal of Forecasting 37.1 (2021): 388–427.
[3] Slawek Smyl, Jai Ranganathan, and Andrea Pasqua. "M4 Forecasting Competition: Introducing a New Hybrid ES-RNN Model", 2018. URL: https://eng.uber.com/m4-forecasting-competition/