Neural Networks for Flexible Multivariate Forecasting


Forecasting multiple time series can quickly become a complicated task; traditional approaches either require a separate model per series (e.g. SARIMA) or assume that all series are related (e.g. VARMA). Neural networks offer a flexible approach that enables multi-series forecasts with a single model regardless of how correlated the series are.

Additionally, this approach allows exogenous variables to be easily incorporated and can forecast multiple timesteps into the future, resulting in a powerful, general solution that performs well in a wide variety of cases.

In this article, we'll show how to perform the data windowing required to transform our data from a time series to supervised learning format for both a univariate and multivariate time series. Once our data has been transformed we'll show how to train both a Deep Neural Network and LSTM to make multivariate forecasts.

Examining Our Data

We'll be working with a dataset capturing daily mean temperature and humidity in Delhi, India between 2013 and 2016. This data is available on Kaggle and is licensed under CC0: Public Domain, making it ideal for our project.

Displaying the data in tabular format, we see that we have 3 columns: one containing the datetime of the measurement and one for each of 'meantemp' and 'humidity'. Mean temperature and humidity form the two time series of interest that we want to forecast in this project.

Figure 1: Base Dataset

We can graph this data (Figure 2) and see that the scale is fairly consistent between the two series and that there appears to be some inverse correlation: higher mean temperatures are associated with lower humidity.

# plotting the two series over time
import plotly.graph_objects as go

# columns we want to forecast (assumed from the dataset description above)
ids = ['meantemp', 'humidity']

# Create traces
fig = go.Figure()

for factor_level in ids:
    # Adding a line trace for each series
    fig.add_trace(go.Scatter(x=formatted_df['Date'], y=formatted_df[factor_level],
                        mode='lines',
                        name=factor_level))
fig.update_layout(title="Temp and Humidity over Time, Delhi India")
Figure 2: Temp and Humidity over time

Data Windowing

Now for the most difficult part of this approach: ensuring our data is in the correct format using a windowed approach. The problem we face is that we need to convert a univariate time series Y into two variables, X1 and Y1, where each row in X1 contains the n timesteps before the corresponding row in Y1. Here n is our window size, and the data transformation can be seen below in Figure 3.

If our initial Y dataset is 10 points and we set our window size n to 2, then our output X1 and Y1 will be of dimensions (8 x 2) and (8 x 1) respectively. With these two new variables we can then fit a normal regression model, modeling each of our Y1 values as a function of the two preceding ones.

As an example, in Figure 3 we can see that with a window size of 2 our first Y1 value is the 3rd entry in the initial Y series, and it depends on the values at indices 1 and 2 of Y.

Figure 3: Univariate Window Transformation
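
To make this concrete, below is a minimal sketch using a toy 10-point series and a hypothetical window_series helper (not the convert_to_supervised function used later) that reproduces the shapes described above:

import numpy as np

def window_series(y, window_size):
    '''Toy helper: slide a window over a 1d series, returning lagged inputs and targets'''
    X1, Y1 = [], []
    for i in range(window_size, len(y)):
        X1.append(y[i - window_size:i])  # the n preceding values
        Y1.append(y[i])                  # the value to predict
    return np.array(X1), np.array(Y1)

y = np.arange(10)                 # a 10-point series: 0, 1, ..., 9
X1, Y1 = window_series(y, 2)
print(X1.shape, Y1.shape)         # (8, 2) and (8,), i.e. 8 x 2 and 8 x 1
print(X1[0], Y1[0])               # [0 1] 2: the first target depends on the two values before it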

If instead of a single time series we are interested in forecasting multiple, we can perform the exact same process and concatenate the resulting datasets together, as shown in Figure 4. In the figure we have two initial time series, Y1 and Y2, both of the same length t, and we again choose a window size of n = 2 to perform our transformation. Once each series is transformed as we did in the univariate example, we simply concatenate them together, giving an X_T of dimensions (t − n) × 2n and a Y_T of dimensions (t − n) × 2.

Figure 4: Multivariate Window Transformation

If we had one time series that we wanted to forecast and one exogenous series, we could perform the same transformation as in Figure 4, except our Y_T would have just one column rather than two. We might do this, for example, if we wanted to forecast air quality as our response variable and had an exogenous variable like commuter miles driven that was correlated with our response but that we weren't interested in forecasting. A minimal sketch of this case is shown below.
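
This sketch reuses the toy window_series helper from above; the series values and names here are purely illustrative, not from the Delhi dataset:

# Response series we want to forecast and an exogenous series we don't
air_quality = np.linspace(50.0, 70.0, 10)
commuter_miles = np.linspace(10.0, 30.0, 10)

X_resp, Y_resp = window_series(air_quality, 2)
X_exog, _ = window_series(commuter_miles, 2)    # exogenous targets are discarded

X_T = np.concatenate([X_resp, X_exog], axis=1)  # (8, 4): lags of both series
Y_T = Y_resp.reshape(-1, 1)                     # (8, 1): only the response column
print(X_T.shape, Y_T.shape)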

Applying Data Transformation

Now we turn to applying the data windowing process to our dataset. In this case, our base dataset has 2 columns we want to forecast labeled ‘meantemp' and ‘humidity' and a date column as shown in Figure 1.

The first thing we'll do is apply a Min-Max scaler to ensure that each of the columns is on the same scale, which improves model performance. Unlike traditional ARIMA processes, neural networks don't require a dataset to be stationary for forecasting; however, standardization and normalization have been shown to significantly improve performance [3].

# creating scaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

# normalizing each target column to the 0-1 range
for col in ids:
    formatted_df[col] = scaler.fit_transform(formatted_df[[col]])
formatted_df.head(10)
Figure 5: Scaled Dataset

Next, we split our dataset into a train set and a test set. Our data is already sorted by Date in ascending order, so we can take the first portion of our dataset as the train set and the end portion as the test set.

Note: When working with time series data it's critical to ensure data is sorted and split by time so we don't have data leakage from the future into the past.

# setting test/train ratio
total_observations = len(formatted_df)
train_ratio = 0.7

# performing time based test/train split
train_df = formatted_df[:int(total_observations * train_ratio)]
test_df = formatted_df[int(total_observations * train_ratio):]
print(len(train_df), len(test_df))

Finally, we need to transform our data from a standard time series format to a windowed format as illustrated in Figures 3 and 4. To do this we define 2 helper functions: convert_to_supervised, which takes a 1-dimensional time series and converts it to a windowed format, and rename_dataframe_supervised, which updates the column names to t − n, …, t − 1, t.

import numpy as np
import pandas as pd

def convert_to_supervised(data, window_size=1, forecast_size=1, dropnan=True):

    '''Converts a 1d time series dataset into a 2d supervised learning format'''

    df = pd.DataFrame(data)
    cols = list()
    # Training sequence: t-window_size, ..., t-1
    for i in range(window_size, 0, -1):
        cols.append(df.shift(i))

    # Forecast sequence: (t, t+1, ... t+forecast_size)
    for i in range(0, forecast_size):
        cols.append(df.shift(-i))

    # Concatenating columns together
    agg = pd.concat(cols, axis=1)

    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values

def rename_dataframe_supervised(data_df, state_name=""):

    '''Takes a dataframe with integer column names 0-x, relabels them as t - x through t,
    and returns it. Used to illustrate how the time series to supervised conversion works'''

    # Renaming columns in dataframe
    max_col = max(data_df.columns.astype(int))
    for col in data_df.columns:
        if int(col) == max_col:
            data_df.rename(columns={col: "t"}, inplace=True)
        else:
            data_df.rename(columns={col: "{state}t - {val}".format(state=state_name, val=max_col - col)}, inplace=True)
    return data_df
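
As a quick sanity check, here's a small example on a toy 10-point series (not part of the original notebook) showing the column layout the two helpers produce:

# Toy example: window a short series and label the resulting columns
toy_series = [float(i) for i in range(10)]
toy_windowed = convert_to_supervised(toy_series, window_size=3)
toy_df = rename_dataframe_supervised(pd.DataFrame(toy_windowed))
print(toy_df.head())
# Columns come out as 't - 3', 't - 2', 't - 1', 't'; the first row is [0, 1, 2, 3],
# i.e. the target t is predicted from the three preceding values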

With our helper functions defined, we can now transform our data by iterating through each of the series (in this case ‘meantemp' and ‘humidity'), converting the series to a supervised format, renaming it, and then concatenating to a numpy array.

timesteps = 6
features = len(ids)

count = 0
# Storing results
train_x_all, test_x_all = np.array([]), np.array([])
train_y_all, test_y_all = np.array([]), np.array([])

# Iterating through each series of interest
for val in ids:

    train_vals, test_vals = train_df[val].values, test_df[val].values

    # Converting each series to the windowed, supervised format
    train_data, test_data = convert_to_supervised(train_vals, window_size=timesteps), convert_to_supervised(test_vals, window_size=timesteps)

    # Converting to dataframe
    train_df_sup, test_df_sup = rename_dataframe_supervised(pd.DataFrame(train_data)), rename_dataframe_supervised(pd.DataFrame(test_data))

    # Separating x and y
    train_y, train_x = train_df_sup['t'].to_numpy(), train_df_sup.drop(columns = ['t']).to_numpy()
    test_y, test_x = test_df_sup['t'].to_numpy(), test_df_sup.drop(columns = ['t']).to_numpy()

    if count == 0:
        train_x_all = train_x
        train_y_all = train_y
        test_x_all = test_x
        test_y_all = test_y
    else:
        train_x_all = np.concatenate((train_x_all, train_x), axis = 1)
        train_y_all = np.stack((train_y_all, train_y), axis = 0)
        test_x_all = np.concatenate((test_x_all, test_x), axis = 1)
        test_y_all = np.stack((test_y_all, test_y), axis = 0)

    count += 1
train_y_all = train_y_all.T
test_y_all = test_y_all.T
print('Y-Train Shape: ', train_y_all.shape)
print('X-Train Shape: ', train_x_all.shape)

After performing the above transformation we want to verify that our data is in the correct format by examining the X and Y training sets. Our original dataset had 1462 observations; using 70% of it for training gives 1023 training observations, of which we lose 6 to the window. As a result, our new training set should have 1017 rows, with 2 columns in the Y set and 12 columns in the X set. This is exactly what we have, as shown in Figure 6, so we can proceed.

Figure 6: Output Data Dimensions

Before we move on to model training there's one final data transformation we need to make. We have been shaping our data in 2 dimensions, however our neural network expects a 3-dimensional X array of the form [samples x timesteps x features]. Because each window has already been flattened into the feature dimension, we use a single input timestep, reshaping our array as shown below from (1017 x 12) -> (1017 x 1 x 12). Our final X format is a 3d numpy array of dimensions [rows (observations) x timesteps x columns (features)].

Now the complicated part is done and we're ready to jump into modeling!

# need data formatted as [samples x timesteps x features]
train_x_all = np.reshape(train_x_all, (train_x_all.shape[0], 1, train_x_all.shape[1]))
test_x_all = np.reshape(test_x_all, (test_x_all.shape[0], 1, test_x_all.shape[1]))

# checking shape
print('Train X shape:', train_x_all.shape)
print('Test X shape:', test_x_all.shape)
Figure 7: Final Data Shape

Developing Our First Model: Deep Neural Networks

Now that our data has been transformed into the required windowed format, we're ready to define and train our first model. To start we'll use a basic deep neural network with two hidden layers. Unlike traditional approaches such as ARIMA/SARIMA, which assume a linear relationship, deep neural networks can model non-linear time series, which can be a significant advantage depending on the dataset [1].

Because we'll be comparing multiple model types, our first move is to define a train_and_predict function, shown below. This function takes a model, compiles and fits it, makes predictions, and returns them.

To use this with a deep neural network, we define our model with 64 neurons (units = 64) in each Dense hidden layer and a relu activation function. This model is then passed to our train_and_predict function, which returns the results.

Given that our neural network is small, we can illustrate it in Figure 8. Because we chose a window size of 6 and have two series, we have a total of 12 inputs, shown by the 12 neurons in the input layer; we then have our 2 Dense hidden layers and finally our two outputs, one for each series.

One important feature of this network is that each input neuron is connected to every neuron in the first hidden layer, meaning our model is effectively using data from Series B to help predict Series A and vice versa. This is a significant benefit of using a unified model over separate individual models: if there's a relationship between the two series, we can capture it and use it to improve our overall forecast performance.

Figure 8: DNN Illustration

Defining our model:

# defining DNN model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

units = 64
model_dnn = Sequential()
model_dnn.add(Dense(units = units, activation = 'relu'))
model_dnn.add(Dense(units = units, activation = 'relu'))
model_dnn.add(Dense(features))
epochs = 100

# training model and returning predictions
trainPredict_dnn, testPredict_dnn, trainY, testY = train_and_predict(model_dnn, train_x_all, test_x_all, train_y_all, test_y_all, epochs)

Defining our train and predict function that we'll pass the model to:

def train_and_predict(model, train_x_all, test_x_all, train_y_all, test_y_all, epochs):

    # compiling and fitting model
    model.compile(loss='mean_absolute_error', optimizer='adam')
    model.fit(train_x_all, train_y_all, epochs=epochs, batch_size=1, verbose=3)

    # make predictions
    trainPredict = model.predict(train_x_all).reshape(train_y_all.shape[0], features)
    testPredict = model.predict(test_x_all).reshape(test_y_all.shape[0], features)

    # inverting predictions back to initial scale
    trainPredict = scaler.inverse_transform(trainPredict)
    trainY = scaler.inverse_transform(train_y_all)
    testPredict = scaler.inverse_transform(testPredict)
    testY = scaler.inverse_transform(test_y_all)

    return trainPredict, testPredict, trainY, testY 

Now we've successfully trained our deep neural network and used it to make predictions, but we need a method to evaluate its performance. To do this we define a plot_comparison function, which takes our predictions and ground-truth labels, graphs each predicted versus actual series, and calculates the Mean Absolute Percentage Error (MAPE) for each of them.

from plotly.subplots import make_subplots
from sklearn.metrics import mean_absolute_percentage_error

def plot_comparison(train_pred, test_pred, train_y, test_y, dates, model_type):

    factor_levels = train_pred.shape[1]
    fig = make_subplots(rows=factor_levels, cols=1, subplot_titles=("Plot 1", "Plot 2"))
    mapes = []

    # iterating through factor levels
    for level in range(factor_levels):
        # defining dates for x axis
        train_dates = dates[:train_pred.shape[0]]
        test_dates = dates[-test_pred.shape[0]:]
        fig.append_trace(go.Scatter(x=train_dates, y=train_y[:, level],
                            mode='lines',
                            name='Actual-Train'),
                            row=level+1, col=1)
        # test set
        fig.append_trace(go.Scatter(x=test_dates, y=test_y[:, level],
                            mode='lines',
                            name='Actual-Test'),
                            row=level+1, col=1)

        # train pred
        fig.append_trace(go.Scatter(x=train_dates, y=train_pred[:, level],
                            mode='markers',
                            name='Pred-Train'),
                            row=level+1, col=1)
        # test pred
        fig.append_trace(go.Scatter(x=test_dates, y=test_pred[:, level],
                            mode='markers',
                            name='Pred-Test'),
                            row=level+1, col=1)

        # calculating mape on test set and updating title
        mape = round(mean_absolute_percentage_error(test_y[:, level],test_pred[:, level]),4)
        fig.layout.annotations[level].update(text =  "Series {series}, Test MAPE, {mape}".format(mape = mape, series = level + 1))
        mapes.append(mape)

    fig.update_layout(height=600, width=600, title_text="{model_type}, Forecast By Series, Avg MAPE {mape}".format(model_type = model_type, mape = round(sum(mapes)/len(mapes), 4)))
    fig.show()

We can now call our plot_comparison function to evaluate how well the Deep Neural Network performed.

# graphing data
dates = formatted_df['Date'].sort_values(ascending=True).drop_duplicates().to_list()
model_type = 'DNN'
plot_comparison(trainPredict_dnn, testPredict_dnn, trainY, testY, dates, model_type)

In Figure 9 we can see the results of this evaluation, showing the forecasted vs actual values for the test and train sets for each series ('meantemp' and 'humidity'). We can see that our model's predictions were reasonable approximations of the actual values, with an overall average MAPE of 0.0847 across both series.

Congratulations, we've now trained and evaluated our first Neural Network Forecaster!

Figure 9: Forecast Performance DNN

RNNs and LSTMs: An Overview

Now that we've trained and evaluated our first model, we want to see if we can improve; enter LSTMs. First, some quick background: what is an LSTM (Long Short-Term Memory network) and why might it outperform a basic feed-forward neural network?

An LSTM is a form of recurrent neural network (RNN), a network that feeds back into itself, giving it a form of 'memory' that can be useful for problems where the order of the data is important: time series forecasting, NLP, text translation, and others. The basic structure is shown in Figure 10; if we unroll it we get the structure on the right-hand side, where an input from time 0 is fed to time 1, modified, sent to time 2, and so on. More information can be found here: https://www.ibm.com/topics/recurrent-neural-networks.

Figure 10: Recurrent Neural Network (RNN) structure, By fdeloche – Own work, CC BY-SA 4.0

While this sounds great, we want to train an LSTM rather than a plain RNN because an RNN can suffer from the vanishing/exploding gradient problem due to the feedback loop shown in Figure 10. Imagine there is some initial input X0; this input is then scaled by the same weight W at each recurrent step, so X0's contribution to the output at time t follows the formula shown in Figure 11. If, for example, we had 10 recurrent steps and our weight was 2, then X0's contribution at the 10th step would be X0 · 2¹⁰, or 1024 · X0, which would overwhelm any Xt value. Alternatively, the gradient can vanish if the weight is less than 1, as a value less than 1 raised to a large power is almost 0.

The bottom line is that the vanishing and exploding gradient problem can result in past events either having no weight in our model or outweighing any current signal.

Figure 11: RNN Gradient Function
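
To see the compounding effect numerically (this is just an illustration, not part of the original notebook), we can repeatedly apply a single recurrent weight and watch X0's contribution explode or vanish:

# Illustration: how one recurrent weight compounds over 10 timesteps
x0 = 1.0
for w in (2.0, 0.5):
    contribution = x0
    for step in range(10):
        contribution *= w      # each recurrent step re-applies the same weight
    print(f"w = {w}: x0's contribution after 10 steps = {contribution}")
# w = 2.0 -> 1024.0   (explodes, overwhelming current inputs)
# w = 0.5 -> ~0.00098 (vanishes, past inputs carry almost no weight)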

An LSTM is designed to overcome this issue by having two memory paths, one for short-term and one for long-term memory. This makes an LSTM better suited for time series forecasting than a basic RNN; unfortunately, it also makes the base LSTM unit much more complicated than that of an RNN.

At a very high level, the LSTM unit can be seen in Figure 12; the top pathway, outputting Ct, represents the long-term memory. You'll notice that the long-term memory path has a multiplication element and an addition element to modify the value, however there isn't any weight applied along it, which avoids the vanishing/exploding gradient issue we saw with our base RNN.

The pathway on the bottom of Figure 12, outputting Ht, represents our short-term memory, and unlike our long-term memory it does have weights that can modify it. However, because this pathway is for the short term only and won't be compounded a large number of times, we shouldn't see exploding or vanishing gradients.

There are more components and interactions between short- and long-term memory that are beyond the scope of this article; if you're interested, my favorite explanation of LSTM structure is from StatQuest here: https://www.youtube.com/watch?v=YCzL96nL7j0. For our purposes, the important thing to know is that LSTMs were designed to overcome the shortcomings of RNNs and have been shown to outperform both traditional ARIMA approaches and deep neural networks in many applications [2].

Figure 12: LSTM unit structure, By Guillaume Chevalier, Licensed under Creative Commons [4].
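
If you'd like to see the two memory paths in code, here is a minimal numpy sketch of a single step of the standard LSTM equations (an illustration only, not the Keras LSTM implementation we use below; all names are illustrative). Note that the cell state c_prev is only gated and added to, never multiplied by a weight matrix, which is exactly the property discussed above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    '''One LSTM step, with gates stacked as [forget, input, candidate, output]'''
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b            # weights act on x_t and h_prev (short-term path)
    f = sigmoid(z[0:n])                     # forget gate
    i = sigmoid(z[n:2 * n])                 # input gate
    g = np.tanh(z[2 * n:3 * n])             # candidate values
    o = sigmoid(z[3 * n:4 * n])             # output gate
    c_t = f * c_prev + i * g                # long-term path: only gating and addition, no weight on c_prev
    h_t = o * np.tanh(c_t)                  # short-term output
    return h_t, c_t

# tiny usage example with hidden size 2 and input size 3
rng = np.random.default_rng(0)
n, m = 2, 3
W, U, b = rng.normal(size=(4 * n, m)), rng.normal(size=(4 * n, n)), np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(5, m)):         # run 5 timesteps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h, c)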

Training an LSTM

Now that we've had a brief overview of the theory behind an LSTM, we can move on to implementing one, which is fortunately much easier thanks to TensorFlow.

We define a simple sequential model below, adding one LSTM layer and one Dense layer to produce the output. In this case the input shape is (1, timesteps * features) = (1, 12), since we flattened each window into a single input timestep, and the Dense output layer has 2 units since we have 2 series we want to forecast. Finally, we pass our model to the same train_and_predict function that we used for the DNN.

# defining lstm model
from tensorflow.keras.layers import LSTM

units = 8
model_lstm = Sequential()
model_lstm.add(LSTM(units, input_shape=(1, int(timesteps * features))))
model_lstm.add(Dense(features))
epochs = 100
# training model and returning predictions
trainPredict_lstm, testPredict_lstm, trainY, testY = train_and_predict(model_lstm, train_x_all, test_x_all, train_y_all, test_y_all, epochs)

After making our predictions we want to examine the results and see how our LSTM performed relative to the DNN. To do this we can simply call the same plot_comparison method used previously. The results are shown in Figure 13; similar to our deep neural network, the LSTM forecast closely mirrors the actual values, and our average MAPE is even lower: 0.0812 vs. 0.0847.

Our LSTM has outperformed our DNN.

# graphing data
dates = formatted_df['Date'].sort_values(ascending=True).drop_duplicates().to_list()
model_type = 'LSTM'
plot_comparison(trainPredict_lstm, testPredict_lstm, trainY, testY, dates, model_type)
Figure 13: Forecast Performance LSTM

Wrapping Up

Multivariate time series forecasting can easily become a complicated task, with different models for different series and the need to track and maintain all of them. As shown in this article, neural networks can provide an easy multi-output solution, enabling forecasting of multiple series simultaneously. This reduces the number of models we need to train and lets our model use one series to help predict another if there's correlation between the two.

Ultimately there's no magic bullet, and you should use the model that performs best for your use case, but neural networks can be an incredibly useful tool for time series forecasting and should be in every toolbox.

Citations:

  1. Casolaro, A., Capone, V., Iannuzzo, G., & Camastra, F. (2023). Deep Learning for Time Series Forecasting: Advances and Open Problems. Information, 14(11), 598. https://doi.org/10.3390/info14110598
  2. S. Siami-Namini, N. Tavakoli and A. Siami Namin, "A Comparison of ARIMA and LSTM in Forecasting Time Series," 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 2018, pp. 1394–1401. https://doi.org/10.1109/ICMLA.2018.00227
  3. Tawakuli, A., Havers, B., Gulisano, V., Kaiser, D., & Engel, T. (2024). Survey: Time-series data preprocessing: A survey and an empirical analysis. Journal of Engineering Research. https://doi.org/10.1016/j.jer.2024.02.018
  4. Long short-term memory. (2024, October 17). In Wikipedia. https://en.wikipedia.org/wiki/Long_short-term_memory
  5. Recurrent neural network. (2024, October 24). In Wikipedia. https://en.wikipedia.org/wiki/Recurrent_neural_network

Resources

  1. Datasource: https://www.kaggle.com/datasets/sumanthvrao/daily-climate-time-series-data
  2. Code for this article on Github: https://github.com/pinstripezebra/forecasting/blob/main/Neural%20Network%20Forecast%20-%20Simple.ipynb
  3. LSTM Implementation in Keras: https://keras.io/api/layers/recurrent_layers/lstm/
  4. A brief explanation of LSTM's: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
  5. Youtube video covering LSTM architecture: https://www.youtube.com/watch?v=YCzL96nL7j0

Graphs and Figures: Unless otherwise noted, all images are by the author.
