Uncertainty Quantification in Time Series Forecasting

Most of my recent articles revolved around knowing how sure a model is about its predictions. If we know the uncertainty of predictions, we can make well-informed decisions. I showed you how we can use Conformal Prediction to quantify a model's uncertainty. I wrote about Conformal Prediction approaches for classification and regression problems.
For these approaches, we assume that the order of observations does not matter, i.e., that our data is exchangeable. This is reasonable for classification and regression problems. However, the assumption does not hold for time series problems. Here, the order of observations often carries important information, such as trends or seasonality. Hence, the order must stay intact when we want to quantify the uncertainty of our predictions.
Does this mean that we cannot apply Conformal Prediction to time series?
Luckily no. We only need to use a different algorithm.
In 2021, the first Conformal Prediction algorithm for time series, "ENsemble Batch Prediction Interval" (EnbPI), was published. The approach does not require data exchangeability. Hence, EnbPI can handle non-stationary and spatio-temporal data dependencies. The results on different forecasting tasks were promising. EnbPI's prediction intervals had approximately valid marginal coverage. Also, they maintained coverage where other methods failed.
So, how does EnbPI work?
The idea of EnbPI is to train one underlying model on different subsets of the data and then derive the prediction interval by ensembling these models. EnbPI creates the different subsets by applying a bootstrapping approach in which we randomly sample subsets of the time series with replacement.
As EnbPI extends Conformal Prediction to time series, the approach has similar advantages:
- constructs distribution-free prediction intervals that reach approximate marginal coverage
- can use any estimator
- is computationally efficient during inference as we do not require retraining models
- performs well on small datasets as we do not need a separate calibration set
However, we should note that training the ensemble takes more time as we train several models.
Now that we covered the high-level idea and advantages, let's look at how EnbPI works under the hood.
The EnbPI Recipe
We can split the algorithm into a training and prediction phase. Each phase consists of two steps.
Before we start training our EnbPI ensemble we must choose an ensemble estimator. This estimator can be any model, e.g., a boosted tree, a neural network, a linear regression, or a statistical model.

Training Phase
The training phase consists of two steps: sampling bootstrap subsets of non-overlapping blocks from the training data and fitting a model to each of these subsets.
Step 1: Sample Bootstrap Subsets
Before we create the subsets we must decide how many models our ensemble should contain. For each model, we need one bootstrap sample.
One important requirement is that the blocks we sample from do not overlap, i.e., the time series is partitioned into disjoint segments. Because we draw these blocks with replacement, each subset contains a different combination of blocks from the original data set. As we train each model on different data, we introduce variability across the trained models in the ensemble. This increases the diversity of our ensemble and reduces overfitting.
Choosing a good ensemble size depends on the size of the data set. The smaller the data set, the fewer non-overlapping blocks we can create, as each subset must contain enough data points to train a model. Yet, a small ensemble size results in less diversity in our ensemble and thus a reduced performance of the EnbPI approach. This in turn leads to wider prediction intervals.
The more models we train, the better the performance tends to be, i.e., the narrower the prediction interval. This is because a more diverse ensemble can capture more variability in the data. Also, the ensemble becomes more robust as we aggregate more forecasts. However, training more models results in higher computational costs.
Usually, an ensemble size of 20 to 50 is enough, balancing efficiency and accuracy.
Once we have decided on the ensemble size, we must create one subset of the data for each model in the ensemble. We get a subset by drawing with replacement from the original dataset.
Note that we sample blocks instead of single values to account for the time dependency between observations. As we sample with replacement, some blocks may appear multiple times in a subset, while others are absent.
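The sampling step can be sketched in a few lines of NumPy. This is a minimal illustration, not any library's implementation; the sizes (hourly data, 24-hour blocks, 20 models) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples = 720     # e.g., 30 days of hourly data
block_length = 24   # one block = one day
n_models = 20       # ensemble size

# Partition the series into consecutive, non-overlapping blocks of indices.
n_blocks = n_samples // block_length
blocks = np.arange(n_blocks * block_length).reshape(n_blocks, block_length)

# For each ensemble member, draw blocks with replacement: some blocks
# appear several times in a subset, others not at all.
bootstrap_indices = [
    blocks[rng.integers(0, n_blocks, size=n_blocks)].ravel()
    for _ in range(n_models)
]
```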

Once we have our bootstrap samples, we can train our ensemble models.
Step 2: Fit Bootstrap Ensemble Models
For each bootstrapped subset of the training data, we train one model, creating a diverse ensemble of models. As we train each model on different data, we will receive different predictions on the same input data. This diversity is key to robust estimations of the prediction interval.
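A minimal sketch of the fitting step, using scikit-learn's LinearRegression on a synthetic lag-feature task as a stand-in for whichever estimator you choose; the series, the lags, and all sizes are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic hourly series with a daily cycle, turned into a lag-feature task.
y = np.sin(2 * np.pi * np.arange(720) / 24) + 0.1 * rng.normal(size=720)
X = np.column_stack([np.roll(y, lag) for lag in (24, 48)])[48:]
y = y[48:]

block_length, n_models = 24, 20
n_blocks = len(y) // block_length
blocks = np.arange(n_blocks * block_length).reshape(n_blocks, block_length)

models, in_sample = [], []
for _ in range(n_models):
    idx = blocks[rng.integers(0, n_blocks, size=n_blocks)].ravel()
    models.append(LinearRegression().fit(X[idx], y[idx]))
    in_sample.append(np.unique(idx))  # remember which rows each model has seen
```

Keeping track of which rows each model has seen (`in_sample`) pays off later, when the leave-one-out calibration needs exactly this information.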
Prediction Phase
Step 3: Leave-One-Out (LOO) Estimation
Once we have the trained ensemble, we determine how much the ensemble's forecasts deviate from the true values on data the models have not seen during training. We use these deviations to calibrate our ensemble and decide the width of our prediction interval. For this, we follow the standard Conformal Prediction recipe. If you want to read more about the recipe in detail, I recommend the article below.
The first step is to determine the non-conformity scores. In the case of EnbPI, the non-conformity score is the difference between the true and predicted value.
But on what data set do we calibrate the ensemble?
We do not have a calibration set. Remember that we trained each estimator in the ensemble on a different part of the training set. Thus, no estimator has seen all the data in the training set. Hence, we can use our training set for calibration.
For each observation in the training set, we make a prediction using the ensemble. However, we only use the ensemble estimators that have NOT seen the observation during training. Then we aggregate the predictions from these models using an aggregation function, e.g., mean or median.
The aggregation function affects the robustness of the predictions. The mean is generally sensitive to outliers but reduces the overall error. The median is robust against outliers and thus suitable for noisy data. Hence, we should choose the aggregation based on the data set and our use case.
Finally, we use the aggregated prediction to determine the non-conformity score for that particular observation. With this EnbPI uses out-of-sample errors as the non-conformity score. We will use these non-conformity scores to calibrate a forecast on unseen data in the next step.
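The leave-one-out scoring described above can be sketched as follows. To stay self-contained, the "models" here are trivial mean predictors; only the masking and aggregation logic matters. Using the absolute residual as the non-conformity score is an assumption for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=50.0, scale=5.0, size=240)  # toy training series

block_length, n_models = 24, 20
n_blocks = len(y) // block_length
blocks = np.arange(n_blocks * block_length).reshape(n_blocks, block_length)

# Toy "models": each one just predicts the mean of its bootstrap subset.
subsets = [blocks[rng.integers(0, n_blocks, size=n_blocks)].ravel()
           for _ in range(n_models)]
preds = np.array([np.full(len(y), y[idx].mean()) for idx in subsets])
seen = np.array([np.isin(np.arange(len(y)), idx) for idx in subsets])

# Leave-one-out aggregation: for each point, average only the models
# that did NOT train on it, then take the absolute residual as the score.
scores = np.empty(len(y))
for t in range(len(y)):
    out_of_sample = preds[~seen[:, t], t]
    if out_of_sample.size == 0:      # fallback: every model saw this point
        out_of_sample = preds[:, t]
    scores[t] = abs(y[t] - out_of_sample.mean())
```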

Step 4: Construct Prediction Intervals
For predictions on unseen data, we use each of the trained ensemble estimators. We aggregate the single predictions using the same aggregation function as in Step 3. The resulting value will be the center of our prediction interval.
To construct the prediction interval around the center, we use the distribution of residuals from Step 3. We determine a cut-off value using a pre-defined significance level. The cut-off value is then added to and subtracted from the predicted center to create our prediction interval.
This step should be familiar to you as most conformal prediction methods do it. If you are unfamiliar with it, I recommend the above article in which I describe the procedure in more detail.
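The interval construction can be sketched in a few lines, assuming symmetric intervals built from absolute non-conformity scores; the scores and the center forecast are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
scores = np.abs(rng.normal(scale=10.0, size=500))  # scores from Step 3
center = 85.0                                      # aggregated ensemble forecast

alpha = 0.05
# Cut-off: the (1 - alpha) quantile of the non-conformity scores.
cutoff = np.quantile(scores, 1 - alpha)
lower, upper = center - cutoff, center + cutoff
```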
(Optional) Step 5: Update the Non-Conformity Scores
The above-described approach uses the non-conformity scores we computed based on our training data.
We do not update them once we receive new data. Hence, the width of our prediction interval will not change over time as we do not add new information. This can be problematic if our underlying data or the model's performance changes.
To account for such changes, we can update the non-conformity scores as soon as new observations become available. For this, we do not need to retrain the ensemble estimators. We only need to compute the non-conformity scores. With this, they reflect the most recent data and model dynamics, resulting in an updated interval width.
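A minimal sketch of this update with a fixed-size window of scores; the window length and the helper `update_scores` are illustrative choices, not part of any library.

```python
import numpy as np

def update_scores(scores, y_true, y_pred, window=500):
    """Append fresh out-of-sample residuals and drop the oldest scores."""
    new = np.abs(np.asarray(y_true) - np.asarray(y_pred)).tolist()
    return (scores + new)[-window:]

# Initial pool of scores from the training data.
scores = np.abs(np.random.default_rng(3).normal(size=500)).tolist()

# A new day of observations arrives: 24 true values and ensemble forecasts.
scores = update_scores(scores, y_true=[50.0] * 24, y_pred=[48.0] * 24)
```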
Forecasting example using EnbPI
Now that we know how EnbPI works, let's apply the model to a forecasting task.
We will predict the wholesale electricity prices in Germany for the next day. The data is available from Ember under a CC-BY-4.0 license.
Luckily, different packages, like MAPIE, sktime, and Amazon Fortuna, have implemented EnbPI. Hence, it is straightforward for us to use the method. Here, I will use the EnbPI implementation from the mapie library.
Please note that I am not trying to get a forecast as accurate as possible but rather show how we can apply EnbPI.
Alright. Let's get started.
Let's import all the libraries and the datasets we need.
I also changed the data to mimic a data shift by 100 €/MWh in the last two weeks of the data set. This is to see how adaptive the prediction intervals are to sudden changes.
We will skip a detailed data exploration step here. However, we can see two seasonal components:
- Daily: Prices are higher in the morning and evening hours as the electricity consumption is usually higher during these hours.
- Weekly: Prices are higher on weekdays than on weekends as electricity consumption is usually higher on weekdays.
Based on that I derive the following datetime and lag features:
- datetime features: day, month, year, and whether the day is a weekend
- lag features: same-hour values of the past 7 days and the moving average over the past 24 hours, lagged by 1 day
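In pandas, the feature engineering described above could look roughly like this; the column names and the synthetic price series are assumptions standing in for the real Ember data.

```python
import numpy as np
import pandas as pd

# Toy hourly price series standing in for the Ember data set.
idx = pd.date_range("2023-01-01", periods=24 * 30, freq="h")
df = pd.DataFrame(
    {"price": np.random.default_rng(4).normal(80, 10, len(idx))}, index=idx
)

# Datetime features.
df["day"] = df.index.day
df["month"] = df.index.month
df["year"] = df.index.year
df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)

# Lag features: the same hour on each of the past 7 days ...
for d in range(1, 8):
    df[f"lag_{d}d"] = df["price"].shift(24 * d)

# ... and the 24-hour moving average, lagged by one day.
df["ma_24h_lag1d"] = df["price"].rolling(24).mean().shift(24)

# Drop the warm-up rows that have no complete lag history.
df = df.dropna()
```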
The next step is splitting our dataset into a training and test set. Although I only want to forecast the next 24 hours, I will use the last 30 days as my test set. This will give us a better idea of how the prediction intervals change over time.
Finally, I created a function to plot the results.
Ensemble model
Before we can apply EnbPI, we need to decide on an underlying model for the ensemble. I will use LightGBM but we could use any other model.
I will skip the hyperparameter optimization as we are not interested in the performance of the underlying model. However, the more accurate your model is, the better your prediction interval will be.
Training the EnbPI ensemble
Let's wrap EnbPI around the model. The implementation is straightforward.
First, we must create our bootstrap samples using Mapie's BlockBootstrap class.
Here, we choose the number of resamplings, i.e., how many models should be in the ensemble, and the length of each block. I choose 20 resamplings with a block length equal to our forecast horizon of 24 hours. Moreover, I state that the blocks are non-overlapping.
Second, we initialize the EnbPI model using Mapie's MapieTimeSeriesRegressor class. We pass in our model and define the aggregation function. Once we have initialized our model, we can fit it with the fit() method.
Once we have fitted the ensemble, we can run predictions using the predict() method. Besides the features of our test set, we also pass in the significance level alpha. I use 0.05, which means the prediction interval should contain the true value with a probability of 95 %.
Let's look at the result.

This looks decent. The coverage is 91 %, which is below our target, and the width of the interval is 142.62 €/MWh. The coverage probably falls short of the 95 % target because of the shift of the target in the middle of the test period. Moreover, we can see that the width of the interval does not change.
Updating the Non-Conformity Scores
We can easily update the non-conformity scores after new data becomes available. For this, we can use Mapie's partial_fit() method.
We only need to update the code slightly. The difference is that we now simulate that only the next 24 hours of data from the test set become available at a time.
Let's look at the result.

The results look the same as above. The coverage is 91 %, which is below our target, and the width of the interval is 142.62 €/MWh. Unfortunately, the width of the interval also stays the same.
Conclusion
This article has become very long, longer than I intended, but there was a lot to cover. If you made it this far, you should now
- have a good understanding of how the EnbPI method works, and
- be able to use EnbPI in practice.
If you want to dive deeper into the EnbPI method, check out [this](https://arxiv.org/pdf/2010.09107) and this paper. Otherwise, comment and/or see you in my next article.