Dynamic Conformal Intervals for any Time Series Model


Depending on the purpose of a forecast, producing accurate confidence intervals can be a crucial task. Most classic econometric models, built upon assumptions about the distributions of predictions and residuals, have a way to do this built in. When moving to machine learning for time series, such as with XGBoost or recurrent neural networks, it can be more complicated. A popular technique is the conformal interval, a method of quantifying uncertainty that makes no assumptions about prediction distributions.

Naive Conformal Interval

The simplest implementation of this method is to train a model and hold out a test set. If that test set contains at least 20 observations (assuming we want 95% certainty), we can build an interval by adding and subtracting, from every point prediction, the 95th percentile of the absolute test-set residuals. We then refit the model on the entire series and apply this plus/minus to all point predictions over the unknown horizon. This can be thought of as a naive conformal interval.
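
To make that concrete, here is a minimal sketch of the naive interval in plain numpy, with made-up arrays standing in for the test-set actuals, test-set predictions, and final point forecasts (an illustration of the logic, not scalecast's internal code):

import numpy as np

# made-up data standing in for a real model's outputs
rng = np.random.default_rng(0)
test_actuals = rng.normal(100, 10, 24)            # observed test-set values
test_preds = test_actuals + rng.normal(0, 5, 24)  # model predictions on the test set
future_preds = rng.normal(100, 10, 24)            # point forecasts over the unknown horizon

alpha = 0.05
abs_resids = np.abs(test_actuals - test_preds)

# the plus/minus is the (1 - alpha) percentile of the absolute test-set residuals
plus_minus = np.percentile(abs_resids, 100 * (1 - alpha))

# the same plus/minus is applied to every step in the forecast horizon
lower, upper = future_preds - plus_minus, future_preds + plus_minus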

Scalecast is a forecasting library in Python that works well if you want to transform a series before applying an optimized machine or deep learning model to it, then easily revert the results. While other libraries offer flexible and dynamic intervals for ML models, I'm not sure they are built to effectively handle data that needs to be transformed and then reverted, especially with differencing. Please correct me if I'm wrong.

I made scalecast specifically for this purpose. Transforming and reverting series is ridiculously easy with the package, including options to use cross validation to find optimal transformation combinations. However, any confidence interval evaluated at a differenced level becomes complicated to map back onto the series' original level. If you simply undifference the interval the same way you would the point predictions, it will more than likely be unrealistically wide. One suggestion for avoiding this is to use models that don't require stationary data, such as ARIMA and exponential smoothing. But that's no fun if you really want to compare ML models and your data is not stationary.

The solution scalecast uses is the naive conformal interval described above. If a series' first, second, or seasonal difference is taken and then reverted, recalculating test-set residuals and applying percentile functions to them is simple. I evaluated the efficacy of this interval using a measure called MSIS in a past post.

pip install --upgrade scalecast

But this could be better. In time series, it is intuitive to expect that an interval should widen the further a point prediction sits from the last observed value, since errors accumulate: it is easier to predict what I will do tomorrow than what I will do a month from now. That intuitive concept is built into econometric methods but absent from the naive interval.

We could try to reconcile this problem in several ways, one of them being conformalized quantile regression, such as the approach utilized by Neural Prophet. That may make its way to scalecast one day. But the method I will overview here involves backtesting the model and applying percentiles to the residuals from each backtest iteration. Rather than relying on distributional assumptions, the method grounds everything in observed, empirical truth: the actual relationship between the implemented model and the time series it is applied to.

Backtested Conformal Interval

To do this, we need to split our data into several training and test sets. Each test set must be the same length as our desired forecast horizon, and the number of splits should be at least one divided by alpha, where alpha is one minus the desired confidence level. Again, this comes out to 20 iterations for intervals of 95% certainty. Considering we need to iterate through the entire length of our forecast horizon 20 or more times, shorter series may run out of observations before this process finishes. One way to mitigate this is to allow test sets to overlap. As long as each test set differs from the others by at least one observation and no future data leaks into any of the training sets, it should be okay. This may bias the interval towards more recent observations, but the option can be left open to add more space between training sets if the series contains enough observations to do so.

The process I described is referred to as backtesting, but it can also be thought of as modified time-series cross validation, which is a common way to facilitate more accurate conformal intervals. Scalecast makes the process of obtaining this interval easy through pipelines and three utility functions.
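
As a sketch of the splitting logic only (plain index arithmetic with assumed numbers, not scalecast's implementation), the splits might be generated like this:

import math

n = 300           # assumed series length
fcst_length = 24  # forecast horizon = length of each test set
alpha = 0.05
n_iter = math.ceil(1 / alpha)  # at least 20 splits for 95% intervals
jump_back = 1                  # space between consecutive training sets

for i in range(n_iter):
    train_end = n - fcst_length - i * jump_back           # training set shrinks moving backward
    train_idx = range(0, train_end)                       # no future data leaks into training
    test_idx = range(train_end, train_end + fcst_length)  # overlapping 24-step test windows
    # fit on train_idx, predict test_idx, and store the 24 residuals as one matrix row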

Build Full Model Pipeline

First we build a pipeline. Assuming we want differenced data and to forecast with the XGBoost model, the pipeline can be:

from scalecast.Forecaster import Forecaster
from scalecast.Pipeline import Pipeline, Transformer, Reverter
from scalecast.util import (
    backtest_for_resid_matrix,
    get_backtest_resid_matrix,
    overwrite_forecast_intervals,
)

# difference the series before modeling and revert the difference afterward
transformer = Transformer(['DiffTransform'])
reverter = Reverter(['DiffRevert'], base_transformer=transformer)

def forecaster(f):
    f.add_ar_terms(100)                 # 100 autoregressive terms
    f.add_seasonal_regressors('month')  # month-of-year seasonality
    f.set_estimator('xgboost')
    f.manual_forecast()

pipeline = Pipeline(
    steps = [
        ('Transform', transformer),
        ('Forecast', forecaster),
        ('Revert', reverter),
    ]
)

It is important to note that this framework can also be applied to deep learning models such as RNNs, classic econometric models, and even naive models. It will work for any model you want to apply to a time series.

Next, we fit_predict() the pipeline, generating 24 future observations:

f = Forecaster(
    y=starts, # an array of observed values
    current_dates=starts.index, # an array of dates
    future_dates=24, # 24-length forecast horizon
    test_length=24, # 24-length test-set for confidence intervals
    cis=True, # generate naive intervals for comparison with the end result
)

f = pipeline.fit_predict(f)

Backtest Pipeline and Build Residual Matrix

Now, we do the backtest. For 95% intervals, this means at least 20 train/test splits, iteratively moving backward through the most recent observations. This is the most computationally expensive part of the process. Depending on how many models we send through the pipeline (we can add more by expanding the forecaster() function), whether we optimize each model's hyperparameters, and whether we use multivariate processes, it can take a while. On my Macbook, this simple pipeline takes a little over a minute to backtest with 20 iterations.

backtest_results = backtest_for_resid_matrix(
    f, # one or more positional Forecaster objects can go here
    pipeline=pipeline, # both univariate and multivariate pipelines supported
    alpha = 0.05, # 0.05 for 95% cis (default) 
    bt_n_iter = None, # by default uses the minimum required: 20 for 95% cis, 10 for 90%, etc.
    jump_back = 1, # space between training sets, default 1
)

The backtest results returned by this function serve multiple purposes. We can use them to report average errors from our model(s), to glean insight from many out-of-sample predictions, or to generate intervals. To generate intervals, we do:

backtest_resid_matrix = get_backtest_resid_matrix(backtest_results)

This creates one matrix for each evaluated model, with one row per backtest iteration and one column per forecast step. The value in each cell is a prediction error (residual).

[Image by author: the backtest residual matrix, one row per iteration and one column per forecast step.]

Applying a column-wise percentile function, we can generate plus/minus values equal to the 95th percentile of the absolute residuals at each forecast step. On average, these values should grow the further out the forecast goes. In our example, the plus/minus is 15 for step 1, 16 for step 4, and 46 for step 24 (the last step). Not every value is larger than the one before it, but they generally increase.

[Image by author: the plus/minus value at each forecast step, generally widening as the step increases.]
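
In code, that column-wise operation looks roughly like the following (a sketch with a fabricated residual matrix, not the scalecast internals):

import numpy as np

# fabricated residual matrix: 20 backtest iterations x 24 forecast steps,
# with errors that tend to grow as the forecast step increases
rng = np.random.default_rng(1)
resid_matrix = rng.normal(0, np.arange(1, 25), size=(20, 24))

alpha = 0.05
# one plus/minus per forecast step: the 95th percentile of that column's absolute residuals
plus_minus_by_step = np.percentile(np.abs(resid_matrix), 100 * (1 - alpha), axis=0)

# each point forecast then gets its own step-specific interval:
# lower[i] = preds[i] - plus_minus_by_step[i]; upper[i] = preds[i] + plus_minus_by_step[i]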

Construct Backtested Intervals

We then overwrite the stale naive intervals with the new dynamic ones.

overwrite_forecast_intervals(
    f, # one or more positional Forecaster objects can go here
    backtest_resid_matrix=backtest_resid_matrix,
    models=None, # if more than one model is in the matrix, subset down here
    alpha = .05, # 0.05 for 95% cis (default)
)

Voila! We have an assumption-free and dynamic conformal interval built for our time series model.

How much better is this interval than the default? Using MSIS, a measure not many know about or use, we can score each obtained interval before and after this process. We can also use the coverage rate of each interval (the percent of actual observations that fell within the interval). We've set aside a separate slice of the data, which overlaps with none of our previously evaluated test sets, for just this purpose. The naive interval looks like:

[Image by author: the naive interval plotted against the held-out validation slice.]

This ended up being an accurate forecast with a tight interval. It contained 100% of the actual observations and scored an MSIS of 4.03. From my limited usage of MSIS, I think anything under 5 is usually pretty good. We apply the dynamic interval and get the following:

[Image by author: the backtested dynamic interval plotted against the same validation slice.]

This is good. We have an expanding interval that is tighter on average than the default interval. The MSIS score improved slightly to 3.92. The bad news: 3 of the 24 holdout observations fall outside the new interval's range, for a coverage rate of 87.5%. For a 95% interval, that may not be ideal.
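
For readers unfamiliar with MSIS (mean scaled interval score): it penalizes the width of the interval plus a charge of 2/alpha times the distance of any actual value that falls outside it, scaled by the in-sample seasonal naive error. A rough sketch of both scores, following the M4 competition definition rather than scalecast's own scoring code:

import numpy as np

def msis_and_coverage(actuals, lower, upper, insample, m=12, alpha=0.05):
    """Sketch of MSIS (M4 definition) and coverage, not scalecast's implementation."""
    actuals, lower, upper = map(np.asarray, (actuals, lower, upper))
    insample = np.asarray(insample)
    # interval score: width plus a 2/alpha penalty for every observation the interval misses
    score = (
        (upper - lower)
        + (2 / alpha) * (lower - actuals) * (actuals < lower)
        + (2 / alpha) * (actuals - upper) * (actuals > upper)
    )
    # scale by the in-sample seasonal naive MAE (m = seasonal period, 12 for monthly data)
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))
    coverage = np.mean((actuals >= lower) & (actuals <= upper))
    return np.mean(score) / scale, coverage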

Conclusion

Of course, this is just one example, so we should be careful about drawing overly broad conclusions. I do feel confident that the backtested interval will almost always expand the further out the forecast goes, which makes it more intuitive than the default interval. It is probably more accurate on average as well; it just costs more computational power to obtain.

In addition to gaining new intervals, we also obtained backtest information. Over 20 iterations, we observed the following error metrics:

[Image by author: average error metrics across the 20 backtest iterations.]
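
As a generic illustration of how averages like these can be pulled out of the backtest residuals (a numpy sketch with fabricated numbers, not the scalecast call that produced the table above):

import numpy as np

# fabricated residuals: 20 backtest iterations x 24 forecast steps
rng = np.random.default_rng(2)
resids = rng.normal(0, 20, size=(20, 24))

# average error metrics across all backtest iterations and forecast steps
mae = np.mean(np.abs(resids))
rmse = np.sqrt(np.mean(resids ** 2))
print(f'MAE: {mae:.2f}, RMSE: {rmse:.2f}')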

We can feel better about reporting these than the errors from just one test set.

Thank you for following along! If you like this content, it would mean a lot to me if you gave scalecast a star on GitHub and checked out the full notebook that accompanies this article. The data used is publicly available through FRED.

GitHub – mikekeith52/scalecast: The practitioner's forecasting library
