3 Common Time Series Modeling Mistakes You Should Know


Photo by Diggity Marketing on Unsplash

I've done it many times myself: hitting run on some model training code and having a "WOW" moment when the error scoring comes out great. Suspiciously great. Then, digging through the feature engineering code, I find a calculation that baked future data into the training set, and fixing that feature pumps those mean squared errors back up to reality. Now where's that whiteboard again…

Time series problems have a number of unique pitfalls. Luckily, with some diligence and a little practice, you'll be accounting for these pitfalls long before typing from sklearn import into your notebook. Here are three things to look out for, and some scenarios where you might run into them.


Look-Ahead Bias

This one's almost certainly the first hazard you'll encounter with time series, and overwhelmingly the most frequent one I see in entry-level portfolios (looking at you, generic stock market forecasting project). The good news is that it's generally the easiest to avoid.

The Problem: Simply put, look-ahead bias is when your model is trained using future data it would not have access to in reality.

The typical way you'd introduce this issue into your code is by randomly splitting training and testing data into two chunks of a predetermined size (e.g. 80/20). Random sampling will mean both your training and test data cover the same time period, so you'll have "leaked" knowledge of the future into your model.

When it comes time to validate with the test data, the model already knows what happens. You'll inevitably get some pretty stellar, yet bogus error scores this way.

The Fix: Split your dataset using a cutoff in time rather than holding out a percentage of the data.

For example, if I have data that covers 2013–2023, I might set 2013–2021 as my training data and 2022–2023 as my test data. In a simple use case, the test data then covers a time period the model is completely naive to, and your error scoring should be accurate. Remember, this also applies to the likes of k-fold cross-validation.
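As a rough sketch, here's what that time-based split can look like in pandas, assuming a DataFrame df indexed by date with a hypothetical "target" column (both names are placeholders for whatever your data actually uses):

```python
import pandas as pd

# Assumed: df is a DataFrame with a DatetimeIndex and a "target" column.
# A random split (e.g. sklearn's train_test_split) would leak future rows
# into training -- instead, split on a cutoff date.
cutoff = pd.Timestamp("2022-01-01")

train = df.loc[df.index < cutoff]    # 2013-2021: what the model learns from
test = df.loc[df.index >= cutoff]    # 2022-2023: strictly in the "future"

X_train, y_train = train.drop(columns="target"), train["target"]
X_test, y_test = test.drop(columns="target"), test["target"]
```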

An alternative approach for cross-validating a time series model is called "walk-forward" validation. Instead of holding out random samples, you start with a continuous subset of the data and hold out the trailing n periods.

For example, if your data covers a seven-year period, start with the first three years: train the model on the first two years and test on the third. Then "walk forward" through the data by moving the window along one period at a time, collecting the error scores as you go.

A walk-forward validation process with five iterations. Chart created with Plotly.
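If you want this out of the box, scikit-learn's TimeSeriesSplit gives you an expanding-window flavor of walk-forward validation. A minimal sketch, assuming X and y are already sorted in chronological order and a simple linear model stands in for whatever you're actually fitting:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

tscv = TimeSeriesSplit(n_splits=5)  # five walk-forward iterations
scores = []

for train_idx, test_idx in tscv.split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])      # train only on earlier periods
    preds = model.predict(X[test_idx])         # test on the next chunk of time
    scores.append(mean_squared_error(y[test_idx], preds))

print(f"Mean walk-forward MSE: {np.mean(scores):.3f}")
```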

Examples

Stock Prices: Suppose you're developing a stock price prediction model that estimates the percentage change in stock price a month into the future. If you randomly split the data, the model already knows what the stock price will be over the next month, and you'll think you've hit gold when you get the RMSE scores back.

Customer Churn: If you're trying to predict an event and your labels are determined with knowledge of future events, you're also introducing look-ahead bias. For example, if you label a customer's eventual outcome as "churned" during a time period when they hadn't actually churned yet, then your model has access to future information in the training data.


Survivorship Bias

You might have heard the example of engineers calculating where to add armor to World War II bombers. When planes returned, engineers looked for where the aircraft frequently took heavy damage, to figure out where to place that armor.

Problematically, they were looking at where the surviving aircraft had taken damage and then made it home. The correct places to fortify were areas where the survivors were routinely not damaged, as planes being hit in those areas were not making it back.

Non-critical damage locations on a WW2 bomber. Note the complete lack of survivable damage on the rear fuselage section. Credit: Martin Grandjean (vector), McGeddon (picture), Cameron Moll (concept) on Wikipedia.

The Problem: If your training data only includes instances where something survived a process and ignores the times when the process was not completed, you've introduced survivorship bias.

The Fix: It's more complicated than avoiding look-ahead bias, and also more subjective. The key is fully understanding your data and the underlying processes behind what your data represent. Good questions to start with are:

  • Does my data include all outcomes, or at least all of them apart from outliers and edge cases?
  • Am I measuring from when a process is initiated, or am I only using data after the process reaches a certain point?

If your data includes all outcomes and you're measuring from the outset of the process, you're already much of the way there. The rabbit hole of potential "however" cases for those questions can run deep, but as long as you take the time to build up a picture of your processes, you should be able to avoid the most obvious issues.

Examples

Online Marketing: An online store is trying to figure out how to refine who they target with advertising for their product. If they only look at their users who have converted, i.e., what worked, they're ignoring what might be causing people to not convert.

Real Estate Risk Analysis: An analyst is studying housing prices over time in a certain region. They use a current map, and so only consider neighborhoods that have survived without major incidents (natural disasters, economic decline, and so on). They will probably underestimate the risk and overestimate the return of real estate investment in that region.

Stock Prices: Another stock example, but from a different angle. A trader is creating an algorithm to predict prices of all the stocks in the S&P 500. If the trader uses the current roster of companies, the list only includes companies that have survived to the present without shutting down or losing enough value to drop off the index. The trader should instead use the index membership as it was at the start of the training data.


Ignoring Autocorrelation

Autocorrelation is the lurker of time series pitfalls. It's not as conspicuous as look-ahead bias, but equally destructive. Like look-ahead bias (and less like survivorship bias), it's totally manageable once you know what to look for, though perhaps more involved in terms of modeling itself.

The Problem: Autocorrelation is the correlation of a series with itself, i.e., how much is a data point influenced by previous data points in the same series? In an extremely simple example, a sine wave is perfectly correlated with a lagged copy of itself (shift it by a quarter period and you get a cosine wave), so a sine wave is strongly autocorrelated.

A sine wave and cosine wave. Chart created with Plotly.
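You can check for autocorrelation directly before choosing a model. A quick sketch using a toy noisy sine wave (any evenly spaced pandas Series works the same way):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Toy series: a noisy sine wave, so neighbouring points are clearly related.
t = np.arange(365)
y = pd.Series(np.sin(2 * np.pi * t / 30) + np.random.normal(0, 0.2, len(t)))

# Lag-1 autocorrelation: how strongly each point tracks the one before it.
print(f"Lag-1 autocorrelation: {y.autocorr(lag=1):.2f}")

# The autocorrelation function (ACF) plot shows the correlation at many lags,
# which also helps reveal seasonality (here, peaks roughly every 30 steps).
plot_acf(y, lags=60)
plt.show()
```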

Traditional regression models mostly assume independence between observations, and many newcomers to time series analysis apply those techniques right out of the gate, as if the observations really were independent.

The Fix: The simple and effective fix is to employ models that take into account the temporal relationships in the data. This could involve moving average models, autoregressive models, or more advanced techniques like ARIMA (Autoregressive Integrated Moving Average) and SARIMA (Seasonal ARIMA) models. These models recognize the autocorrelation and leverage it to make more accurate predictions.
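As a minimal sketch with statsmodels, assuming a pandas Series y like the one above (the (1, 1, 1) order is purely illustrative, not a recommendation for your data):

```python
from statsmodels.tsa.arima.model import ARIMA

# Time-based split: hold out the final 30 observations as the "future".
train, test = y[:-30], y[-30:]

# ARIMA(p, d, q): one autoregressive term, one round of differencing,
# one moving-average term -- illustrative values only.
model = ARIMA(train, order=(1, 1, 1))
fitted = model.fit()

forecast = fitted.forecast(steps=len(test))   # predict the held-out window
print(forecast.head())
```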

Just like we looked at adapting our cross-validation technique to account for look-ahead bias, we need to consider autocorrelation in our validation approach. In many cases, you can combine the walk-forward validation approach with models that handle autocorrelation to produce a robust solution.
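Putting the two together, here's a rough sketch of walk-forward validation with an ARIMA refit at each step, again using the illustrative series and order from above:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

n_test = 30                      # evaluate over the last 30 points, one at a time
split = len(y) - n_test
squared_errors = []

for i in range(n_test):
    history = y.iloc[: split + i]                      # expanding training window
    fitted = ARIMA(history, order=(1, 1, 1)).fit()     # refit on all data so far
    pred = float(fitted.forecast(steps=1).iloc[0])     # one-step-ahead forecast
    actual = float(y.iloc[split + i])
    squared_errors.append((actual - pred) ** 2)

print(f"Walk-forward RMSE: {np.sqrt(np.mean(squared_errors)):.3f}")
```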

Examples

Weather Patterns: Large-scale weather systems tend to move slowly, sometimes not at all for many days or weeks. Unless you live in a consistently volatile climate (think the UK or Pacific Northwest), the weather tomorrow will more likely than not be similar to today.

For 9 months of the year, a sunny day in Los Angeles will almost certainly be followed by another sunny day. A day of wet-season afternoon thunderstorms in hot and humid Singapore will probably be followed by a day with more afternoon thunderstorms.

Energy Consumption: Electricity demand in a given hour is usually similar to the demand at the same hour on previous days. Power usage at 3 am one day won't suddenly spike to 6 pm levels at 3 am the following day. Similarly, the gas consumption of a city in a given month is likely to be similar to its consumption in the same month in previous years, reflecting the influence of seasonality.

Web Traffic: The number of users online at a given time of day is likely to be similar to the number online at the same time on previous days. For example, a site dedicated to reporting football scores will reliably see its peak traffic repeat each week on weekend afternoons for as long as the season is active.


So, there you have it. Three common pitfalls, and a few ways to avoid them. Happy modeling!

