Understanding the Limitations of ARIMA Forecasting
Table of Contents
- Introduction
- ARIMA in a Nutshell
- Key Assumptions and Limitations of the ARIMA Model
- Forecasting Process in Python: Seasonal ARIMA vs. Facebook Prophet
- Performance Comparison Between Two Models
- Conclusion
Introduction
Time Series Forecasting has long been a vital tool for making predictions in businesses, especially before the rise of Machine Learning (ML). It plays an important role in areas such as resource allocation, inventory management, and budget planning. Often, forecasting is described as a blend of art and science because it requires both technical understanding and human intuition and judgement.
Among time series forecasting techniques, ARIMA (AutoRegressive Integrated Moving Average), originally developed by George Box and Gwilym Jenkins in 1970, has been one of the most popular methods. Despite its popularity, ARIMA has certain drawbacks. This article compares the performance of ARIMA with the Prophet model, developed by Facebook, while highlighting the limitations of the ARIMA model.
It is important to note that the objective of this article is not to prove that the Facebook Prophet model is universally better than the ARIMA model. Rather, the goal is to shed light on the nuances of ARIMA, allowing for a more informed choice when selecting a forecasting method.
ARIMA in a Nutshell
Before diving into the concept, let me briefly explain what the ARIMA model is. In short, ARIMA (AutoRegressive Integrated Moving Average) is a forecasting method that integrates time series techniques with elements of linear regression.
ARIMA consists of three components: AutoRegressive (AR, order "p"), Integrated (I, the differencing order "d"), and Moving Average (MA, order "q").
Simply put, ARIMA forecasts future data points from two sources: past values of the series (the AR term) and past forecast errors (the MA term). Because the standard ARIMA model relies solely on the history of the series itself, it is less flexible in adapting to new information such as external factors, holiday effects, and other irregular events.
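For readers who prefer notation, the standard textbook form of the model can be written as below. Here y'_t is the series after differencing d times, ε_t is the forecast error at time t, c is a constant, and the φ and θ symbols are the AR and MA coefficients. The symbols are the usual textbook ones, not something taken from the code in this article.

```latex
y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p}
         + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}
         + \varepsilon_t
```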
Key Assumptions and Limitations of the ARIMA Model
Stationarity
The main assumption of the ARIMA model is that the time series data is stationary. In other words, the statistical properties of the series (e.g., mean, variance, autocorrelation) need to remain consistent over time. This implies that data with trends or seasonality cannot be fed directly into the ARIMA model, because their mean or variance changes systematically over time.
Why do ARIMA models struggle with non-stationary data? The mathematical framework underpinning ARIMA is designed for stationary data: the model estimates a single set of coefficients and assumes they describe the whole series, so if the mean or variance drifts over time, relationships learned from the past no longer hold for the future. Hence, stationary data (a) simplifies the ARIMA modelling process and (b) leads to more reliable predictions.
Please note that this does not mean non-stationary data cannot be used with ARIMA. We can transform non-stationary data into stationary data through a process called differencing, which corresponds to the "d" term in ARIMA.
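To make the differencing step concrete, here is a minimal sketch in plain Python. The helper name `difference` and the toy series are illustrative assumptions, not part of the original analysis. It shows that first-order differencing turns a series with a linear trend into one whose level no longer drifts.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: y[t] - y[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Toy series with a linear upward trend: 10, 13, 16, ...
trend_series = [10 + 3 * t for t in range(8)]

first_diff = difference(trend_series)  # corresponds to d = 1
print(trend_series)
print(first_diff)  # every step has the same size, so the mean no longer drifts
```

Calling the same helper with `lag=12` on monthly data would give a seasonal difference, which is the idea behind the seasonal differencing term in seasonal ARIMA.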
Linearity
Another core assumption is linearity. Since ARIMA is a linear combination of past values and past forecast errors, the relationship between the input variables (the lags) and the output variables (the forecasts) is assumed to be linear.
However, the world is not always linear. If the data is non-linear, meaning the relationship between inputs and outputs is multiplicative or otherwise not proportional, the ARIMA model may not capture these complex patterns, resulting in sub-optimal forecasts. For instance, during the 2008 financial crisis, ARIMA models showed 30% higher prediction errors compared to normal market conditions [1], highlighting their inability to capture sudden market shifts or dynamics in non-linear data.
Additional Limitations
- While ARIMA is powerful for short-term forecasts, it may not be suitable for long-term forecasting. This limitation arises from both the MA and AR components. The MA component relies on past forecast errors, but beyond q steps ahead no observed errors are available, so its contribution drops out. The AR component builds each multi-step forecast on earlier forecasts rather than on observed values, so the influence of the most recent observations diminishes and the uncertainty grows as the forecast horizon lengthens.
- Parameter selection is another challenge, as it introduces subjectivity and requires considerable expertise to select suitable parameters. One study showed that uncertainty in parameter selection led to forecast errors ranging from 10% to 40% [1]. However, the good news is that there are now automated tools available in both Python and R, which simplify the process and reduce the need for manual parameter tuning.
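The horizon effect described in the first point can be illustrated with a hand-rolled AR(1) forecast. This is only a sketch under assumed values (`phi`, `mean`, and the last observation are made up, and a real ARIMA model has more terms). Each multi-step forecast is built from the previous forecast, so the influence of the last observed value shrinks geometrically and the forecast reverts to the series mean.

```python
def ar1_forecast(last_value, mean, phi, steps):
    """Iterated AR(1) forecasts: y_hat[h] = mean + phi * (y_hat[h-1] - mean)."""
    forecasts = []
    previous = last_value
    for _ in range(steps):
        previous = mean + phi * (previous - mean)
        forecasts.append(previous)
    return forecasts


# Hypothetical values: last observation 180, long-run mean 150, phi = 0.6.
forecasts = ar1_forecast(last_value=180.0, mean=150.0, phi=0.6, steps=12)
for horizon, value in enumerate(forecasts, start=1):
    print(f"h={horizon:2d}  forecast={value:7.2f}")
```

By the twelfth step the forecast is almost indistinguishable from the mean, which is why such a model carries little information about the distant future.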
Forecasting Process in Python: Seasonal ARIMA vs. Facebook Prophet
Data used
To demonstrate the key points of this analysis, I chose an open-source dataset that exhibits both seasonality and non-linearity to evaluate how ARIMA performs on it. Subsequently, we will compare the performance with the Facebook Prophet model, focusing on accuracy and reliability. For the error metric, I will use the Mean Absolute Error (MAE) as the primary metric, because the Mean Absolute Percentage Error (MAPE) can be biased: it penalises the same absolute error more heavily when the actual value is low than when it is high.
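The difference between the two metrics is easy to see with a small worked example. The sketch below uses plain Python and made-up numbers; the function names `mae` and `mape` are defined here for illustration only. The same absolute miss of 50 units produces a different percentage error depending on whether the actual value is low or high.

```python
def mae(actual, forecast):
    """Mean Absolute Error: average of |actual - forecast|."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean Absolute Percentage Error in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


# Same absolute miss of 50 units, once on a low and once on a high actual value.
low_actual, high_actual = [100.0], [150.0]
over_forecast, under_forecast = [150.0], [100.0]

print(mae(low_actual, over_forecast), mape(low_actual, over_forecast))
print(mae(high_actual, under_forecast), mape(high_actual, under_forecast))
```

MAE treats both cases identically, while MAPE reports a noticeably larger error for the low actual value.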
The dataset, available on Kaggle, contains monthly beer production data from January 1956 to August 1995. First, we'll examine the time series patterns. The data shows an upward trend with yearly seasonality, followed by a slight decline toward the end of the time series.

Next, the dataset will be split into training and test sets. For the test set, I will hold out the last 1 year and 8 months of data (January 1994 to August 1995), so the evaluation covers more than one full seasonal cycle.
Make sure you don't use a random split here: it is not logical to use data points from the future to make predictions about the past. The temporal order of the data is critical in time series analysis, so the test set must always be placed at the end.
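A chronological split of this kind can be sketched as follows. The example uses plain Python with month labels standing in for the real observations; the helper `month_labels` is hypothetical and only there to make the sketch self-contained.

```python
def month_labels(start_year, start_month, count):
    """Generate `count` consecutive 'YYYY-MM' labels."""
    labels = []
    year, month = start_year, start_month
    for _ in range(count):
        labels.append(f"{year}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


# January 1956 to August 1995 is 476 monthly observations.
months = month_labels(1956, 1, 476)

test_size = 20  # January 1994 to August 1995
train, test = months[:-test_size], months[-test_size:]

print(train[0], "->", train[-1], f"({len(train)} months)")
print(test[0], "->", test[-1], f"({len(test)} months)")
```

With a pandas DataFrame sorted by date, the equivalent would be `df.iloc[:-20]` and `df.iloc[-20:]`.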
Seasonal ARIMA
1. Stationarity
First, we need to check whether the time series is stationary. Based on the time series graph, it is highly likely that the data is not stationary due to the presence of seasonality and an upward trend. We can perform a visual check using a decomposition chart on the training data, which will reveal components such as trend and seasonal patterns.

The decomposition confirms the presence of a trend and yearly seasonality. To validate this observation more precisely, we can use a statistical test called the Augmented Dickey-Fuller (ADF) test. The null hypothesis of this test is that the data is non-stationary.
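from statsmodels.tsa.stattools import adfuller
# Augmented Dickey-Fuller test on the training series (null hypothesis: non-stationary)
adf_result = adfuller(train['Monthly beer production'].dropna())
print('ADF Statistic: %f' % adf_result[0])
print('p-value: %f' % adf_result[1])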
Results:
ADF Statistic: -2.214546
p-value: 0.201008
In our case, the p-value (0.20) is greater than the 0.05 significance level. This means there is not enough evidence to reject the null hypothesis, which is consistent with the visual impression that the data is non-stationary.
At this point, we could proceed directly to the auto ARIMA process, which automatically handles parameter selection – such as determining whether differencing is needed, the order of differencing, and so on. However, I'm walking through these steps to provide a better understanding of the ARIMA process.
2. Parameters Selection
Next, since the data is non-stationary, it needs to be transformed using the differencing method. This process removes trends or seasonality by stabilising the mean of the series.

1st Order Differencing ADF Results
ADF Statistic: -4.805417
p-value: 0.000053
After applying first-order differencing, the p-value from the ADF test drops well below the 0.05 significance level, indicating that the data has become stationary. Therefore, a differencing order of d=1 should be sufficient.
To select the AR term (p) and MA term (q) parameters, we will refer to the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots.
- ACF Plot: The ACF plot shows the correlation between the current value of the series at time t and its past values at various lags (i.e., t-1, t-2, etc.). In stationary data, the ACF plot should ideally show a rapid decline in correlation after a few lags, indicating no strong correlation patterns beyond these lags.
- PACF Plot: The PACF plot shows the correlation between the current value of the series at time t and its past values, after removing the influence of earlier lags. Essentially, it reveals the direct relationship between the series and a lagged version of itself, controlling for the effects of intermediate lags.

By examining the ACF and PACF plots, we can determine the MA term (q) and AR term (p), respectively. However, as mentioned earlier, selecting these parameters can be challenging. While I will not delve into the details of interpreting these plots here, we typically look for visual cues such as statistically significant spikes (spikes that extend beyond the blue confidence thresholds) and cut-off points to determine the orders.
For instance, in this scenario, the ACF plot shows a drop-off after lag 1, suggesting that an MA order of q = 1 might be appropriate. However, the lack of a clear cut-off point or tail-off complicates this decision. Similarly, in the PACF plot, there are some significant spikes, but no clear cut-off point, which may indicate that the model could be complex.
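To show what the ACF plot is actually computing, here is a minimal sketch of the sample autocorrelation in plain Python. The function `sample_acf` and the toy series are assumptions made for illustration. The threshold of 1.96 divided by the square root of n is the usual approximate 95% band for white noise and may differ slightly from the band drawn by a plotting library.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    denominator = sum(d * d for d in deviations)
    acf = []
    for lag in range(max_lag + 1):
        numerator = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
        acf.append(numerator / denominator)
    return acf


# Toy series with a period of 4, so lag 4 should stand out.
series = [1.0, 3.0, 2.0, 0.0] * 10
acf_values = sample_acf(series, max_lag=8)
threshold = 1.96 / math.sqrt(len(series))  # approximate 95% band for white noise

for lag, value in enumerate(acf_values):
    flag = "significant" if lag > 0 and abs(value) > threshold else ""
    print(f"lag {lag}: {value:+.2f} {flag}")
```

A strong spike at the seasonal lag, as in this toy series, is the kind of visual cue that points to a seasonal component in the model.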
3. Auto-ARIMA using pmdarima Python Package
Fortunately, today we have tools like the pmdarima package in Python and the auto.arima function (from the forecast package) in R that automate the ARIMA model selection process. These tools use information criteria such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) to find the optimal order of the model, making the process more objective.
import pmdarima as pm

# Find the optimal ARIMA parameters
arima = pm.auto_arima(train, error_action='ignore', trace=True,
                      suppress_warnings=True, maxiter=5,
                      seasonal=True, m=12)
The auto-ARIMA process provides several order options along with their corresponding AIC values. According to auto-ARIMA, the best model in this case is seasonal ARIMA (1,1,5)(2,0,2)[12].
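The information criteria used for this selection trade goodness of fit against model complexity. The sketch below computes AIC and BIC from their standard definitions for two hypothetical candidates. The log-likelihood values and parameter counts are made up for illustration and are not taken from the auto-ARIMA trace.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical candidates: (label, log-likelihood, number of estimated parameters).
candidates = [
    ("small model", -1720.0, 4),
    ("large model", -1710.0, 11),
]
n_obs = 456  # length of the training set in months

for label, log_lik, k in candidates:
    print(f"{label}: AIC={aic(log_lik, k):.1f}  BIC={bic(log_lik, k, n_obs):.1f}")

# Lower is better for both criteria.
best_by_aic = min(candidates, key=lambda c: aic(c[1], c[2]))[0]
best_by_bic = min(candidates, key=lambda c: bic(c[1], c[2], n_obs))[0]
print("Best by AIC:", best_by_aic)
print("Best by BIC:", best_by_bic)
```

With these made-up numbers the two criteria disagree: BIC penalises extra parameters more heavily and prefers the smaller model, which is one reason an automated search can still leave room for judgement.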
4. Time Series Cross-validation
After selecting the model, it is validated using rolling-window time series cross-validation with a 1-year forecast horizon. Traditional k-fold cross-validation cannot be used here because the temporal order must be preserved, which is the same reason we cannot use a random split. The rolling-window method trains on a fixed-size window (8 years in this case), forecasts the following year, and then shifts the window forward by 1 year before repeating.
import numpy as np
import pmdarima as pm
from pmdarima import model_selection

# Seasonal ARIMA with the order selected by auto-ARIMA
model = pm.ARIMA(order=(1, 1, 5),
                 seasonal_order=(2, 0, 2, 12),
                 suppress_warnings=True)
# Rolling window: 8 years of training data, shifted by 1 year, 1-year horizon
cv = model_selection.SlidingWindowForecastCV(window_size=96, step=12, h=12)
model_cv_scores = model_selection.cross_val_score(
    model, train, scoring='mean_absolute_error', cv=cv, verbose=2)
print("Model CV scores: {}".format(model_cv_scores.tolist()))
# Calculate and print the average MAE across the cross-validation folds
m_average_error = np.mean(model_cv_scores)
print("Average MAE for Model: {:.0f}".format(m_average_error))
The increase in MAE across the CV folds suggests that the model's performance deteriorates towards the end of the time series, as seen in the graph. This is possibly due to the sudden change of trend in the series. Another observation is that performance is not stable, with MAE ranging from 2 to 20.

Results:
Average MAE: 9
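The rolling-window scheme used in this step can be sketched in plain Python as index ranges. This is an illustration of the idea with the same settings (96-month window, 12-month step, 12-month horizon). The function `sliding_window_folds` is hypothetical, and the exact fold boundaries produced by pmdarima may differ.

```python
def sliding_window_folds(n_obs, window_size, step, horizon):
    """Yield (train_range, test_range) index pairs for a fixed-size rolling window."""
    start = 0
    while start + window_size + horizon <= n_obs:
        train_range = range(start, start + window_size)
        test_range = range(start + window_size, start + window_size + horizon)
        yield train_range, test_range
        start += step


# 456 training months, 8-year window, shifted by one year, one-year horizon.
folds = list(sliding_window_folds(n_obs=456, window_size=96, step=12, horizon=12))

print(f"{len(folds)} folds")
first_train, first_test = folds[0]
last_train, last_test = folds[-1]
print("first fold:", first_train, first_test)
print("last fold: ", last_train, last_test)
```

Every test range sits strictly after its training range, so no fold ever uses future observations to predict the past.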
5. Predictions
Finally, let's check the model's performance on the test set.
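# Assumption: mae is scikit-learn's mean_absolute_error
from sklearn.metrics import mean_absolute_error as mae
# Assumption: final_model is the selected model refitted on the full training set
final_model = model.fit(train)
# Forecast on the test set
forecast = final_model.predict(n_periods=len(test))
# Evaluate the forecast using MAE
mae_test = mae(test, forecast)
print(f"MAE on Test Set: {mae_test:.0f}")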
Results:
MAE on Test Set: 19
Surprisingly, the MAE increased from 9 (cross-validation) to 19 (test set). Although a slight change in the forecast horizon between the cross-validation and test sets could account for some difference, the significant increase in MAE indicates that the model may be overfitting and may not perform well on unseen data, as the error has doubled.
Let's move on to Prophet.
Facebook Prophet
Prophet is a time series forecasting model developed by Facebook's Core Data Science team. Unlike ARIMA, it is easy to get started with and does not require specialised knowledge, because most modelling choices are handled automatically. It is designed to be robust to outliers, missing data, holiday effects, and dramatic changes in trend. Prophet can be thought of as an additive regression model whose trend component is allowed to change over time, which lets it follow non-linear patterns.
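As a rough guide to what Prophet fits, the model is usually described as an additive decomposition of the series. In the expression below, g(t) is the trend, s(t) the periodic seasonality, h(t) the holiday effects, and ε_t the remaining error. The notation follows the common presentation of the model rather than anything in this article's code.

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```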
1. Data Preparation
The Prophet model requires specific data formatting, such as renaming the date column to ds and the time series values to y.
df_prophet = df.reset_index().rename(columns={'Month': 'ds', 'Monthly beer production': 'y'})
2. In-sample Predictions
Prophet has a built-in plot feature to visualise the in-sample predictions alongside the actual data points. The black dots represent the actual data points, while the blue line indicates the Prophet forecast. The light blue shaded area represents the uncertainty interval (or confidence interval) – the narrower the shade, the higher the confidence in the forecast.
According to this chart, the Prophet model seems to capture the broader trend and seasonality but struggles slightly with certain points after 1980, when the upward trend gives way to a more stable one.

3. Time Series Cross-validation
We will apply cross-validation on the training set to observe how the error metrics vary across different forecast horizons. One limitation of the Prophet model is that it is primarily designed for daily data, so some modifications are necessary when working with monthly data. Although Prophet can automatically handle cut-off dates during cross-validation, I manually set the cut-off dates to be annual since we're using monthly data. Similar to the ARIMA model, we'll start with 8 years of monthly data as the initial training set. The model will then forecast for the following year, and the forecast period will shift by one year with each iteration.
import pandas as pd
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics

# Define custom cutoffs for annual evaluation, starting after the initial period of 8 years
cutoffs = pd.date_range(start='1964-01-01', end='1992-12-01', freq='12MS')
# Fit the Prophet model
model_prophet = Prophet(yearly_seasonality=True, seasonality_mode='additive')
model_prophet.fit(prophet_train)
# Run cross-validation
df_cv = cross_validation(model_prophet, cutoffs=cutoffs, horizon='365 days', initial='2922 days')
# Calculate performance metrics
df_performance = performance_metrics(df_cv)
print(df_performance)
# Calculate and print the average MAE
average_mae = df_performance['mae'].mean()
print(f'Average MAE: {average_mae:.0f}')
Results:
Average MAE: 9
4. Predictions
Let's see how the Prophet model performs on the test set:
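from sklearn.metrics import mean_absolute_error
# Create predictions
future_df = model_prophet.make_future_dataframe(periods=len(prophet_test), freq='MS')
prophet_forecast = model_prophet.predict(future_df)
prophet_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].round().head()
# Extract the forecasted values for the test set period
forecast_test = prophet_forecast[['ds', 'yhat']].iloc[-len(prophet_test):]
# Evaluate the forecast using MAE (assumes prophet_test holds the actual values in column 'y')
mae_prophet = mean_absolute_error(prophet_test['y'], forecast_test['yhat'])
print(f"Average MAE on Test Set: {mae_prophet:.0f}")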
Results:
Average MAE on Test Set: 10
The test MAE of 10 does not deviate much from the cross-validation MAE of 9, so the model seems to perform well on unseen data.
Performance Comparison Between Two Models
The following chart compares the ARIMA and Prophet predictions on the test set.
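The numbers reported in the previous sections can also be put side by side. The short sketch below uses the rounded MAE values from this article and computes how much each model's error grew between cross-validation and the test set. The ratio is an illustrative summary chosen here, not a metric used in the original analysis.

```python
# MAE values reported earlier in the article (rounded to whole units).
results = {
    "Seasonal ARIMA": {"cv_mae": 9, "test_mae": 19},
    "Prophet": {"cv_mae": 9, "test_mae": 10},
}

ratios = {}
for name, scores in results.items():
    # Ratio above 1 means the error grew on the unseen test data.
    ratios[name] = scores["test_mae"] / scores["cv_mae"]
    print(f"{name}: CV MAE={scores['cv_mae']}, test MAE={scores['test_mae']}, "
          f"ratio={ratios[name]:.2f}")
```

On this dataset the ARIMA error roughly doubles on unseen data, while the Prophet error stays close to its cross-validation level.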


Conclusion
To sum up, this article explored some of the limitations and challenges of the ARIMA model by comparing it with the Prophet model. We discussed the assumptions and the modelling process behind ARIMA and observed where it struggled, particularly in handling non-linear data with strong seasonality and shifting trends.
While our findings might suggest that Prophet is superior in most aspects, this is not always the case. There are studies where ARIMA has outperformed Prophet, depending on the specific characteristics of the data. The key takeaway from this article is that selecting the right forecasting model depends on thoroughly understanding the nature of your data and the assumptions behind each method.
Ultimately, the decision between ARIMA and Prophet – or any other forecasting tool – should be driven by the unique requirements of your data and the objectives of your forecast.
References
- Petrică, A. C., Stancu, S., & Tindeche, A. (2016). Limitation of ARIMA models in financial and monetary economics. Theoretical & Applied Economics, 23(4).
- Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: principles and practice. OTexts.
- Valenzuela, O., Rojas, F., Pomares, H., & Rojas, I. (2019). Theory and Applications of Time Series Analysis. Springer International Publishing.
- Dataset Source: Data sets from "Forecasting: methods and applications" by Makridakis, Wheelwright & Hyndman (1998); License Information: GPL-3
Thank you for reading!
Disclaimer: While I've tried my best to unpack the theories accurately, I may still make some mistakes. Please feel free to let me know if you spot any errors or gaps in my theoretical knowledge.