Imagine you have multiple time series to forecast. Which model should you use for each one? ARIMA? ETS? Theta? Or just a simple Naive baseline?
Using one model for all series is easy but sacrifices accuracy since each series has different patterns. However, manually testing different models for each series creates problems:
- Time consuming: Hours spent fitting and comparing individual models
- Expertise required: Each algorithm needs different parameter configurations
- Inconsistent evaluation: Different validation approaches across models
- Doesn't scale: Manual testing for hundreds or thousands of series is impractical
StatsForecast automates this process by fitting multiple statistical models simultaneously, then using cross-validation to select the best performer for each time series.
This article demonstrates how to use StatsForecast's automatic model selection with the M4 hourly competition data, then compares it against TimeGPT, Nixtla's foundation model.
The source code of this article can be found in the interactive Jupyter notebook.
Introduction to StatsForecast
StatsForecast is an open-source Python library that automates statistical forecasting through:
- Automatic model selection: AutoARIMA, AutoETS, AutoCES, and AutoTheta optimize parameters automatically
- Speed: 20x faster than pmdarima, leveraging Numba for performance
- Scale: Handles hundreds or thousands of series with parallel processing
The library follows scikit-learn's familiar .fit() and .predict() API pattern.
To install StatsForecast, run:
pip install statsforecast utilsforecast
Additional dependencies for this tutorial:
pip install pandas numpy
Setup - M4 Hourly Competition Data
Import the necessary libraries:
import numpy as np
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS, AutoCES, AutoTheta, Naive, SeasonalNaive
from utilsforecast.plotting import plot_series
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mae, rmse, smape, mase
Load the M4 hourly benchmark dataset, which contains 414 hourly time series ranging from 700 to 960 observations each:
Y_train_df = pd.read_csv('https://auto-arima-results.s3.amazonaws.com/M4-Hourly.csv')
Y_test_df = pd.read_csv('https://auto-arima-results.s3.amazonaws.com/M4-Hourly-test.csv')
Y_train_df['ds'] = pd.to_datetime('2024-01-01') + pd.to_timedelta(Y_train_df['ds'], unit='h')
Y_test_df['ds'] = pd.to_datetime('2024-01-01') + pd.to_timedelta(Y_test_df['ds'], unit='h')
Sample 8 series for demonstration:
n_series = 8
uids = Y_train_df['unique_id'].drop_duplicates().sample(8, random_state=23).values
df_train = Y_train_df.query('unique_id in @uids')
df_test = Y_test_df.query('unique_id in @uids')
print(f"Training observations: {len(df_train)}")
print(f"Test observations: {len(df_test)}")
Training observations: 6640
Test observations: 384
Define evaluation metrics with hourly seasonality:
- MASE (Mean Absolute Scaled Error): Scaled error vs seasonal baseline (< 1.0 = beats baseline, = 1.0 = matches baseline, > 1.0 = worse than baseline)
- RMSE (Root Mean Squared Error): Measures the magnitude of prediction errors
- SMAPE (Symmetric Mean Absolute Percentage Error): Calculates percentage-based accuracy
from functools import partial
hourly_mase = partial(mase, seasonality=24)
metrics = [hourly_mase, rmse, smape]
Visualize the selected time series:
plot_series(df_train, df_test.rename(columns={"y": "actual"}), max_ids=4)
Each series shows different patterns. Some have strong daily cycles, others trend up or down over time, and some are quite volatile. This variety is why selecting the right model for each series matters.
Baseline Models - Naive and SeasonalNaive
Before diving into complex models, start with simple baselines:
- Naive: Uses the last observed value as the forecast
- SeasonalNaive: Captures seasonal patterns by repeating values from the previous cycle (24 hours ago for hourly data)
These baselines provide a performance floor. Any sophisticated model should beat these simple approaches.
sf_base = StatsForecast(
models=[Naive(), SeasonalNaive(season_length=24)],
freq='H',
n_jobs=-1
)
fcst_base = sf_base.forecast(df=df_train, h=48)
eval_base = df_test.merge(fcst_base, on=['unique_id', 'ds'])
Visualize baseline predictions:
plot_series(df_train, eval_base, max_ids=4, max_insample_length=5*24)
The plot shows 5 days of historical data followed by 48-hour forecasts. The cyan line shows SeasonalNaive following the daily rhythm, while the pink Naive line stays flat at the last value.
Evaluate baseline performance:
metrics_base = evaluate(
df=eval_base,
train_df=df_train,
metrics=metrics,
agg_fn='mean',
).set_index('metric')
metrics_base
The SeasonalNaive model performs significantly better than Naive, achieving MASE close to 1.0. This suggests strong seasonal patterns in the hourly data that repeating values from 24 hours ago captures effectively.
Advanced Statistical Models
Let's move beyond baselines with more sophisticated models:
- AutoARIMA: Handles autocorrelation, trends, and seasonality
- AutoETS: Auto-selects exponential smoothing components
- AutoCES: Offers flexible cyclical pattern modeling
- AutoTheta: Fast, robust forecasting that often wins competitions
These automatic models eliminate manual parameter tuning while ensuring forecasts meet statistical standards for accuracy and reliability.
Configure the automatic models:
models = [
AutoARIMA(season_length=24),
AutoETS(season_length=24),
AutoCES(season_length=24),
AutoTheta(season_length=24)
]
Fit all models and generate forecasts in one step:
sf = StatsForecast(
models=models,
freq='H',
n_jobs=-1
)
fcst_sf_models = sf.forecast(df=df_train, h=48, level=[90])
eval_sf_models = df_test.merge(fcst_sf_models, on=['unique_id', 'ds'])
The forecast() method automatically fits each model to every series, optimizes parameters, and generates predictions with uncertainty intervals in a single function call.
Visualize predictions from all models:
plot_series(df_train, eval_sf_models, max_ids=4, max_insample_length=5*24)
Unlike the simple baselines, these models adapt to the data's complexity and follow the actual patterns much more closely.
Evaluate performance across all models:
metrics_sf_models = evaluate(
df=eval_sf_models,
metrics=metrics,
train_df=df_train,
agg_fn='mean',
).set_index('metric')
metrics_sf_models
AutoCES achieves the lowest MASE (0.73) and RMSE (60.98), outperforming both AutoARIMA and baseline models. All automatic models beat the Naive baseline, demonstrating that automated parameter optimization works effectively.
Compare all models visually:
from utils import plot_metric_bar_multi
plot_metric_bar_multi(dfs=[metrics_sf_models, metrics_base], metric='mase')
This bar chart shows MASE (Mean Absolute Scaled Error), where values below 1.0 beat the baseline and values above 1.0 perform worse:
- Naive (MASE 8.03): Performs worst, far below the seasonal baseline
- AutoTheta (MASE 1.87) and AutoETS (MASE 1.33): Above 1.0, don't beat SeasonalNaive
- AutoARIMA (MASE 0.80) and AutoCES (MASE 0.73): Below 1.0, successfully outperform the baseline
Cross-Validation for Model Selection
While AutoCES performs best on average, different models might work better for individual series. Cross-validation helps identify the optimal model for each time series.
Traditional cross-validation uses random splits, which doesn't work for time series. Randomly shuffling would let the model peek into the future, creating data leakage. StatsForecast fixes this by ensuring models only train on historical data, never future values.
Time series cross-validation uses a rolling window approach:
- Train all models on initial window (yellow in the visualization below)
- Forecast the next period (red region) and evaluate forecast accuracy
- Slide the window forward in time and add new data to training set
- Repeat: train, forecast, evaluate
- Compare models across all windows and select the best for each series
Run cross-validation with rolling windows:
cv_df = sf.cross_validation(
df=df_train,
h=24,
step_size=24,
n_windows=2
)
print(f"Cross-validation results: {cv_df.shape}")
cv_df.head()
Cross-validation results: (384, 11)
Evaluate model performance and select the best for each series:
from utils import evaluate_cv, get_best_model_forecast
evaluation_df = evaluate_cv(cv_df, mae)
evaluation_df['best_statsforecast_model'].value_counts().to_frame().reset_index()
AutoARIMA was selected as the best model for 3 out of 8 series, while AutoETS and AutoTheta each won 2 series. This demonstrates that different models excel on different time series patterns.
Best Model Forecasts with Prediction Intervals
After selecting the best model for each series, let's generate final forecasts with uncertainty intervals:
best_fcst_sf = get_best_model_forecast(fcst_sf_models, evaluation_df)
eval_best_sf = df_test.merge(best_fcst_sf, on=['unique_id', 'ds'])
plot_series(df_train, eval_best_sf, level=[90], max_insample_length=5*24, max_ids=4)
These are the forecasts from the best model for each series. The shaded bands represent 90% prediction intervals:
- Tighter bands: More confidence in the forecast
- Wider bands: More uncertainty in the forecast
Calculate metrics for the best model selection approach:
metrics_sf_best = evaluate(
df=eval_best_sf,
train_df=df_train,
metrics=metrics,
agg_fn='mean',
).set_index('metric')
Compare baseline, individual models, and best model selection:
plot_metric_bar_multi(dfs=[metrics_sf_models, metrics_base, metrics_sf_best], metric='mase')
The green bar, which represents the best model selection, has the lowest MASE. By selecting the best model for each series, we get stronger performance than applying a single model to every time series.
TimeGPT - Foundation Model for Time Series
Now let's try TimeGPT, Nixtla's foundation model for time series forecasting. It's pre-trained on millions of diverse series and requires minimal tuning.
Why TimeGPT?
- Strong out-of-the-box accuracy with minimal tuning
- Handles trend/seasonality/holidays automatically
- Scales to many series with simple APIs
Setup TimeGPT Client
First, install the Nixtla package:
pip install nixtla
Initialize the TimeGPT client using an API key:
import os
from dotenv import load_dotenv
from nixtla import NixtlaClient
load_dotenv()
api_key = os.getenv("NIXTLA_API_KEY")
nixtla_client = NixtlaClient(api_key=api_key)
This connects to Nixtla's cloud service where the foundation model runs. The setup is simple: just import and authenticate.
Zero-Shot Forecast with TimeGPT
Zero-shot forecasting means using the pre-trained model directly without any training on your specific data. This approach saves time by eliminating the training step while still leveraging patterns learned from millions of diverse time series.
The code below generates 48-hour forecasts with 80% and 90% prediction intervals in a single API call:
fcst_timegpt = nixtla_client.forecast(
df=df_train,
h=48,
freq='H',
level=['80', '90']
)
fcst_timegpt.head()
Fine-Tuned Forecast with TimeGPT
Fine-tuning adapts the pre-trained model to your specific data patterns by running additional training steps on your dataset. This often improves accuracy for domain-specific patterns that differ from the general patterns TimeGPT learned during pre-training.
To fine-tune TimeGPT, simply add the finetune_steps parameter. Here we use 10 finetune steps:
fcst_timegpt_ft = nixtla_client.forecast(
df=df_train,
h=48,
freq='H',
level=['80', '90'],
finetune_steps=10
)
TimeGPT-2 - The Latest Foundation Model
TimeGPT-2, Nixtla's latest foundation model, brings enhanced forecasting accuracy and performance improvements. The model is currently available by invitation only.
To use TimeGPT-2, initialize the client with the preview API endpoint:
nixtla_client = NixtlaClient(
api_key=api_key,
base_url='https://api-preview.nixtla.io'
)
To use TimeGPT-2, simply add the model='timegpt-2' parameter to the API call:
fcst_timegpt_2 = nixtla_client.forecast(
df=df_train,
h=48,
freq='H',
level=['80', '90'],
model='timegpt-2'
)
eval_tgpt_2 = df_test.merge(fcst_timegpt_2, on=['unique_id', 'ds'])
Visualize TimeGPT-2 predictions:
fig = nixtla_client.plot(
df_train,
eval_tgpt_2,
level=['80', '90'],
max_insample_length=5*24,
max_ids=4
)
Compare the performance of the different TimeGPT variants:
metrics_tgpt = evaluate(
df=df_test
.merge(fcst_timegpt.rename(columns={'TimeGPT': 'TimeGPT_zero_shot'}), on=['unique_id', 'ds'])
.merge(fcst_timegpt_ft.rename(columns={'TimeGPT': 'TimeGPT_finetuned'}), on=['unique_id', 'ds'])
.merge(fcst_timegpt_2.rename(columns={'TimeGPT': 'TimeGPT_2'}), on=['unique_id', 'ds']),
train_df=df_train,
metrics=metrics,
agg_fn='mean',
).set_index('metric')
metrics_tgpt
The progression shows dramatic improvements:
- Zero-shot baseline: TimeGPT-1 achieves MASE of 1.12
- Fine-tuning benefit: Reduces error to 0.83 (26% improvement)
- TimeGPT-2 breakthrough: Achieves 0.43 (48% better than fine-tuned)
- No tuning required: TimeGPT-2 delivers superior accuracy out-of-the-box
Final Model Comparison
Compare all approaches including statistical models and TimeGPT-2:
plot_metric_bar_multi(dfs=[metrics_base, metrics_sf_best, metrics_tgpt], metric='mase')
The results show both accuracy and computational performance:
Key findings:
- TimeGPT-2 dominates: Best accuracy (MASE 0.43) while being 72x faster than StatsForecast
- StatsForecast trades time for interpretability: Strong results (MASE 0.71) with transparent statistical models
- Fine-tuning is efficient: Improves TimeGPT-1 from 1.12 to 0.83 with only 0.5 seconds overhead
- Model choice matters: Proper selection reduces errors by 18x compared to Naive baseline
Conclusion
This article demonstrated two powerful approaches to automatic forecasting:
Choose TimeGPT when you:
- Need instant forecasts with minimal setup
- Want to leverage foundation model intelligence for complex patterns
- Prefer API-based forecasting without local infrastructure
Choose StatsForecast when you:
- Prefer interpretable statistical methods
- Need full control over model selection
- Want to run everything locally with transparent optimization
For SQL-native forecasting in Snowflake, see TimeGPT in Snowflake to generate forecasts directly in your data warehouse without Python or ML infrastructure.
For automated feature engineering and gradient boosting methods, see our MLforecast guide which complements the statistical modeling approaches covered in this article.