Imagine you have multiple time series to forecast. Which model should you use for each one? ARIMA? ETS? Theta? Or just a simple Naive baseline?
Using one model for all series is easy but sacrifices accuracy, since each series follows different patterns. However, manually testing different models for each series creates problems:
- Time-consuming: Hours spent fitting and comparing individual models
- Expertise required: Each algorithm needs different parameter configurations
- Inconsistent evaluation: Different validation approaches across models
- Doesn't scale: Manual testing for hundreds or thousands of series is impractical
StatsForecast automates this process by fitting multiple statistical models simultaneously, then using cross-validation to select the best performer for each time series.
This article demonstrates how to use StatsForecast's automatic model selection with the M4 hourly competition data, then compares it against TimeGPT, Nixtla's foundation model.
The source code of this article can be found in the interactive Jupyter notebook.
Introduction to StatsForecast
StatsForecast is an open-source Python library that automates statistical forecasting through:
- Automatic model selection: AutoARIMA, AutoETS, AutoCES, and AutoTheta optimize parameters automatically
- Speed: 20x faster than pmdarima, leveraging Numba for performance
- Scale: Handles hundreds or thousands of series with parallel processing
The library follows scikit-learn's familiar .fit() and .predict() API pattern.
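As a quick illustration of that pattern, here is a minimal sketch; it assumes a long-format DataFrame df with unique_id, ds, and y columns, the layout used throughout this article:
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA
# df is a placeholder: a long-format DataFrame with columns unique_id, ds, y
sf = StatsForecast(models=[AutoARIMA(season_length=24)], freq='H')
sf.fit(df)                # fit the model to every series
preds = sf.predict(h=48)  # forecast the next 48 steps per series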
To install StatsForecast, run:
pip install statsforecast utilsforecast
Additional dependencies for this tutorial:
pip install pandas numpy
Setup - M4 Hourly Competition Data
Import the necessary libraries:
import numpy as np
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS, AutoCES, AutoTheta, Naive, SeasonalNaive
from utilsforecast.plotting import plot_series
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mae, rmse, smape, mase
Load the M4 hourly benchmark dataset, which contains 414 hourly time series, each with 700 to 960 observations:
# Load M4 hourly competition data
Y_train_df = pd.read_csv('https://auto-arima-results.s3.amazonaws.com/M4-Hourly.csv')
Y_test_df = pd.read_csv('https://auto-arima-results.s3.amazonaws.com/M4-Hourly-test.csv')
# Convert hour indices to datetime
Y_train_df['ds'] = pd.to_datetime('2024-01-01') + pd.to_timedelta(Y_train_df['ds'], unit='h')
Y_test_df['ds'] = pd.to_datetime('2024-01-01') + pd.to_timedelta(Y_test_df['ds'], unit='h')
Sample 8 series for demonstration:
# Randomly select 8 series
n_series = 8
uids = Y_train_df['unique_id'].drop_duplicates().sample(n_series, random_state=23).values
df_train = Y_train_df.query('unique_id in @uids')
df_test = Y_test_df.query('unique_id in @uids')
print(f"Training observations: {len(df_train)}")
print(f"Test observations: {len(df_test)}")
Training observations: 6640
Test observations: 384
Define evaluation metrics with hourly seasonality:
- MASE (Mean Absolute Scaled Error): Scaled error vs seasonal baseline (< 1.0 = beats baseline, = 1.0 = matches baseline, > 1.0 = worse than baseline)
- RMSE (Root Mean Squared Error): Measures the magnitude of prediction errors
- SMAPE (Symmetric Mean Absolute Percentage Error): Calculates percentage-based accuracy
# Define error metrics with 24-hour seasonality
from functools import partial
hourly_mase = partial(mase, seasonality=24)
metrics = [hourly_mase, rmse, smape]
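To make the MASE scaling concrete, here is a hand-rolled version of the metric for a single series; this is an illustrative sketch of the standard formula, not the utilsforecast implementation:
import numpy as np
def mase_manual(y_test, y_pred, y_train, m=24):
    # Numerator: mean absolute error of the forecast on the test window
    mae_forecast = np.mean(np.abs(y_test - y_pred))
    # Denominator: in-sample MAE of a seasonal naive forecast (shift by m periods)
    mae_seasonal_naive = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return mae_forecast / mae_seasonal_naive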
Visualize the selected time series:
# Plot the selected series
plot_series(df_train, df_test.rename(columns={"y": "actual"}), max_ids=4)
Each series shows different patterns. Some have strong daily cycles, others trend up or down over time, and some are quite volatile. This variety is why selecting the right model for each series matters.
Baseline Models - Naive and SeasonalNaive
Before diving into complex models, start with simple baselines:
- Naive: Uses the last observed value as the forecast
- SeasonalNaive: Captures seasonal patterns by repeating values from the previous cycle (24 hours ago for hourly data)
These baselines provide a performance floor. Any sophisticated model should beat these simple approaches.
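Conceptually, both baselines fit in a few lines of NumPy. The toy sketch below is illustrative only (not StatsForecast internals) and shows how each one builds a 48-hour forecast:
import numpy as np
y = np.arange(100, 196, dtype=float)   # toy hourly history (96 observations)
season_length, horizon = 24, 48
naive_fcst = np.full(horizon, y[-1])                   # Naive: repeat the last observed value
last_cycle = y[-season_length:]                        # most recent 24 hours
seasonal_naive_fcst = np.resize(last_cycle, horizon)   # SeasonalNaive: tile the last daily cycle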
# Configure baseline models
sf_base = StatsForecast(
    models=[Naive(), SeasonalNaive(season_length=24)],
    freq='H',
    n_jobs=-1
)
# Generate 48-hour forecasts
fcst_base = sf_base.forecast(df=df_train, h=48)
# Merge with test data for evaluation
eval_base = df_test.merge(fcst_base, on=['unique_id', 'ds'])
Visualize baseline predictions:
# Plot baseline forecasts
plot_series(df_train, eval_base, max_ids=4, max_insample_length=5*24)
The plot shows 5 days of historical data followed by 48-hour forecasts. The cyan line shows SeasonalNaive following the daily rhythm, while the pink Naive line stays flat at the last value.
Evaluate baseline performance:
# Calculate metrics for baseline models
metrics_base = evaluate(
    df=eval_base,
    train_df=df_train,
    metrics=metrics,
    agg_fn='mean',
).set_index('metric')
metrics_base
| metric | Naive | SeasonalNaive |
| --- | --- | --- |
| mase | 8.029174 | 0.993421 |
| rmse | 179.520049 | 66.529088 |
| smape | 0.252074 | 0.065754 |
The SeasonalNaive model performs significantly better than Naive, achieving MASE close to 1.0. This suggests strong seasonal patterns in the hourly data that repeating values from 24 hours ago captures effectively.
Advanced Statistical Models
Let's move beyond baselines with more sophisticated models:
- AutoARIMA: Handles autocorrelation, trends, and seasonality
- AutoETS: Auto-selects exponential smoothing components
- AutoCES: Offers flexible cyclical pattern modeling
- AutoTheta: Fast, robust forecasting that often wins competitions
These automatic models eliminate manual parameter tuning: each one searches over candidate configurations and keeps the best-fitting specification for every series.
Configure the automatic models:
# Define automatic statistical models
models = [
    AutoARIMA(season_length=24),
    AutoETS(season_length=24),
    AutoCES(season_length=24),
    AutoTheta(season_length=24)
]
Fit all models and generate forecasts in one step:
# Initialize StatsForecast with all models
sf = StatsForecast(
    models=models,
    freq='H',
    n_jobs=-1
)
# Fit and forecast with 90% prediction intervals
fcst_sf_models = sf.forecast(df=df_train, h=48, level=[90])
# Merge with test data
eval_sf_models = df_test.merge(fcst_sf_models, on=['unique_id', 'ds'])
The forecast() method automatically fits each model to every series, optimizes parameters, and generates predictions with uncertainty intervals in a single function call.
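The returned DataFrame contains one point-forecast column per model plus lower and upper bound columns for each requested level, following Nixtla's naming convention (for example AutoARIMA-lo-90 and AutoARIMA-hi-90). A quick way to confirm the layout:
# Inspect the forecast columns produced by each model
print(fcst_sf_models.columns.tolist())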
Visualize predictions from all models:
# Plot forecasts from all automatic models
plot_series(df_train, eval_sf_models, max_ids=4, max_insample_length=5*24)
Unlike the simple baselines, these models adapt to the data's complexity and follow the actual patterns much more closely.
Evaluate performance across all models:
# Calculate metrics for automatic models
metrics_sf_models = evaluate(
    df=eval_sf_models,
    metrics=metrics,
    train_df=df_train,
    agg_fn='mean',
).set_index('metric')
metrics_sf_models
| metric | AutoARIMA | AutoETS | CES | AutoTheta |
| --- | --- | --- | --- | --- |
| mase | 0.803407 | 1.331669 | 0.729921 | 1.868366 |
| rmse | 71.456734 | 122.784231 | 60.979897 | 65.105242 |
| smape | 0.063307 | 0.075775 | 0.079244 | 0.076261 |
AutoCES achieves the lowest MASE (0.73) and RMSE (60.98), outperforming both AutoARIMA and baseline models. All automatic models beat the Naive baseline, demonstrating that automated parameter optimization works effectively.
Compare all models visually:
# Compare baseline and automatic models
from utils import plot_metric_bar_multi
plot_metric_bar_multi(dfs=[metrics_sf_models, metrics_base], metric='mase')
MASE Comparison Across Models
This bar chart shows MASE (Mean Absolute Scaled Error), where values below 1.0 beat the baseline and values above 1.0 perform worse:
- Naive (MASE 8.03): Performs worst, far worse than the seasonal baseline
- AutoTheta (MASE 1.87) and AutoETS (MASE 1.33): Above 1.0, don't beat SeasonalNaive
- AutoARIMA (MASE 0.80) and AutoCES (MASE 0.73): Below 1.0, successfully outperform the baseline
Cross-Validation for Model Selection
While AutoCES performs best on average, different models might work better for individual series. Cross-validation helps identify the optimal model for each time series.
Traditional cross-validation uses random splits, which don't work for time series: shuffling would let the model peek into the future and create data leakage. StatsForecast avoids this by ensuring models train only on historical data, never on future values.
Time series cross-validation uses a rolling window approach (a small sketch of the window layout follows this list):
- Train all models on initial window (yellow in the visualization below)
- Forecast the next period (red region) and evaluate forecast accuracy
- Slide the window forward in time and add new data to training set
- Repeat: train, forecast, evaluate
- Compare models across all windows and select the best for each series
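For the configuration used below (h=24, step_size=24, n_windows=2), the final window ends at the last training timestamp, so the two cutoffs sit 48 and 24 hours before the end of each series. A rough sketch of that layout, assuming this default alignment:
import pandas as pd
last_ds = df_train.groupby('unique_id')['ds'].max()
cutoff_1 = last_ds - pd.Timedelta(hours=48)  # window 1 trains up to here, then forecasts the next 24 hours
cutoff_2 = last_ds - pd.Timedelta(hours=24)  # window 2 trains up to here, then forecasts the final 24 hours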
Run cross-validation with rolling windows:
# Cross-validation with 2 rolling windows
cv_df = sf.cross_validation(
    df=df_train,
    h=24,          # Forecast next 24 hours
    step_size=24,  # Step forward 24 hours
    n_windows=2    # 2 evaluation windows
)
print(f"Cross-validation results: {cv_df.shape}")
cv_df.head()
Cross-validation results: (384, 11)
| unique_id | ds | cutoff | y | AutoARIMA | AutoETS | CES | AutoTheta |
| --- | --- | --- | --- | --- | --- | --- | --- |
| H10 | 2024-01-29 05:00:00 | 2024-01-29 04:00:00 | 14502.0 | 14478.23 | 14512.45 | 14489.12 | 14501.67 |
| H10 | 2024-01-29 06:00:00 | 2024-01-29 04:00:00 | 14547.0 | 14523.78 | 14558.19 | 14534.87 | 14547.42 |
| H10 | 2024-01-29 07:00:00 | 2024-01-29 04:00:00 | 14595.0 | 14571.34 | 14606.01 | 14582.63 | 14595.18 |
Evaluate model performance and select the best for each series:
# Evaluate models using MAE across cross-validation windows
from utils import evaluate_cv, get_best_model_forecast
evaluation_df = evaluate_cv(cv_df, mae)
# Count how many times each model was selected
evaluation_df['best_statsforecast_model'].value_counts().to_frame().reset_index()
| best_statsforecast_model | count |
| --- | --- |
| AutoARIMA | 3 |
| AutoETS | 2 |
| AutoTheta | 2 |
| CES | 1 |
AutoARIMA was selected as the best model for 3 out of 8 series, while AutoETS and AutoTheta each won 2 series. This demonstrates that different models excel on different time series patterns.
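Both evaluate_cv and get_best_model_forecast come from the notebook's local utils module rather than from StatsForecast itself. Here is a hypothetical sketch of what such helpers could look like; the actual implementation in the notebook may differ:
import pandas as pd
from utilsforecast.evaluation import evaluate
def evaluate_cv(cv_df, metric):
    # Score every model per series across all CV windows with the given loss,
    # then record which model achieved the lowest error for each series
    eval_df = evaluate(cv_df.drop(columns='cutoff'), metrics=[metric])
    model_cols = eval_df.columns.drop(['unique_id', 'metric'])
    eval_df['best_statsforecast_model'] = eval_df[model_cols].idxmin(axis=1)
    return eval_df
def get_best_model_forecast(fcst_df, evaluation_df):
    # For each series, keep only the winning model's columns and rename them
    # to a common 'best_model' prefix so they can be plotted and evaluated together
    best = evaluation_df.set_index('unique_id')['best_statsforecast_model']
    parts = []
    for uid, grp in fcst_df.groupby('unique_id'):
        model = best.loc[uid]
        cols = ['unique_id', 'ds'] + [c for c in grp.columns if c.startswith(model)]
        parts.append(grp[cols].rename(columns=lambda c: c.replace(model, 'best_model')))
    return pd.concat(parts, ignore_index=True)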
Best Model Forecasts with Prediction Intervals
After selecting the best model for each series, let's generate final forecasts with uncertainty intervals:
# Extract forecasts from the best model for each series
best_fcst_sf = get_best_model_forecast(fcst_sf_models, evaluation_df)
eval_best_sf = df_test.merge(best_fcst_sf, on=['unique_id', 'ds'])
# Plot forecasts with 90% prediction intervals
plot_series(df_train, eval_best_sf, level=[90], max_insample_length=5*24, max_ids=4)
Forecasts with Prediction Intervals
These are the forecasts from the best model for each series. The shaded bands represent 90% prediction intervals:
- Tighter bands: More confidence in the forecast
- Wider bands: More uncertainty in the forecast
Calculate metrics for the best model selection approach:
# Calculate metrics for best model selection
metrics_sf_best = evaluate(
    df=eval_best_sf,
    train_df=df_train,
    metrics=metrics,
    agg_fn='mean',
).set_index('metric')
Compare baseline, individual models, and best model selection:
# Compare baseline, individual models, and best selection
plot_metric_bar_multi(dfs=[metrics_sf_models, metrics_base, metrics_sf_best], metric='mase')
MASE Comparison Across Model Groups
The green bar, which represents the best model selection, has the lowest MASE. By selecting the best model for each series, we get stronger performance than applying a single model to every time series.
TimeGPT - Foundation Model for Time Series
Now let's try TimeGPT, Nixtla's foundation model for time series forecasting. It's pre-trained on millions of diverse series and requires minimal tuning.
Why TimeGPT?
- Strong out-of-the-box accuracy with minimal tuning
- Handles trend/seasonality/holidays automatically
- Scales to many series with simple APIs
Setup TimeGPT Client
First, install the Nixtla package:
pip install nixtla
Initialize the TimeGPT client using an API key:
# Import necessary packages
import os
from dotenv import load_dotenv
from nixtla import NixtlaClient
# Load environment variables from .env
load_dotenv()
api_key = os.getenv("NIXTLA_API_KEY")
# Initialize the client
nixtla_client = NixtlaClient(api_key=api_key)
This connects to Nixtla's cloud service where the foundation model runs. The setup is simple: just import and authenticate.
Zero-Shot Forecast with TimeGPT
Zero-shot forecasting means using the pre-trained model directly without any training on your specific data. This approach saves time by eliminating the training step while still leveraging patterns learned from millions of diverse time series.
The code below generates 48-hour forecasts with 80% and 90% prediction intervals in a single API call:
# Simple zero-shot TimeGPT forecast
fcst_timegpt = nixtla_client.forecast(
    df=df_train,
    h=48,        # forecast horizon (next 48 hours)
    freq='H',    # hourly frequency
    level=[80, 90]
)
fcst_timegpt.head()
|   | unique_id | ds | TimeGPT | TimeGPT-hi-80 | TimeGPT-hi-90 | TimeGPT-lo-80 | TimeGPT-lo-90 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | H165 | 2024-01-30 05:00:00 | 20.847889 | 26.581411 | 27.379568 | 15.114367 | 14.316211 |
| 1 | H165 | 2024-01-30 06:00:00 | 25.407340 | 34.487167 | 34.707325 | 16.327513 | 16.107355 |
| 2 | H165 | 2024-01-30 07:00:00 | 49.702620 | 75.707430 | 79.375336 | 23.697813 | 20.029905 |
| 3 | H165 | 2024-01-30 08:00:00 | 134.175630 | 152.434750 | 153.382050 | 115.916504 | 114.969210 |
| 4 | H165 | 2024-01-30 09:00:00 | 366.113680 | 379.054170 | 381.407230 | 353.173200 | 350.820130 |
Fine-Tuned Forecast with TimeGPT
Fine-tuning adapts the pre-trained model to your specific data patterns by running additional training steps on your dataset. This often improves accuracy for domain-specific patterns that differ from the general patterns TimeGPT learned during pre-training.
To fine-tune TimeGPT, simply add the finetune_steps parameter. Here we use 10 finetune steps:
# Add finetune steps to make it more accurate
fcst_timegpt_ft = nixtla_client.forecast(
    df=df_train,
    h=48,
    freq='H',
    level=[80, 90],
    finetune_steps=10
)
TimeGPT-2 - The Latest Foundation Model
TimeGPT-2, Nixtla's latest foundation model, brings enhanced forecasting accuracy and performance improvements. The model is currently available by invitation only.
To use TimeGPT-2, initialize the client with the preview API endpoint:
# Initialize client with TimeGPT-2 credentials
nixtla_client = NixtlaClient(
    api_key=api_key,
    base_url='https://api-preview.nixtla.io'
)
To use TimeGPT-2, simply add the model='timegpt-2' parameter to the API call:
# Forecast with TimeGPT-2
fcst_timegpt_2 = nixtla_client.forecast(
    df=df_train,
    h=48,
    freq='H',
    level=[80, 90],
    model='timegpt-2'
)
# Merge with test data
eval_tgpt_2 = df_test.merge(fcst_timegpt_2, on=['unique_id', 'ds'])
Visualize TimeGPT-2 predictions:
# Plot TimeGPT-2 forecasts
fig = nixtla_client.plot(
    df_train,
    eval_tgpt_2,
    level=[80, 90],
    max_insample_length=5*24,
    max_ids=4
)
TimeGPT-2 Forecasts with Prediction Intervals
Compare the performance of the different TimeGPT variants:
# Evaluate all three TimeGPT variants
# Keep only each variant's point forecast to avoid interval-column name clashes
metrics_tgpt = evaluate(
    df=df_test
    .merge(fcst_timegpt[['unique_id', 'ds', 'TimeGPT']].rename(columns={'TimeGPT': 'TimeGPT_zero_shot'}), on=['unique_id', 'ds'])
    .merge(fcst_timegpt_ft[['unique_id', 'ds', 'TimeGPT']].rename(columns={'TimeGPT': 'TimeGPT_finetuned'}), on=['unique_id', 'ds'])
    .merge(fcst_timegpt_2[['unique_id', 'ds', 'TimeGPT']].rename(columns={'TimeGPT': 'TimeGPT_2'}), on=['unique_id', 'ds']),
    train_df=df_train,
    metrics=metrics,
    agg_fn='mean',
).set_index('metric')
metrics_tgpt
| metric | TimeGPT_zero_shot | TimeGPT_finetuned | TimeGPT_2 |
| --- | --- | --- | --- |
| mase | 1.119019 | 0.831670 | 0.428494 |
| rmse | 53.228989 | 53.755869 | 27.029288 |
| smape | 0.056423 | 0.064954 | 0.035599 |
The progression shows dramatic improvements:
- Zero-shot baseline: TimeGPT-1 achieves MASE of 1.12
- Fine-tuning benefit: Reduces error to 0.83 (26% improvement)
- TimeGPT-2 breakthrough: Achieves 0.43 (48% better than fine-tuned)
- No tuning required: TimeGPT-2 delivers superior accuracy out-of-the-box
Final Model Comparison
Compare all approaches including statistical models and TimeGPT-2:
# Compare all models including TimeGPT-2
plot_metric_bar_multi(dfs=[metrics_base, metrics_sf_best, metrics_tgpt], metric='mase')
MASE Comparison Across Model Groups
The results show both accuracy and computational performance:
| Model | MASE | Inference Time |
| --- | --- | --- |
| TimeGPT-2 | 0.43 | 2.5 s |
| Best StatsForecast | 0.71 | 180 s |
| TimeGPT-1 (10 finetune steps) | 0.83 | 1.7 s |
| SeasonalNaive | 0.99 | - |
| Naive | 8.03 | - |
Key findings:
- TimeGPT-2 dominates: Best accuracy (MASE 0.43) while being 72x faster than StatsForecast
- StatsForecast trades time for interpretability: Strong results (MASE 0.71) with transparent statistical models
- Fine-tuning is efficient: Improves TimeGPT-1 from 1.12 to 0.83 with only 0.5 seconds overhead
- Model choice matters: Proper selection reduces errors by 18x compared to Naive baseline
Conclusion
This article demonstrated two powerful approaches to automatic forecasting:
| Aspect | TimeGPT | StatsForecast |
| --- | --- | --- |
| Type | Foundation model | Classical statistical models |
| Speed | Fast; no training required: zero-shot or fine-tuning via API | Medium; requires fitting models locally or in batch |
| Scalability | Handles thousands of series instantly via API, no local compute needed | Scales efficiently with local compute using parallel processing (n_jobs, Ray) |
| Accuracy | High accuracy from strong generalization, especially for complex, noisy, or non-seasonal data | Accurate for structured, seasonal, or stable patterns |
Choose TimeGPT when you:
- Need instant forecasts with minimal setup
- Want to leverage foundation model intelligence for complex patterns
- Prefer API-based forecasting without local infrastructure
Choose StatsForecast when you:
- Prefer interpretable statistical methods
- Need full control over model selection
- Want to run everything locally with transparent optimization
For SQL-native forecasting in Snowflake, see TimeGPT in Snowflake to generate forecasts directly in your data warehouse without Python or ML infrastructure.
For automated feature engineering and gradient boosting methods, see our MLforecast guide which complements the statistical modeling approaches covered in this article.