TimeGPT vs Databricks AutoML Benchmark: Energy, Weather, Retail

Classical methods such as AutoARIMA have powered production forecasting systems for decades, and Prophet became a common production workhorse after its release. Tools like Databricks AutoML have made these methods even more accessible through automatic model selection, tuned configurations, and reproducible notebooks.

In recent years, however, foundation models have changed the baseline. TimeGPT-2.1 requires no training, no feature engineering, and no hyperparameter search, yet in this benchmark it outperformed a tuned AutoML pipeline on the main aggregate error metrics in all three domains. We ran the numbers to see by exactly how much.

What We Tested

Databricks AutoML searches over AutoARIMA and Prophet configurations, selects the best trial, and generates a reproducible training notebook. In these runs, the selected AutoML trial was Prophet.

We benchmarked it against TimeGPT-2.1 on three public datasets spanning different forecasting domains, evaluating both methods on identical rolling forecast windows with the same held-out data.

Dataset	Domain	Frequency	Series	Horizon	Windows
PJM electricity load	Energy	Hourly	30	24h	7
NOAA ISD weather	Weather	Hourly	491	24h	7
M5 Walmart sales	Retail	Daily	500	28 days	1
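
As a rough illustration of the rolling-window setup in the table above, the sketch below generates train/test cut points for one series. The non-overlapping layout, with the last window ending at the end of the series, and the series lengths are assumptions for illustration; the benchmark does not state its step size or history length.

```python
def rolling_windows(n_obs, horizon, n_windows):
    """Return (train_end, test_end) index pairs, oldest window first.

    Each model sees observations [0, train_end) and is scored on
    [train_end, test_end). Windows do not overlap (assumed layout).
    """
    if n_obs < horizon * n_windows + 1:
        raise ValueError("series too short for the requested windows")
    windows = []
    for k in range(n_windows, 0, -1):
        test_end = n_obs - (k - 1) * horizon
        train_end = test_end - horizon
        windows.append((train_end, test_end))
    return windows


# Hourly setup as in the PJM and NOAA rows: 24-step horizon, 7 windows.
# The 60-day series length is a made-up example value.
hourly_windows = rolling_windows(n_obs=24 * 60, horizon=24, n_windows=7)

# Daily setup as in the M5 row: 28-step horizon, 1 window.
# The 1000-day series length is a made-up example value.
daily_windows = rolling_windows(n_obs=1000, horizon=28, n_windows=1)
```

Scoring both models on the same cut points is what makes the comparison like-for-like: neither model ever sees the held-out block it is evaluated on.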

Energy: PJM Electricity Load

Electricity demand is a textbook use case for classical forecasting: strong seasonality, relatively stable patterns, limited noise.

Accuracy:

Model	MAE	RMSE	MAPE	WAPE	Bias
AutoML (Prophet)	576.5	1,708.9	14.39%	10.65%	−462.0
TimeGPT-2.1	122.0	313.5	4.24%	2.25%	36.3

TimeGPT reduces MAPE from 14.39% to 4.24%, a relative reduction of about 70%. The AutoML bias of −462.0 also points to systematic under-forecasting across the series.
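
The article does not spell out its metric formulas, so the sketch below uses common textbook definitions as an assumption. In particular, bias is taken as mean(forecast − actual), which matches reading a negative value as under-forecasting, and MAPE skips zero actuals, which is one of several possible conventions. The sample values are made up.

```python
import math


def forecast_metrics(actual, forecast):
    """Compute MAE, RMSE, MAPE, WAPE and bias for paired observations."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [f - a for a, f in zip(actual, forecast)]
    abs_errors = [abs(e) for e in errors]
    n = len(errors)
    # MAPE is undefined where the actual is zero; those points are skipped
    # here (an assumed convention).
    pct_errors = [abs(e) / abs(a) for e, a in zip(errors, actual) if a != 0]
    total_actual = sum(abs(a) for a in actual)
    return {
        "MAE": sum(abs_errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAPE": 100 * sum(pct_errors) / len(pct_errors) if pct_errors else float("nan"),
        "WAPE": 100 * sum(abs_errors) / total_actual if total_actual else float("nan"),
        "Bias": sum(errors) / n,  # negative means the model under-forecasts
    }


# Made-up values for illustration only.
actual = [100.0, 200.0, 400.0]
forecast = [90.0, 180.0, 420.0]
metrics = forecast_metrics(actual, forecast)
```

WAPE weights errors by total volume, so it is less sensitive than MAPE to small actual values. That is one reason the two metrics can differ noticeably on the same forecasts.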

Per-series win rates:

Metric	AutoML Win Rate	TimeGPT Win Rate
MAE	3.3%	96.7%
RMSE	3.3%	96.7%
MAPE	3.3%	96.7%
WAPE	3.3%	96.7%
AbsBias	13.3%	86.7%

TimeGPT wins on 29 of 30 series for MAE, RMSE, MAPE, and WAPE, and on 26 of 30 series for absolute bias.
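
A per-series win rate of this kind can be computed roughly as follows. This is a sketch under assumptions: a model "wins" a series when its error is strictly lower, and ties count for neither model. The error values are synthetic, chosen only to reproduce a 29-of-30 split.

```python
def win_rates(errors_a, errors_b):
    """Return (pct of series won by A, pct won by B); lower error wins."""
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(errors_a)
    wins_a = sum(1 for a, b in zip(errors_a, errors_b) if a < b)
    wins_b = sum(1 for a, b in zip(errors_a, errors_b) if b < a)
    return 100 * wins_a / n, 100 * wins_b / n


# Synthetic per-series MAE values: model B is better on 29 of 30 series.
automl_mae = [5.0] * 30
timegpt_mae = [2.0] * 29 + [6.0]
automl_rate, timegpt_rate = win_rates(automl_mae, timegpt_mae)
```

With these inputs the rates round to 3.3% and 96.7%, the same split that 1 of 30 and 29 of 30 series produce in the table.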

Runtime: AutoML took 13 minutes to train, plus 12 seconds for inference. TimeGPT took 20 seconds total.

Weather: NOAA ISD Temperature

The NOAA Integrated Surface Database contains hourly temperature readings from weather stations worldwide. With 491 independent series spanning different climates and observation patterns, this dataset tests how well each method generalizes at scale.

Accuracy:

Model	MAE	RMSE	MAPE	WAPE	Bias
AutoML (Prophet)	2.721	3.600	19.15%	15.00%	1.419
TimeGPT-2.1	1.333	1.887	8.69%	7.35%	0.334

TimeGPT cuts MAE and RMSE roughly in half (2.721 to 1.333 and 3.600 to 1.887) and more than halves MAPE, from 19.15% to 8.69%. Across 491 series, these aggregate differences are large.

Per-series win rates:

Metric	AutoML Win Rate	TimeGPT Win Rate
MAE	4.9%	95.1%
RMSE	7.9%	92.1%
MAPE	3.9%	96.1%
WAPE	4.9%	95.1%
AbsBias	11.0%	89.0%

Runtime: AutoML required 2 hours 46 minutes to train across 491 series, plus 3 minutes of inference, nearly 3 hours end-to-end. TimeGPT required 37 seconds.

Retail: M5 Walmart Sales

The M5 Forecasting competition dataset is the hardest test in this benchmark. Daily retail demand is sparse, intermittent, and volatile, the kind of signal that resists clean seasonal decomposition. We selected 500 item-store series from the CA3 store group, forecasting 28 days ahead.

Accuracy:

Model	MAE	RMSE	MAPE	WAPE	Bias
AutoML (Prophet)	3.470	7.125	69.28%	54.83%	−0.292
TimeGPT-2.1	3.179	6.621	58.30%	50.23%	−0.995

The margin is smaller here because retail demand is genuinely difficult for any model. TimeGPT still wins on the main aggregate error metrics (MAE, RMSE, MAPE, and WAPE), though AutoML has the smaller aggregate bias magnitude (−0.292 versus −0.995).

Per-series win rates:

Metric	AutoML Win Rate	TimeGPT Win Rate
MAE	32.0%	68.0%
RMSE	40.6%	59.4%
MAPE	25.4%	74.6%
WAPE	32.0%	68.0%
AbsBias	44.2%	55.8%

Even on the most challenging dataset, TimeGPT wins on more than two-thirds of series by MAE and WAPE.

Runtime: AutoML took over 5 hours to train, plus 50 seconds for inference. TimeGPT took 37 seconds. At that scale, AutoML's training time is not just a cost; it is a blocker. A 5-hour training run rules out same-day iteration, live reforecasting, and rapid experimentation.

The Full Picture

Dataset	AutoML Training	AutoML Inference	TimeGPT Inference	WAPE (AutoML → TimeGPT)
PJM electricity (30 series)	13 min	12 s	20 s	10.65% → 2.25%
NOAA weather (491 series)	2 h 46 min	3 min	37 s	15.00% → 7.35%
M5 retail (500 series)	5+ h	50 s	37 s	54.83% → 50.23%

TimeGPT has no training column because there is no training step.
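
The figures in the table above can be turned into relative WAPE reductions and end-to-end speed-ups with a few lines of arithmetic. The sketch below uses only the numbers reported in this article. The M5 training time is given as "5+ hours", so the 5-hour value used here is a lower bound and the resulting speed-up is a minimum.

```python
# Runtimes in seconds and WAPE in percent, taken from the summary table.
results = {
    "PJM electricity": {
        "automl_s": 13 * 60 + 12,               # 13 min training + 12 s inference
        "timegpt_s": 20,
        "wape": (10.65, 2.25),
    },
    "NOAA weather": {
        "automl_s": (2 * 60 + 46) * 60 + 3 * 60,  # 2 h 46 min + 3 min
        "timegpt_s": 37,
        "wape": (15.00, 7.35),
    },
    "M5 retail": {
        "automl_s": 5 * 3600 + 50,              # lower bound: "5+ hours" + 50 s
        "timegpt_s": 37,
        "wape": (54.83, 50.23),
    },
}


def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100 * (before - after) / before


summary = {
    name: {
        "wape_reduction_pct": relative_reduction(*row["wape"]),
        "speedup": row["automl_s"] / row["timegpt_s"],
    }
    for name, row in results.items()
}

for name, row in summary.items():
    print(f"{name}: WAPE down {row['wape_reduction_pct']:.1f}%, "
          f"about {row['speedup']:.0f}x faster end-to-end")
```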

Across all three domains, the pattern is the same: TimeGPT improves the main aggregate error metrics (MAE, RMSE, MAPE, and WAPE) and is much faster to deploy. The accuracy gap is largest where the data is cleanest and most structured (energy, weather) and narrower on noisy, intermittent retail data, where adding fine-tuning steps may improve results further.

Why This Happens

AutoML's ceiling is defined by its candidate models. When the search space is AutoARIMA and Prophet, the best possible outcome is a well-tuned classical model. Foundation models like TimeGPT-2.1 have been pretrained on a large, diverse corpus of real-world time series, bringing that learned prior to every new dataset, zero-shot.

The result is a model that already understands seasonality, trend, and temporal dynamics before it sees a single row of your data.
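
In practice, "zero-shot" means a forecast is a single request with no local fitting step. The sketch below is not the code used in this benchmark. It assumes a client object that behaves like `NixtlaClient` from the `nixtla` Python package, and a long-format DataFrame with `unique_id`, `ds`, and `y` columns. The wrapper name `zero_shot_forecast` is hypothetical, and the model identifier that selects TimeGPT-2.1 is not given in this article, so the client's default model is left in place.

```python
def zero_shot_forecast(client, df, horizon, freq, finetune_steps=0):
    """Request forecasts for every series in `df` without local training.

    `client` is assumed to expose a `forecast` method like
    nixtla.NixtlaClient. `finetune_steps=0` keeps the call zero-shot;
    a positive value asks the service to fine-tune on `df` first.
    """
    return client.forecast(
        df=df,
        h=horizon,
        freq=freq,
        id_col="unique_id",   # one value per series
        time_col="ds",        # timestamps
        target_col="y",       # values to forecast
        finetune_steps=finetune_steps,
    )


# Usage sketch (needs the `nixtla` package and a valid API key):
#   from nixtla import NixtlaClient
#   client = NixtlaClient(api_key="YOUR_API_KEY")  # placeholder key
#   forecasts = zero_shot_forecast(client, df, horizon=28, freq="D")
```

Passing the client in as an argument keeps the wrapper easy to exercise with a stub, and makes the optional fine-tuning step an explicit choice rather than a hidden default.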

If you're still choosing between Prophet and AutoARIMA as your starting point, these results suggest there is now a faster and more accurate enterprise solution available.

Get started with TimeGPT →