Have you ever hit memory limits while working with time series data? pandas DataFrames struggle with large datasets because their eager evaluation model loads the entire dataset into memory.
Polars offers a superior alternative with lazy evaluation and memory-efficient columnar storage. TimeGPT supports Polars natively, eliminating conversion overhead between data processing and forecasting.
Introduction to TimeGPT
TimeGPT is a foundation model for time series that provides zero-shot forecasting capabilities. Unlike traditional models that require training on your specific data, TimeGPT comes pre-trained on millions of time series patterns and generates predictions instantly through API calls.
The key differentiator is its universal DataFrame support. While most forecasting libraries force you into pandas, TimeGPT works natively with:
- Pandas DataFrames for compatibility
- Polars DataFrames for speed and memory efficiency
- Spark DataFrames for distributed computing
- Any DataFrame implementing the DataFrame Interchange Protocol
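In practice, the forecast call looks the same no matter which library holds your data. Here is a minimal sketch (the client setup is covered later in this tutorial, and the tiny synthetic series below is purely illustrative):
# A tiny synthetic hourly series, only to illustrate the identical call signature
import pandas as pd
import polars as pl

pandas_df = pd.DataFrame({
    "unique_id": ["A"] * 48,
    "ds": pd.date_range("2020-01-01", periods=48, freq="h"),
    "y": [float(i) for i in range(48)],
})
polars_df = pl.from_pandas(pandas_df)

# The same method accepts either frame; nixtla_client is initialized later on
fcst_from_pandas = nixtla_client.forecast(df=pandas_df, h=24, freq='1h', time_col='ds', target_col='y')
fcst_from_polars = nixtla_client.forecast(df=polars_df, h=24, freq='1h', time_col='ds', target_col='y')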
The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to follow along!
Loading Data with Polars
Let's start by loading the M4 competition dataset, which contains over 100,000 time series, into a Polars DataFrame.
import polars as pl
import pandas as pd
import os
from dotenv import load_dotenv
from nixtla import NixtlaClient
from datasetsforecast.m4 import M4
import time
First, load a subset of the M4 hourly dataset to demonstrate the basics:
# Load M4 hourly data
m4_data = M4.load(directory='data/', group='Hourly')
train_df = m4_data[0]
Converting pandas DataFrames to Polars is straightforward with pl.from_pandas().
# Convert to Polars for better performance
train_pl = pl.from_pandas(train_df)
print(f"Dataset shape: {train_pl.shape}")
print(train_pl.head())
Output:
Dataset shape: (373372, 3)
shape: (5, 3)
┌───────────┬─────┬───────┐
│ unique_id ┆ ds ┆ y │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═══════════╪═════╪═══════╡
│ H1 ┆ 1 ┆ 605.0 │
│ H1 ┆ 2 ┆ 586.0 │
│ H1 ┆ 3 ┆ 586.0 │
│ H1 ┆ 4 ┆ 559.0 │
│ H1 ┆ 5 ┆ 511.0 │
└───────────┴─────┴───────┘
Next, select 10 time series for the initial demo:
# Select 10 time series for initial demo
sample_ids = train_pl.select("unique_id").unique().limit(10)["unique_id"].implode()
# Filter the main dataframe using the sample IDs
demo_df = train_pl.filter(pl.col("unique_id").is_in(sample_ids))
In this code:
- The select() method chooses specific columns, unique() removes duplicates, and limit() restricts the results.
- implode() converts the unique_id column to a list.
- filter() filters the dataframe using the sample IDs, while is_in() checks if values exist in the list.
Next, we'll convert the integer timestamps to datetime.
# Convert integer timestamps to datetime
base_datetime = pl.datetime(2020, 1, 1)
demo_long = demo_df.with_columns([
    (base_datetime + pl.duration(hours=pl.col("ds") - 1)).alias("ds")
])
# Keep only required columns and filter out missing values
demo_long = demo_long.select(["unique_id", "ds", "y"]).filter(pl.col("y").is_not_null())
print(demo_long.head())
Output:
shape: (5, 3)
┌───────────┬─────────────────────┬──────┐
│ unique_id ┆ ds ┆ y │
│ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ f64 │
╞═══════════╪═════════════════════╪══════╡
│ H188 ┆ 2020-01-01 00:00:00 ┆ 12.4 │
│ H188 ┆ 2020-01-01 01:00:00 ┆ 11.9 │
│ H188 ┆ 2020-01-01 02:00:00 ┆ 11.5 │
│ H188 ┆ 2020-01-01 03:00:00 ┆ 11.2 │
│ H188 ┆ 2020-01-01 04:00:00 ┆ 11.0 │
└───────────┴─────────────────────┴──────┘
In this code:
- pl.datetime() creates a base datetime and pl.duration() calculates time offsets from the integer values.
- pl.col("ds") - 1 converts the 1-indexed timestamps to 0-indexed for proper hour calculation.
- select() keeps only the required columns and is_not_null() filters out missing values.
Performance at Scale: Polars vs Pandas
Before diving into forecasting, let's demonstrate why Polars matters for larger datasets. We'll compare performance between pandas and Polars using the full M4 hourly dataset we already loaded.
Start by creating a timing decorator:
# Create timing decorator for accurate performance measurement
def time_it(n_runs=10):
    """Decorator that runs a function n_runs times and returns the average time."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            times = []
            result = None
            for _ in range(n_runs):
                start = time.time()
                result = func(*args, **kwargs)
                times.append(time.time() - start)
            avg_time = sum(times) / len(times)
            return result, avg_time
        return wrapper
    return decorator
Apply the decorator to both aggregation functions so that each runs 100 times:
# Define operations as decorated functions
@time_it(n_runs=100)
def pandas_aggregation(df, ids):
    return (
        df[df["unique_id"].isin(ids)]
        .groupby("unique_id")["y"]
        .agg(["count", "mean", "std"])
    )

@time_it(n_runs=100)
def polars_aggregation(df, ids):
    return df.filter(pl.col("unique_id").is_in(ids)).group_by("unique_id").agg([
        pl.col("y").count().alias("count"),
        pl.col("y").mean().alias("mean"),
        pl.col("y").std().alias("std"),
    ])
Compare performance between pandas and Polars with some sample IDs:
# Compare performance with accurate timing
sample_ids = ["H1", "H2", "H3", "H4", "H5"]
pandas_stats, pandas_time = pandas_aggregation(train_df, sample_ids)
polars_stats, polars_time = polars_aggregation(train_pl, sample_ids)
print(f"Pandas: {pandas_time:.4f}s | Polars: {polars_time:.4f}s | Speedup: {pandas_time / polars_time:.1f}x")
Output:
Pandas: 0.0087s | Polars: 0.0027s | Speedup: 3.2x
Polars is 3.2x faster than pandas for this operation.
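The timings above use eager execution. Polars' lazy API, mentioned in the introduction, can go further by letting the query optimizer plan the filter and aggregation together before any work happens. A minimal sketch of the same aggregation expressed lazily:
# Same aggregation as a lazy query; nothing executes until .collect()
lazy_stats = (
    train_pl.lazy()
    .filter(pl.col("unique_id").is_in(sample_ids))
    .group_by("unique_id")
    .agg([
        pl.col("y").count().alias("count"),
        pl.col("y").mean().alias("mean"),
        pl.col("y").std().alias("std"),
    ])
    .collect()
)
print(lazy_stats)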
Next, let's compare memory usage between pandas and Polars.
# Compare memory usage
import sys
pandas_memory = sys.getsizeof(train_df) / 1024 / 1024 # MB
polars_memory = train_pl.estimated_size('mb')
print(f"Pandas DataFrame: {pandas_memory:.1f} MB")
print(f"Polars DataFrame: {polars_memory:.1f} MB")
print(f"Memory savings: {((pandas_memory - polars_memory) / pandas_memory * 100):.1f}%")
Output:
Pandas DataFrame: 37.3 MB
Polars DataFrame: 7.0 MB
Memory savings: 81.1%
Polars' columnar storage delivers 81% memory savings, enabling larger time series datasets without requiring distributed computing.
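If the raw data lives on disk, lazy scanning pushes this even further: only the rows and columns a query needs are ever materialized. A sketch, assuming the series had been saved to a hypothetical Parquet file:
# Hypothetical file path; scan_parquet builds a lazy plan instead of reading eagerly
lazy_m4 = pl.scan_parquet("data/m4_hourly.parquet")
subset = (
    lazy_m4
    .filter(pl.col("unique_id").is_in(["H1", "H2", "H3"]))
    .select(["unique_id", "ds", "y"])
    .collect()  # only the three filtered series are loaded into memory
)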
Basic Forecasting
Now let's generate our first forecast using TimeGPT with a Polars DataFrame.
First, we need to initialize the TimeGPT client:
# Load the API key from the .env file
load_dotenv()
# Initialize the TimeGPT client
nixtla_client = NixtlaClient(
    api_key=os.environ['NIXTLA_API_KEY']
)
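Before sending any data, you can optionally confirm the key was picked up correctly. Recent versions of the nixtla SDK expose a small validation helper for this (treat the exact call as an assumption if your version differs):
# Should print True if the API key is accepted by the TimeGPT service
print(nixtla_client.validate_api_key())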
TimeGPT's forecast() method accepts Polars DataFrames directly. The key parameters are h (the forecast horizon), freq (the data frequency), and time_col and target_col, which specify the time and target column names.
# Generate forecasts directly from Polars DataFrame
forecast_df = nixtla_client.forecast(
    df=demo_long,
    h=24,  # Forecast 24 hours ahead
    freq='1h',
    time_col='ds',
    target_col='y'
)
print(f"Generated {len(forecast_df)} forecasts for {len(forecast_df['unique_id'].unique())} series")
print(forecast_df.head())
Output:
Generated 240 forecasts for 10 series
shape: (5, 3)
┌───────────┬─────────────────────┬───────────┐
│ unique_id ┆ ds ┆ TimeGPT │
│ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ f64 │
╞═══════════╪═════════════════════╪═══════════╡
│ H123 ┆ 2020-02-01 04:00:00 ┆ 1256.7139 │
│ H123 ┆ 2020-02-01 05:00:00 ┆ 1147.082 │
│ H123 ┆ 2020-02-01 06:00:00 ┆ 1034.873 │
│ H123 ┆ 2020-02-01 07:00:00 ┆ 939.2955 │
│ H123 ┆ 2020-02-01 08:00:00 ┆ 829.92975 │
└───────────┴─────────────────────┴───────────┘
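Because the forecasts come back as a Polars DataFrame, you can keep working with the same expressions used earlier, for example to summarize the 24 predicted values per series (a quick sketch based on the TimeGPT column shown above):
# Per-series summary of the 24-step-ahead predictions
forecast_summary = forecast_df.group_by("unique_id").agg([
    pl.col("TimeGPT").mean().alias("mean_forecast"),
    pl.col("TimeGPT").min().alias("min_forecast"),
    pl.col("TimeGPT").max().alias("max_forecast"),
])
print(forecast_summary)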
Visualizing Forecasts
TimeGPT's built-in plotting functionality makes it easy to visualize both historical data and forecasts. The plot() method automatically handles Polars DataFrames and creates professional time series visualizations.
# Plot the forecast with historical data
nixtla_client.plot(
    df=demo_long,
    forecasts_df=forecast_df,
    time_col='ds',
    target_col='y',
    max_insample_length=100  # Show last 100 historical points
)
