Time Series Forecasting with Spark

Overview

Spark is an open-source distributed compute framework designed for large-scale data processing. This guide demonstrates how to use TimeGPT with Spark to perform forecasting and cross-validation across distributed clusters. Spark is ideal for enterprise environments with existing Hadoop infrastructure and datasets exceeding 100 million observations. Its robust distributed architecture handles massive-scale time series forecasting with fault tolerance and high performance.

Why Use Spark for Time Series Forecasting?

Spark offers unique advantages for enterprise-scale time series forecasting:

Enterprise-grade scalability: Handle datasets with 100M+ observations across distributed clusters
Hadoop integration: Seamlessly integrate with existing HDFS and Hadoop ecosystems
Fault tolerance: Automatic recovery from node failures ensures reliable computation
Mature ecosystem: Leverage Spark’s rich ecosystem of tools and libraries
Multi-language support: Work with Python (PySpark), Scala, or Java

Choose Spark when you have enterprise infrastructure, datasets exceeding 100 million observations, or need robust fault tolerance for mission-critical forecasting.

What you’ll learn:

Install Fugue with Spark support for distributed computing
Convert pandas DataFrames to Spark DataFrames
Run TimeGPT forecasting and cross-validation on Spark clusters

Prerequisites

Before proceeding, make sure you have an API key from Nixtla. If executing on a distributed Spark cluster, ensure the nixtla library is installed on all worker nodes for consistent execution.
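
One common way to avoid hardcoding the API key in notebooks or scripts is to read it from an environment variable. The sketch below is illustrative only: `load_api_key` is a hypothetical helper, and the variable name `NIXTLA_API_KEY` is an assumption here, so check the client documentation for the name your version expects.

```python
import os


def load_api_key(env_var="NIXTLA_API_KEY"):
    """Return the API key stored in an environment variable.

    `env_var` defaults to an assumed variable name; adjust it to match
    your own setup.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of a later authentication error.
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key


# Typical usage, once the variable is exported in the shell:
#   api_key = load_api_key()
#   nixtla_client = NixtlaClient(api_key=api_key)
```

On a cluster, the variable would also need to be available wherever the client is created.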

How to Use TimeGPT with Spark

Step 1: Install Fugue and Spark

Fugue provides a convenient interface to distribute Python code across frameworks like Spark. Install Fugue with Spark support:

pip install "fugue[spark]"

To work with TimeGPT, make sure you have the nixtla library installed as well.
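
Before going further, it can help to confirm that the required packages are importable in the current environment. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and the package list simply mirrors the libraries named in this guide.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Packages this guide relies on.
required = ["nixtla", "fugue", "pyspark"]
print("Missing packages:", missing_packages(required))
```

Running the same check on each worker node is one way to verify that the cluster environment matches the driver.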

Step 2: Load Your Data

Load the dataset into a pandas DataFrame. In this example, we use hourly electricity price data from different markets:

import pandas as pd

df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv',
    parse_dates=['ds'],
)
df.head()
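
A quick sanity check on the loaded data can catch formatting problems before anything is sent to the cluster. The sketch below assumes the long format with columns named `unique_id`, `ds`, and `y`; those names, the `check_long_format` helper, and the tiny synthetic frame are all illustrative assumptions, not part of the dataset above.

```python
import pandas as pd

# Assumed column names for series id, timestamp, and target.
REQUIRED_COLUMNS = ["unique_id", "ds", "y"]


def check_long_format(df):
    """Validate a long-format time series frame and return a small summary."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    if not pd.api.types.is_datetime64_any_dtype(df["ds"]):
        raise TypeError("'ds' must be a datetime column")
    # Count rows that repeat a (series, timestamp) pair.
    duplicates = int(df.duplicated(subset=["unique_id", "ds"]).sum())
    return {
        "n_series": int(df["unique_id"].nunique()),
        "n_rows": len(df),
        "duplicate_timestamps": duplicates,
    }


# Synthetic stand-in data with made-up values, two series of two hours each.
sample = pd.DataFrame({
    "unique_id": ["A", "A", "B", "B"],
    "ds": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00"] * 2),
    "y": [1.0, 2.0, 3.0, 4.0],
})
summary = check_long_format(sample)
print(summary)
```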

Step 3: Initialize Spark

Create a Spark session and convert your pandas DataFrame to a Spark DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark_df = spark.createDataFrame(df)
spark_df.show(5)
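
If you want more control over the session than `getOrCreate()` with defaults, options can be passed through the builder. The sketch below is one possible setup, not a recommended configuration: the application name and the `build_spark_session` helper are hypothetical, and enabling Arrow is an optional setting that can speed up pandas-to-Spark conversion.

```python
# Optional settings applied to the session builder.
SPARK_CONF = {
    # Use Apache Arrow when converting between pandas and Spark DataFrames.
    "spark.sql.execution.arrow.pyspark.enabled": "true",
}


def build_spark_session(app_name="timegpt-forecast", conf=None):
    """Create (or reuse) a Spark session with the given options."""
    # Imported lazily so this module can be loaded where pyspark is absent.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in (conf or SPARK_CONF).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage on a machine with pyspark and Java available:
#   spark = build_spark_session()
#   spark_df = spark.createDataFrame(df)
```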

Step 4: Use TimeGPT on Spark

To use TimeGPT with Spark, pass a Spark DataFrame to Nixtla’s client methods instead of a pandas DataFrame; this is the main difference from local usage. Instantiate the NixtlaClient class to interact with Nixtla’s API:

from nixtla import NixtlaClient

nixtla_client = NixtlaClient(
    api_key='my_api_key_provided_by_nixtla'
)

You can use any method from the NixtlaClient, such as forecast or cross_validation.

Forecast Example

fcst_df = nixtla_client.forecast(spark_df, h=12)
fcst_df.show(5)
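
To make the meaning of `h=12` concrete for hourly data, the sketch below lists the timestamps a 12-step horizon would cover after the last observed hour. It runs locally with pandas and does not call the API; `forecast_timestamps` is a hypothetical helper and the start timestamp is made up.

```python
import pandas as pd


def forecast_timestamps(last_observed, h, step=pd.Timedelta(hours=1)):
    """Return the h timestamps that follow the last observed one."""
    return pd.date_range(start=last_observed + step, periods=h, freq=step)


# Made-up last observation; with hourly data, h=12 spans the next 12 hours.
future = forecast_timestamps(pd.Timestamp("2024-01-01 23:00"), h=12)
print(future[0], "to", future[-1])
```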

When using Azure AI endpoints, specify model="azureai":

nixtla_client.forecast(
    spark_df,
    h=12,
    model="azureai"
)

The public API supports two models: timegpt-1 (default) and timegpt-1-long-horizon. For long-horizon forecasting, see the long-horizon model tutorial.

Cross-validation Example

cv_df = nixtla_client.cross_validation(
    spark_df,
    h=12,
    n_windows=5,
    step_size=2
)
cv_df.show(5)
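
The interaction between `h`, `n_windows`, and `step_size` can be worked out by hand. The sketch below assumes rolling-origin semantics in which the last window ends at the final observation and each earlier window is shifted back by `step_size`; that assumption and the `cv_train_sizes` helper are illustrative, so confirm the exact behavior against the cross-validation documentation.

```python
def cv_train_sizes(n_obs, h, n_windows, step_size):
    """Return the number of training observations for each window, oldest first."""
    # Observations held out across all windows combined.
    needed = h + step_size * (n_windows - 1)
    if n_obs <= needed:
        raise ValueError(f"Need more than {needed} observations, got {n_obs}")
    return [
        n_obs - h - step_size * (n_windows - 1 - i)
        for i in range(n_windows)
    ]


# A series of 100 observations with the settings used above.
sizes = cv_train_sizes(n_obs=100, h=12, n_windows=5, step_size=2)
print(sizes)  # [80, 82, 84, 86, 88]
```

Under these assumptions, each window forecasts the 12 steps after its cutoff, and the five windows together touch the last 20 observations of every series.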

Step 5: Stop Spark

After completing your tasks, stop the Spark session to free resources:

spark.stop()
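
If a forecasting step raises an exception, a bare `spark.stop()` at the end of a script is never reached. A minimal sketch of one way to guarantee cleanup is shown below; `stopping` is a hypothetical helper that works with any object exposing a `stop()` method.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield the session and always call its stop() method afterwards."""
    try:
        yield session
    finally:
        # Runs on normal exit and when the body raises.
        session.stop()


# Usage with a Spark session:
#   with stopping(SparkSession.builder.getOrCreate()) as spark:
#       spark_df = spark.createDataFrame(df)
#       fcst_df = nixtla_client.forecast(spark_df, h=12)
#       fcst_df.show(5)
```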

Working with Exogenous Variables

TimeGPT with Spark also supports exogenous variables. Refer to the Exogenous Variables Tutorial for details. Simply substitute pandas DataFrames with Spark DataFrames; the API remains identical.

Related Resources

Explore more distributed forecasting options:

Distributed Computing Overview - Compare Spark, Dask, and Ray
Dask Integration - For datasets with 10M-100M observations
Ray Integration - For ML pipeline integration
Fine-tuning TimeGPT - Improve accuracy at scale
Cross-Validation - Validate distributed forecasts
