Overview
Spark is an open-source distributed compute framework designed for large-scale data processing. This guide demonstrates how to use TimeGPT with Spark to perform forecasting and cross-validation across distributed clusters. Spark is ideal for enterprise environments with existing Hadoop infrastructure and datasets exceeding 100 million observations. Its robust distributed architecture handles massive-scale time series forecasting with fault tolerance and high performance.
Why Use Spark for Time Series Forecasting?
Spark offers unique advantages for enterprise-scale time series forecasting:
- Enterprise-grade scalability: Handle datasets with 100M+ observations across distributed clusters
- Hadoop integration: Seamlessly integrate with existing HDFS and Hadoop ecosystems
- Fault tolerance: Automatic recovery from node failures ensures reliable computation
- Mature ecosystem: Leverage Spark’s rich ecosystem of tools and libraries
- Multi-language support: Work with Python (PySpark), Scala, or Java
This guide shows how to:
- Install Fugue with Spark support for distributed computing
- Convert pandas DataFrames to Spark DataFrames
- Run TimeGPT forecasting and cross-validation on Spark clusters
Prerequisites
Before proceeding, make sure you have an API key from Nixtla. If executing on a distributed Spark cluster, ensure the nixtla library is installed on all worker nodes for consistent execution.
How to Use TimeGPT with Spark
Step 1: Install Fugue and Spark
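Fugue provides a convenient interface to distribute Python code across frameworks like Spark. Install Fugue with Spark support through the fugue[spark] extra on pip. On a distributed Spark cluster, make sure the nixtla library is installed on the workers as well.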
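One way to confirm the installation before continuing is to check that the packages can be imported. The sketch below assumes the import names fugue, pyspark, and nixtla; the helper name missing_packages is illustrative. It only inspects the machine it runs on, so worker nodes in a cluster would need the same check.

```python
import importlib.util

# Import names this guide relies on (assumed); pyspark is normally
# pulled in by the fugue[spark] extra.
REQUIRED = ("fugue", "pyspark", "nixtla")


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Typical fix, run in a shell: pip install "fugue[spark]" nixtla
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```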
Step 2: Load Your Data
Load the dataset into a pandas DataFrame. This example uses hourly electricity price data from different markets.
Step 3: Initialize Spark
Create a Spark session and convert your pandas DataFrame to a Spark DataFrame.
Step 4: Use TimeGPT on Spark
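Before calling TimeGPT, the data from Step 2 and the session from Step 3 need to be in place. A minimal sketch of that setup follows. It assumes a long-format CSV with TimeGPT's default column names (unique_id, ds, y); the function names, the application name, and the file path passed in are illustrative, not part of the original guide.

```python
REQUIRED_COLUMNS = ("unique_id", "ds", "y")  # assumed TimeGPT default column names


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


def load_prices(csv_path):
    """Step 2: read the long-format CSV into a pandas DataFrame."""
    import pandas as pd  # lazy import: needs pandas installed

    df = pd.read_csv(csv_path)
    absent = missing_columns(df.columns)
    if absent:
        raise ValueError(f"CSV is missing columns: {absent}")
    return df


def to_spark(pandas_df, app_name="timegpt-spark"):
    """Step 3: start (or reuse) a Spark session and convert the DataFrame."""
    from pyspark.sql import SparkSession  # lazy import: needs pyspark installed

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    return spark, spark.createDataFrame(pandas_df)
```

With a hypothetical file named electricity.csv, load_prices("electricity.csv") followed by to_spark(df) gives the session and the Spark DataFrame used in the rest of this step.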
To use TimeGPT with Spark, pass a Spark DataFrame to Nixtla’s client methods instead of a pandas DataFrame. That is the main difference from local usage. First, instantiate the NixtlaClient class to interact with Nixtla’s API.
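Instantiating the client could look like the sketch below. It assumes the key is either passed explicitly or stored in the NIXTLA_API_KEY environment variable; the helper names resolve_api_key and make_client are illustrative.

```python
import os


def resolve_api_key(api_key=None, env=None):
    """Return an explicit key if given, else the NIXTLA_API_KEY value from `env`."""
    env = os.environ if env is None else env
    key = api_key or env.get("NIXTLA_API_KEY")
    if not key:
        raise ValueError("Provide api_key or set NIXTLA_API_KEY.")
    return key


def make_client(api_key=None):
    """Create the client used for forecast and cross-validation calls."""
    from nixtla import NixtlaClient  # lazy import: needs the nixtla package

    return NixtlaClient(api_key=resolve_api_key(api_key))
```

Failing early on a missing key keeps the error on the driver, before any Spark job is submitted.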
Then use any method from NixtlaClient, such as forecast or cross_validation.
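Both calls take the Spark DataFrame directly. The following is a minimal sketch rather than the guide's original example: the function name run_timegpt, the horizon, and the window settings are illustrative choices, and the input is assumed to use the default unique_id, ds, and y columns.

```python
def run_timegpt(client, spark_df, horizon=14, model="timegpt-1"):
    """Forecast and cross-validate a Spark DataFrame with a NixtlaClient.

    `client` is expected to be a NixtlaClient instance and `spark_df` a
    Spark DataFrame with unique_id, ds, and y columns (assumed defaults).
    """
    # Forecast the next `horizon` steps for every series.
    forecasts = client.forecast(spark_df, h=horizon, model=model)

    # Evaluate on two rolling windows, each shifted by one full horizon.
    cv_results = client.cross_validation(
        spark_df, h=horizon, n_windows=2, step_size=horizon, model=model
    )
    return forecasts, cv_results
```

With the client and Spark DataFrame from the earlier steps, run_timegpt(client, spark_df) should return two Spark DataFrames; calling show(5) on either one triggers the distributed computation and prints the first rows.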
If you are using an Azure AI endpoint, set model="azureai". The public API supports two models: timegpt-1 (default) and timegpt-1-long-horizon. For long horizon forecasting, see the long-horizon model tutorial.
Step 5: Stop Spark
After completing your tasks, stop the Spark session by calling its stop() method to free resources.
Working with Exogenous Variables
TimeGPT with Spark also supports exogenous variables. Refer to the Exogenous Variables Tutorial for details. Simply substitute pandas DataFrames with Spark DataFrames; the API remains identical.
Related Resources
Explore more distributed forecasting options:
- Distributed Computing Overview - Compare Spark, Dask, and Ray
- Dask Integration - For datasets with 10M-100M observations
- Ray Integration - For ML pipeline integration
- Fine-tuning TimeGPT - Improve accuracy at scale
- Cross-Validation - Validate distributed forecasts