Building effective time series models requires creating dozens of engineered features: lag values from previous periods, rolling averages over different windows, seasonal indicators, and target transformations. The traditional approach involves writing hundreds of lines of custom pandas code, handling edge cases manually, and maintaining separate pipelines for each use case.
This manual process creates several problems:
- Time consuming: 1-2 hours per model just for feature creation
- Error prone: Missing values, incorrect window calculations, transformation reversals
- Inconsistent: Different feature implementations across models and teams
- Maintenance heavy: Breaking when data schemas or patterns change
MLforecast eliminates this pain with automated feature engineering that's faster, more reliable, and consistent across all your forecasting models.
Before diving into automated feature engineering, establish a baseline model: its performance provides essential context for measuring improvement and avoiding over-engineering.
The source code of this article can be found in the interactive Jupyter notebook.
Introduction to MLforecast
MLforecast is Nixtla's machine learning forecasting library that handles the complete time series modeling pipeline:
- Automated feature engineering: Lag features, rolling statistics, date features
- Model training: Multiple algorithms with unified API
- Cross-validation: Time series-aware validation splits
- Prediction generation: Forecasts with automatic transformation handling
You define models and features through simple parameters while MLforecast handles the complex implementation details.
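To make "automatic transformation handling" concrete, here is a minimal sketch of the bookkeeping a hand-written pipeline needs when the target is differenced: the model predicts day-over-day changes, and those changes have to be accumulated back onto the last observed level. It uses only NumPy, and every number in it (including `predicted_changes`, which stands in for model output) is hypothetical.

```python
import numpy as np

# Hypothetical daily sales with an upward trend
sales = np.array([100.0, 104.0, 103.0, 108.0, 112.0])

# Forward transform: first differences drop the level and keep day-over-day changes
changes = np.diff(sales)

# Stand-in for model output: predicted changes for the next three days (made up)
predicted_changes = np.array([3.0, -2.0, 4.0])

# Inverse transform: accumulate the predicted changes onto the last observed level
forecast = sales[-1] + np.cumsum(predicted_changes)

print("Differenced target:", changes)
print("Forecast on the original scale:", forecast)
```

Skipping the inverse step, or anchoring it to the wrong last value, is the kind of transformation-reversal mistake listed among the problems above. A target transform such as the `Differences` class imported in the setup section is meant to take over this bookkeeping.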
To install MLforecast, run:
pip install mlforecast

Other dependencies for the examples in this article:
pip install pandas numpy lightgbm matplotlib

We'll use LightGBM for our machine learning models and matplotlib for the plot throughout this tutorial.
Setup - Installation and Basic Configuration
Import the necessary libraries:
import pandas as pd
import numpy as np
from mlforecast import MLForecast
from mlforecast.lag_transforms import RollingMean, ExpandingMean
from mlforecast.target_transforms import Differences
import lightgbm as lgb

Let's start with a simple e-commerce demand forecasting scenario:
# Generate sample e-commerce sales data
np.random.seed(42)
dates = pd.date_range("2023-01-01", "2024-12-01", freq="D")
products = ["product_1", "product_2", "product_3"]
data = []
for product in products:
    # Create realistic sales patterns with trend and seasonality
    trend = np.linspace(100, 200, len(dates))
    seasonal = 50 * np.sin(2 * np.pi * np.arange(len(dates)) / 7)  # Weekly pattern
    noise = np.random.normal(0, 20, len(dates))
    sales = np.maximum(0, trend + seasonal + noise)
    product_data = pd.DataFrame({"unique_id": product, "ds": dates, "y": sales})
    data.append(product_data)
sales_data = pd.concat(data, ignore_index=True)
print(f"Dataset shape: {sales_data.shape}")
sales_data.head()

Dataset shape: (2103, 3)

| unique_id | ds | y |
|---|---|---|
| product_1 | 2023-01-01 | 109.934283 |
| product_1 | 2023-01-02 | 136.469145 |
| product_1 | 2023-01-03 | 161.985881 |
| product_1 | 2023-01-04 | 152.583356 |
| product_1 | 2023-01-05 | 74.194174 |
Now let's configure MLforecast with basic automated features:
# Basic MLforecast configuration with automated features
fcst = MLForecast(
    models=lgb.LGBMRegressor(verbosity=-1),
    freq="D",
    lags=[1, 7, 14],  # Previous day, week, and two weeks
    date_features=["dayofweek", "month"],  # Automatic date features
)
print("Configured features:")
print(f"Lags: {fcst.ts.lags}")
print(f"Date features: {fcst.ts.date_features}")

Configured features:
Lags: [1, 7, 14]
Date features: ['dayofweek', 'month']

Automated Lag Feature Engineering - Replacing Manual Lag Creation
Lag features use previous time periods' values to predict future outcomes. A lag-1 feature contains yesterday's sales value, lag-7 contains last week's value, and so on. These historical values often predict future patterns better than raw timestamps alone.
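As a minimal sketch of what a lag feature is, the toy example below shifts a short series with plain pandas. The `toy` frame and its values are made up for illustration, and it uses lag 3 instead of lag 7 only to keep the data short.

```python
import pandas as pd

# Toy data (hypothetical values): one product, six days of sales
toy = pd.DataFrame(
    {
        "unique_id": "product_1",
        "ds": pd.date_range("2023-01-01", periods=6, freq="D"),
        "y": [10.0, 12.0, 11.0, 15.0, 14.0, 13.0],
    }
)

# lag1 holds the previous day's value, lag3 the value from three days earlier.
# Grouping by unique_id keeps one product's history from leaking into another's.
for lag in (1, 3):
    toy[f"lag{lag}"] = toy.groupby("unique_id")["y"].shift(lag)

# The first max(lags) rows of each series have no complete history, so they hold NaN
complete_rows = toy.dropna(subset=["lag1", "lag3"])

print(toy)
print(f"Rows with complete lags: {len(complete_rows)} of {len(toy)}")
```

The same reasoning explains why a configuration with lags of 1, 7, and 14 can only produce complete feature rows from the 15th observation of each series onward.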
Creating lag features by hand with pandas means writing and maintaining custom code for every pipeline. Here's what you would typically write:
# Traditional manual approach
def create_features_manually(df, lags, date_features):
    """Manual feature creation - replicates MLforecast preprocessing"""
    df_with_features = df.copy()

    # Create lag features with MLforecast naming
    for lag in lags:
        df_with_features[f"lag{lag}"] = df_with_features.groupby("unique_id")[
            "y"
        ].shift(lag)

    # Create date features
    for feature in date_features:
        if feature == "dayofweek":
            df_with_features["dayofweek"] = df_with_features["ds"].dt.dayofweek
        elif feature == "month":
            df_with_features["month"] = df_with_features["ds"].dt.month

    # Remove rows where any lag feature is NaN
    lag_columns = [f"lag{lag}" for lag in lags]
    df_with_features = df_with_features.dropna(subset=lag_columns)
    return df_with_features

# Manual approach demonstration
manual_result = create_features_manually(
    sales_data, lags=[1, 7, 14], date_features=["dayofweek", "month"]
)

MLforecast handles all this complexity automatically. The preprocess() method:
- Reads your lag configuration (lags=[1, 7, 14])
- Creates the lag columns efficiently for each series
- Adds configured date features automatically
- Filters out rows where lag values cannot be calculated
# MLforecast automated approach
# Lags are created automatically when preprocessing
preprocessed_data = fcst.preprocess(sales_data)
print("Automatically created features:")
print(preprocessed_data.columns.tolist())
# Show lag features for one product
product_sample = preprocessed_data[preprocessed_data["unique_id"] == "product_1"]
print("\nLag features for product_1 (first 5 rows):")
print(product_sample[["ds", "y", "lag1", "lag7", "lag14"]].head(5))

Automatically created features:
['unique_id', 'ds', 'y', 'lag1', 'lag7', 'lag14', 'dayofweek', 'month']
Lag features for product_1 (first 5 rows):
           ds           y        lag1        lag7       lag14
14 2023-01-15   67.501643   24.499964  116.348695  109.934283
15 2023-01-16  129.988681   67.501643  130.844944  136.469145
16 2023-01-17  130.775487  129.988681  160.883311  161.985881
17 2023-01-18  130.407705  130.775487  113.854405  152.583356
18 2023-01-19   62.716760  130.407705   70.562647   74.194174

Advanced Lag Features - Rolling Statistics and Expanding Means
Beyond raw lag values, MLforecast can compute statistics over lagged values through lag transforms, which capture smoother patterns than a single past observation.
Lag transforms work in two steps:
1. Create raw historical values with lags=[1, 7, 14]
   - lag1 = yesterday's exact sales (150 units)
   - lag7 = last week's exact sales (120 units)
2. Apply statistics to those lag features with lag_transforms
   - RollingMean(window_size=7) on lag1 = average of the 7 most recent values ending yesterday (145 units)
   - ExpandingMean() on lag7 = average of all values up to 7 days ago, which keeps updating as history grows (from 120 to 135 units over time)
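The two steps can be reproduced with plain pandas, which makes it concrete which observations each feature sees. This is a sketch of an equivalent computation on a single toy series, not MLforecast's internal implementation, and the series values are hypothetical.

```python
import pandas as pd

# Toy daily series (hypothetical values): 1, 2, ..., 20
y = pd.Series(range(1, 21), dtype="float64")

# Step 1: raw lags
lag1 = y.shift(1)  # value from 1 step back
lag7 = y.shift(7)  # value from 7 steps back

# Step 2: statistics computed over the lagged series
# 7-step rolling mean of lag1: average of y[t-7] ... y[t-1]
rolling_mean_lag1 = lag1.rolling(window=7).mean()
# Expanding mean of lag7: average of every value up to and including y[t-7]
expanding_mean_lag7 = lag7.expanding().mean()

features = pd.DataFrame(
    {
        "y": y,
        "rolling_mean_lag1_window_size7": rolling_mean_lag1,
        "expanding_mean_lag7": expanding_mean_lag7,
    }
)
print(features.iloc[6:12])
```

Neither feature touches the current value `y[t]`, which is what keeps the target from leaking into its own predictors. With several series in one frame, the same shifts and windows would have to be applied per `unique_id`.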
# Enhanced MLforecast with lag transforms
fcst_enhanced = MLForecast(
    models=lgb.LGBMRegressor(verbosity=-1),
    freq="D",
    lags=[1, 7, 14],
    lag_transforms={
        1: [RollingMean(window_size=7)],  # Mean of the 7 values ending yesterday
        7: [ExpandingMean()],  # Mean of all values up to 7 days ago
    },
    date_features=["dayofweek", "month"],
)
# Process data with enhanced lag features
enhanced_data = fcst_enhanced.preprocess(sales_data)

Now let's examine what enhanced features were automatically created and view the transformed data:
print("Enhanced lag features:")
print(enhanced_data.columns.tolist())
# Show enhanced features for one product
enhanced_sample = enhanced_data[enhanced_data["unique_id"] == "product_1"].head(10)
print("\nEnhanced features for product_1 (first 5 rows):")
print(
    enhanced_sample[
        ["ds", "y", "rolling_mean_lag1_window_size7", "expanding_mean_lag7"]
    ].head()
)

Enhanced lag features:
['unique_id', 'ds', 'y', 'lag1', 'lag7', 'lag14', 'rolling_mean_lag1_window_size7', 'expanding_mean_lag7', 'dayofweek', 'month']
Enhanced features for product_1 (first 5 rows):
           ds           y  rolling_mean_lag1_window_size7  expanding_mean_lag7
14 2023-01-15   67.501643                       96.400157           111.518814
15 2023-01-16  129.988681                       89.422007           113.666161
16 2023-01-17  130.775487                       89.299684           118.387876
17 2023-01-18  130.407705                       84.998566           117.975743
18 2023-01-19   62.716760                       87.363323           114.024651

Let's prepare some data to visualize how these transforms work:
# Prepare data for visualization comparison
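# .copy() avoids pandas' SettingWithCopyWarning when adding columns to a slice
product_viz = sales_data[sales_data["unique_id"] == "product_1"].tail(60).copy()
# Computed on the raw series for illustration (no lag shift applied)
product_viz["rolling_7"] = product_viz["y"].rolling(7).mean()
product_viz["expanding"] = product_viz["y"].expanding().mean()

Now create a visualization to compare the different patterns: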
# Visualize the different patterns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(product_viz["ds"], product_viz["y"], label="Original Sales", alpha=0.6)
ax.plot(product_viz["ds"], product_viz["rolling_7"], label="7-day Rolling Mean")
ax.plot(product_viz["ds"], product_viz["expanding"], label="Expanding Mean")
ax.legend()
plt.show()