# Audit and Clean Data
Learn how to audit and clean your data with TimeGPT.
The `audit_data` and `clean_data` methods from TimeGPT can help you identify and fix potential issues in your data.

The `audit_data` method checks for common problems such as duplicates, missing dates, categorical columns, negative values, and leading zeros. While not all issues will result in errors, addressing them can improve the quality of your forecasts, depending on your specific use case. Once identified, these issues can be fixed automatically with `clean_data`.
## How to Use the Audit and Clean Methods

### Step 1: Import Packages
To use the `audit_data` and `clean_data` methods, you first need to import and instantiate the `NixtlaClient` class.
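A minimal setup might look like the following sketch. It assumes the `nixtla` package is installed and that you have a Nixtla API key (the key shown is a placeholder):

```python
from nixtla import NixtlaClient

# Instantiate the client with your API key (placeholder value).
nixtla_client = NixtlaClient(api_key="my_api_key_provided_by_nixtla")
```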
### Step 2: Create a Minimal Example
The `audit_data` method performs a series of checks to identify issues in your data. These checks fall into two categories:

Check Type | Description | Checks Performed |
---|---|---|
Fail | Issues that will cause errors when you run TimeGPT | Duplicate rows (D001), missing dates (D002), categorical columns (F001) |
Case-specific | Issues that may not cause errors but could negatively affect your results | Negative values (V001), leading zeros (V002) |
To show how the `audit_data` method works, we will create a sample dataset with missing dates, negative values, and leading zeros.
unique_id | ds | y |
---|---|---|
id1 | 2023-01-01 | 1 |
id1 | 2023-01-03 | 1 |
id1 | 2023-01-04 | 1 |
id2 | 2023-01-01 | 0 |
id2 | 2023-01-02 | 0 |
id2 | 2023-01-03 | 1 |
id2 | 2023-01-04 | 2 |
id3 | 2023-01-01 | -1 |
id3 | 2023-01-02 | 0 |
id3 | 2023-01-03 | 1 |
id3 | 2023-01-04 | -2 |
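The sample dataset in the table above can be built with pandas, for example:

```python
import pandas as pd

# Sample dataset with a missing date (id1), leading zeros (id2),
# and negative values (id3).
df = pd.DataFrame(
    {
        "unique_id": ["id1"] * 3 + ["id2"] * 4 + ["id3"] * 4,
        "ds": pd.to_datetime(
            ["2023-01-01", "2023-01-03", "2023-01-04"]  # id1: 2023-01-02 missing
            + ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"]  # id2
            + ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"]  # id3
        ),
        "y": [1, 1, 1, 0, 0, 1, 2, -1, 0, 1, -2],
    }
)
```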
### Step 3: Audit Data
The `audit_data` method requires the following parameters:

- `df` (required): A pandas DataFrame with your input data.
- `freq` (required): The frequency of your time series data (e.g., `D` for daily, `M` for monthly).
- `id_col`: Column name identifying each unique series. Default is `unique_id`.
- `time_col`: Column name containing timestamps. Default is `ds`.
- `target_col`: Column name containing the target variable. Default is `y`.
Additionally, you can use the following optional parameters to specify how missing dates are identified:

- `start`: The initial timestamp for the series.
- `end`: The final timestamp for the series.

Both `start` and `end` can take the following options:

- `per_serie`: Uses the first or last timestamp of each individual series.
- `global`: Uses the earliest or latest timestamp from the entire dataset.
- A specific timestamp or integer (e.g., `2025-01-01`, `2025`, or `datetime(2025, 1, 1)`).
The `audit_data` method returns three values:

- `all_pass` (bool): `True` if every check passed, otherwise `False`.
- `fail_dfs` (dict): Any failed tests (D001, D002, or F001), each paired with the rows that failed.
- `case_specific_dfs` (dict): Any case-specific tests (V001 or V002), each paired with the rows flagged.
In the example above, the `audit_data` method should find missing dates (D002), negative values (V001), and leading zeros (V002).
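Putting this together, a call might look like the following sketch. It assumes `df` holds the sample dataset from Step 2 and `nixtla_client` is the `NixtlaClient` instance from Step 1:

```python
# Sketch: run the audit checks on the sample dataset.
# Assumes `nixtla_client` and `df` were created in the previous steps.
all_pass, fail_dfs, case_specific_dfs = nixtla_client.audit_data(
    df=df,
    freq="D",  # daily data
)

# The sample data has issues, so all_pass should be False, and the
# dictionaries should contain entries such as D002, V001, and V002.
print(all_pass)
print(fail_dfs.keys())
print(case_specific_dfs.keys())
```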
### Step 4: Clean Data
The `clean_data` method fixes the issues identified by the `audit_data` method. It requires the output of `audit_data`, so it must always be run after it. The `clean_data` method takes the following parameters:
- `df` (required): A pandas DataFrame with your input data.
- `fail_dict` (required): A dictionary with failed checks, as returned by the `audit_data` method.
- `case_specific_dict` (required): A dictionary with case-specific checks, also returned by the `audit_data` method.
- `freq` (required): The frequency of your time series data (e.g., `D` for daily, `M` for monthly). Can be a string, integer, or pandas offset.
- `clean_case_specific`: Whether to clean case-specific issues (e.g., negative values, leading zeros). Default is `False`.
- `id_col`: Column name identifying each unique series. Default is `unique_id`.
- `time_col`: Column name containing timestamps or integer steps. Default is `ds`.
- `target_col`: Column name containing the target variable. Default is `y`.
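A call might look like the sketch below, assuming `df` and the dictionaries returned by `audit_data` in Step 3. Setting `clean_case_specific=True` also fixes the negative values and leading zeros flagged in this example:

```python
# Sketch: fix the issues identified by audit_data.
df_clean = nixtla_client.clean_data(
    df=df,
    fail_dict=fail_dfs,
    case_specific_dict=case_specific_dfs,
    freq="D",
    clean_case_specific=True,  # also fix negative values and leading zeros
)
```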
unique_id | ds | y |
---|---|---|
id1 | 2023-01-01 | 1.0 |
id1 | 2023-01-03 | 1.0 |
id1 | 2023-01-04 | 1.0 |
id1 | 2023-01-02 | NaN |
id2 | 2023-01-03 | 1.0 |
id2 | 2023-01-04 | 2.0 |
id3 | 2023-01-01 | 0.0 |
id3 | 2023-01-02 | 0.0 |
id3 | 2023-01-03 | 1.0 |
id3 | 2023-01-04 | 0.0 |
In this example, `clean_data` added the missing date in `id1`, removed the leading zeros in `id2`, and replaced the negative values in `id3`. However, replacing the negative values with zeros introduced new leading zeros in `id3`, so a second run of `clean_data` is required.
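Since `clean_data` requires the output of `audit_data`, the second pass re-audits the cleaned data first. The sketch below assumes the first-pass result is stored in a DataFrame named `df_clean`:

```python
# Sketch: re-audit the once-cleaned data, then clean it again.
all_pass, fail_dfs, case_specific_dfs = nixtla_client.audit_data(
    df=df_clean,
    freq="D",
)
df_clean = nixtla_client.clean_data(
    df=df_clean,
    fail_dict=fail_dfs,
    case_specific_dict=case_specific_dfs,
    freq="D",
    clean_case_specific=True,
)
```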
unique_id | ds | y |
---|---|---|
id1 | 2023-01-01 | 1.0 |
id1 | 2023-01-03 | 1.0 |
id1 | 2023-01-04 | 1.0 |
id1 | 2023-01-02 | NaN |
id2 | 2023-01-03 | 1.0 |
id2 | 2023-01-04 | 2.0 |
id3 | 2023-01-03 | 1.0 |
id3 | 2023-01-04 | 0.0 |
After the second run of `clean_data`, the leading zeros in `id3` have been removed. The only remaining step is to fill the missing value created when the missing date was added in `id1`, and to sort the DataFrame by `unique_id` and `ds`.
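These last two steps can be done with plain pandas. The sketch below rebuilds the cleaned result from the table above in a DataFrame named `df_clean` and fills the missing value with zero, which matches the final table; depending on your data, interpolation or a forward fill may be more appropriate:

```python
import pandas as pd

# Cleaned data after the second clean_data pass (see the table above).
df_clean = pd.DataFrame(
    {
        "unique_id": ["id1", "id1", "id1", "id1", "id2", "id2", "id3", "id3"],
        "ds": pd.to_datetime(
            ["2023-01-01", "2023-01-03", "2023-01-04", "2023-01-02",
             "2023-01-03", "2023-01-04", "2023-01-03", "2023-01-04"]
        ),
        "y": [1.0, 1.0, 1.0, None, 1.0, 2.0, 1.0, 0.0],
    }
)

# Fill the missing value created for the inserted date, then sort.
df_final = (
    df_clean.fillna({"y": 0.0})
    .sort_values(["unique_id", "ds"])
    .reset_index(drop=True)
)
```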
unique_id | ds | y |
---|---|---|
id1 | 2023-01-01 | 1.0 |
id1 | 2023-01-02 | 0.0 |
id1 | 2023-01-03 | 1.0 |
id1 | 2023-01-04 | 1.0 |
id2 | 2023-01-03 | 1.0 |
id2 | 2023-01-04 | 2.0 |
id3 | 2023-01-03 | 1.0 |
id3 | 2023-01-04 | 0.0 |
## Conclusion
The `audit_data` method helps you identify issues that may prevent TimeGPT from running properly. These include fail tests (duplicate rows, missing dates, and categorical feature columns), which will always result in errors if not addressed. It also flags case-specific issues (negative values and leading zeros), which may not cause errors but can affect the quality of your forecasts depending on your use case.

The `clean_data` method can automatically fix the issues identified by `audit_data`. Be cautious when removing negative values or leading zeros, as they may contain important information about your data. Above all, when auditing and cleaning your data, make decisions based on the needs and context of your specific use case.