# Audit and Clean Data
Learn how to audit and clean your data with TimeGPT.
The `audit_data` and `clean_data` methods from TimeGPT can help you identify and fix potential issues in your data.

The `audit_data` method checks for common problems such as duplicates, missing dates, categorical columns, negative values, and leading zeros. While not all issues will result in errors, addressing them can improve the quality of your forecasts, depending on your specific use case. Once identified, these issues can be fixed automatically with `clean_data`.
## How to Use the Audit and Clean Methods

### Step 1: Import Packages
To use the `audit_data` and `clean_data` methods, you first need to import and instantiate the `NixtlaClient` class.
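A minimal setup might look like the following sketch. It assumes the `nixtla` package is installed and that you have a Nixtla API key (the key shown is a placeholder):

```python
from nixtla import NixtlaClient

# Instantiate the client with your API key (placeholder value).
nixtla_client = NixtlaClient(api_key="my_api_key_provided_by_nixtla")
```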
### Step 2: Create a Minimal Example
The `audit_data` method performs a series of checks to identify issues in your data. These checks fall into two categories:

Check Type | Description | Checks Performed |
---|---|---|
Fail | Issues that will cause errors when you run TimeGPT | Duplicate rows (D001), missing dates (D002), categorical columns (F001) |
Case-specific | Issues that may not cause errors but could negatively affect your results | Negative values (V001), leading zeros (V002) |
To show how the `audit_data` method works, we will create a sample dataset with missing dates, negative values, and leading zeros.
unique_id | ds | y |
---|---|---|
id1 | 2023-01-01 | 1 |
id1 | 2023-01-03 | 1 |
id1 | 2023-01-04 | 1 |
id2 | 2023-01-01 | 0 |
id2 | 2023-01-02 | 0 |
id2 | 2023-01-03 | 1 |
id2 | 2023-01-04 | 2 |
id3 | 2023-01-01 | -1 |
id3 | 2023-01-02 | 0 |
id3 | 2023-01-03 | 1 |
id3 | 2023-01-04 | -2 |
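The sample dataset in the table above can be built with pandas, for example:

```python
import pandas as pd

# Sample dataset with a missing date (id1), leading zeros (id2),
# and negative values (id3).
df = pd.DataFrame(
    {
        "unique_id": ["id1"] * 3 + ["id2"] * 4 + ["id3"] * 4,
        "ds": pd.to_datetime(
            ["2023-01-01", "2023-01-03", "2023-01-04"]  # id1: 2023-01-02 missing
            + ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"]  # id2
            + ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"]  # id3
        ),
        "y": [1, 1, 1, 0, 0, 1, 2, -1, 0, 1, -2],
    }
)
```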
### Step 3: Audit Data
The `audit_data` method requires the following parameters:

- `df` (required): A pandas DataFrame with your input data.
- `freq` (required): The frequency of your time series data (e.g., `D` for daily, `M` for monthly).
- `id_col`: Column name identifying each unique series. Default is `unique_id`.
- `time_col`: Column name containing timestamps. Default is `ds`.
- `target_col`: Column name containing the target variable. Default is `y`.
Additionally, you can use the following optional parameters to specify how missing dates are identified:

- `start`: The initial timestamp for the series.
- `end`: The final timestamp for the series.

Both `start` and `end` can take the following options:

- `per_serie`: Uses the first or last timestamp of each individual series.
- `global`: Uses the earliest or latest timestamp from the entire dataset.
- A specific timestamp or integer (e.g., `2025-01-01`, `2025`, or `datetime(2025, 1, 1)`).
The `audit_data` method returns three values:

- `all_pass` (bool): `True` if every check passed, otherwise `False`.
- `fail_dfs` (dict): Any failed tests (D001, D002, or F001), each paired with the rows that failed.
- `case_specific_dfs` (dict): Any case-specific tests (V001 or V002), each paired with the rows flagged.
In the example above, the `audit_data` method should find missing dates (D002), negative values (V001), and leading zeros (V002).
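Putting this together, a call might look like the following sketch. It assumes `df` holds the sample dataset from Step 2 and `nixtla_client` is the `NixtlaClient` instance from Step 1:

```python
# Sketch: run the audit checks on the sample dataset.
# Assumes `nixtla_client` and `df` were created in the previous steps.
all_pass, fail_dfs, case_specific_dfs = nixtla_client.audit_data(
    df=df,
    freq="D",  # daily data
)

# The sample data has issues, so all_pass should be False, and the
# dictionaries should contain entries such as D002, V001, and V002.
print(all_pass)
print(fail_dfs.keys())
print(case_specific_dfs.keys())
```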
### Step 4: Clean Data
The `clean_data` method fixes the issues identified by the `audit_data` method. It requires the output of `audit_data`, so it must always be run after it. The `clean_data` method takes the following parameters:
- `df` (required): A pandas DataFrame with your input data.
- `fail_dict` (required): A dictionary with failed checks, as returned by the `audit_data` method.
- `case_specific_dict` (required): A dictionary with case-specific checks, also returned by the `audit_data` method.
- `freq` (required): The frequency of your time series data (e.g., `D` for daily, `M` for monthly). Can be a string, integer, or pandas offset.
- `clean_case_specific`: Whether to clean case-specific issues (e.g., negative values, leading zeros). Default is `False`.
- `id_col`: Column name identifying each unique series. Default is `unique_id`.
- `time_col`: Column name containing timestamps or integer steps. Default is `ds`.
- `target_col`: Column name containing the target variable. Default is `y`.
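A call might look like the sketch below, assuming `df` and the dictionaries returned by `audit_data` in Step 3. Setting `clean_case_specific=True` also fixes the negative values and leading zeros flagged in this example:

```python
# Sketch: fix the issues identified by audit_data.
df_clean = nixtla_client.clean_data(
    df=df,
    fail_dict=fail_dfs,
    case_specific_dict=case_specific_dfs,
    freq="D",
    clean_case_specific=True,  # also fix negative values and leading zeros
)
```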
unique_id | ds | y |
---|---|---|
id1 | 2023-01-01 | 1.0 |
id1 | 2023-01-03 | 1.0 |
id1 | 2023-01-04 | 1.0 |
id1 | 2023-01-02 | NaN |
id2 | 2023-01-03 | 1.0 |
id2 | 2023-01-04 | 2.0 |
id3 | 2023-01-01 | 0.0 |
id3 | 2023-01-02 | 0.0 |
id3 | 2023-01-03 | 1.0 |
id3 | 2023-01-04 | 0.0 |
In this example, `clean_data` added the missing date in `id1`, removed the leading zeros in `id2`, and replaced the negative values in `id3`. However, replacing the negative values with zeros introduced new leading zeros in `id3`, so a second run of `clean_data` is required.
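Since `clean_data` requires the output of `audit_data`, the second pass re-audits the cleaned data first. The sketch below assumes the first-pass result is stored in a DataFrame named `df_clean`:

```python
# Sketch: re-audit the once-cleaned data, then clean it again.
all_pass, fail_dfs, case_specific_dfs = nixtla_client.audit_data(
    df=df_clean,
    freq="D",
)
df_clean = nixtla_client.clean_data(
    df=df_clean,
    fail_dict=fail_dfs,
    case_specific_dict=case_specific_dfs,
    freq="D",
    clean_case_specific=True,
)
```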
unique_id | ds | y |
---|---|---|
id1 | 2023-01-01 | 1.0 |
id1 | 2023-01-03 | 1.0 |
id1 | 2023-01-04 | 1.0 |
id1 | 2023-01-02 | NaN |
id2 | 2023-01-03 | 1.0 |
id2 | 2023-01-04 | 2.0 |
id3 | 2023-01-03 | 1.0 |
id3 | 2023-01-04 | 0.0 |
After the second run of `clean_data`, the leading zeros in `id3` have been removed. The only remaining step is to fill the missing value created when the missing date was added in `id1`, and to sort the DataFrame by `unique_id` and `ds`.
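These last two steps can be done with plain pandas. The sketch below rebuilds the cleaned result from the table above in a DataFrame named `df_clean` and fills the missing value with zero, which matches the final table; depending on your data, interpolation or a forward fill may be more appropriate:

```python
import pandas as pd

# Cleaned data after the second clean_data pass (see the table above).
df_clean = pd.DataFrame(
    {
        "unique_id": ["id1", "id1", "id1", "id1", "id2", "id2", "id3", "id3"],
        "ds": pd.to_datetime(
            ["2023-01-01", "2023-01-03", "2023-01-04", "2023-01-02",
             "2023-01-03", "2023-01-04", "2023-01-03", "2023-01-04"]
        ),
        "y": [1.0, 1.0, 1.0, None, 1.0, 2.0, 1.0, 0.0],
    }
)

# Fill the missing value created for the inserted date, then sort.
df_final = (
    df_clean.fillna({"y": 0.0})
    .sort_values(["unique_id", "ds"])
    .reset_index(drop=True)
)
```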
unique_id | ds | y |
---|---|---|
id1 | 2023-01-01 | 1.0 |
id1 | 2023-01-02 | 0.0 |
id1 | 2023-01-03 | 1.0 |
id1 | 2023-01-04 | 1.0 |
id2 | 2023-01-03 | 1.0 |
id2 | 2023-01-04 | 2.0 |
id3 | 2023-01-03 | 1.0 |
id3 | 2023-01-04 | 0.0 |
## Conclusion
The `audit_data` method helps you identify issues that may prevent TimeGPT from running properly. These include fail tests (duplicate rows, missing dates, and categorical feature columns), which will always result in errors if not addressed. It also flags case-specific issues (negative values and leading zeros), which may not cause errors but can affect the quality of your forecasts depending on your use case.

The `clean_data` method can automatically fix the issues identified by `audit_data`. Be cautious when removing negative values or leading zeros, as they may contain important information about your data. Above all, when auditing and cleaning your data, make decisions based on the needs and context of your specific use case.