Create clean process dataframe and log of failures

Time:04-02

I have a dataframe with two columns that I convert to datetime and numeric types, respectively, to filter out bad data coming from the source. Currently I simply drop the resulting NaNs; instead, I would like to keep the invalid entries in a log, together with the reason each one was rejected.

# Format the datetime to only accept the valid datetimes.
df['ACTUAL_DATETIME'] = pd.to_datetime(df['STRING_DATETIME'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
df.dropna(subset=['ACTUAL_DATETIME'], inplace=True)

# Format the contract account number
df['ACTUAL_ACCOUNT'] = pd.to_numeric(df['STRING_ACCOUNT'], errors='coerce')
# No NAs in the account or call time are allowed.
df.dropna(subset=['ACTUAL_ACCOUNT'], inplace=True)

Minimum Reproducible Example:

Input:

               STRING_DATETIME      STRING_ACCOUNT
0   2022-04-01 08:57:25.148851                 123
1   2022-04-01 08:57:25.148851  MY_INVALID_ACCOUNT
2          MY_INVALID_DATETIME                 123
3   2022-04-01 08:57:25.148851                 123

import pandas as pd
from datetime import datetime as dt

df = pd.DataFrame(
    {
        'STRING_DATETIME': [dt.today(), dt.today(), 'MY_INVALID_DATETIME', dt.today()],
        'STRING_ACCOUNT': [123, 'MY_INVALID_ACCOUNT', 123, 123]
    }
)

Desired Output: On the one hand, the clean dataframe:

               ACTUAL_DATETIME  ACTUAL_ACCOUNT
0   2022-04-01 08:57:26.955440             123
3   2022-04-01 08:57:26.955440             123

pd.DataFrame(
    {
        # Two rows survive (index 0 and 3), so each column needs two values.
        'ACTUAL_DATETIME': [dt.today(), dt.today()],
        'ACTUAL_ACCOUNT': [123, 123]
    },
    index = [0, 3]
)

On the other hand, the rejected entries with the log of why they were rejected:

               STRING_DATETIME   STRING_ACCOUNT               LOG
1   2022-04-01 08:57:28.674936  INVALID_ACCOUNT   INVALID_ACCOUNT
2                         ASAS              123  INVALID_DATETIME

pd.DataFrame(
    {
        'STRING_DATETIME': [dt.today(), 'ASAS'],
        'STRING_ACCOUNT': ['INVALID_ACCOUNT', 123],
        'LOG': ['INVALID_ACCOUNT', 'INVALID_DATETIME']
    },
    index = [1, 2]
)

Answer:

IIUC, you are almost there. Build a boolean mask from each conversion and use boolean indexing:

m1 = pd.to_datetime(df['STRING_DATETIME'], errors='coerce').notna()
m2 = pd.to_numeric(df['STRING_ACCOUNT'], errors='coerce').notna()

valid_df = df[m1 & m2]
reject_df = df[~m1 | ~m2]

Output:

>>> valid_df
              STRING_DATETIME STRING_ACCOUNT
0  2022-04-01 08:57:25.148851            123
3  2022-04-01 08:57:25.148851            123

>>> reject_df
              STRING_DATETIME      STRING_ACCOUNT
1  2022-04-01 08:57:25.148851  MY_INVALID_ACCOUNT
2         MY_INVALID_DATETIME                 123
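
The masks above split the rows but do not yet produce the LOG column or the converted ACTUAL_* columns from the desired output. A minimal sketch of one way to add them, reusing the same two masks, might look like the following. The names parsed_dt, parsed_acct, rejected and log are illustrative, and the combined label for rows where both columns fail is an assumption, since the question does not say how that case should be reported.

```python
from datetime import datetime as dt

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'STRING_DATETIME': [dt.today(), dt.today(), 'MY_INVALID_DATETIME', dt.today()],
        'STRING_ACCOUNT': [123, 'MY_INVALID_ACCOUNT', 123, 123],
    }
)

# errors='coerce' turns anything unparseable into NaT / NaN.
parsed_dt = pd.to_datetime(df['STRING_DATETIME'], errors='coerce')
parsed_acct = pd.to_numeric(df['STRING_ACCOUNT'], errors='coerce')

m1 = parsed_dt.notna()    # True where the datetime is valid
m2 = parsed_acct.notna()  # True where the account is valid
rejected = ~(m1 & m2)     # same rows as ~m1 | ~m2

# Pick a reason per row; the first matching condition wins.
# The combined label for a double failure is an assumption.
log = pd.Series(
    np.select(
        [~m1 & ~m2, ~m1, ~m2],
        ['INVALID_DATETIME, INVALID_ACCOUNT', 'INVALID_DATETIME', 'INVALID_ACCOUNT'],
        default='',
    ),
    index=df.index,
)

# Clean frame with the converted values, original index preserved.
valid_df = pd.DataFrame(
    {
        'ACTUAL_DATETIME': parsed_dt[~rejected],
        'ACTUAL_ACCOUNT': parsed_acct[~rejected].astype('int64'),
    }
)

# Rejected rows keep their original strings plus the reason.
reject_df = df[rejected].assign(LOG=log[rejected])
```

Note that notna() also flags values that were already missing in the source, so those rows would be logged with the same reason as unparseable ones; a separate condition would be needed to tell the two apart.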