I have a dataframe with two columns that I convert to datetime and numeric, respectively, in order to filter out bad data coming from the source. Currently I simply drop the resulting NaNs; instead, I would like to keep track of them so that I have a log of the invalid entries and why they were invalid.
# Format the datetime to only accept the valid datetimes.
df['ACTUAL_DATETIME'] = pd.to_datetime(df['STRING_DATETIME'], format='%Y-%m-%d%H:%M:%S', errors='coerce')
df.dropna(subset=['ACTUAL_DATETIME'], inplace=True)
# Format the contract account number
df['ACTUAL_ACCOUNT'] = pd.to_numeric(df['STRING_ACCOUNT'], errors='coerce')
# No NAs in the account or call time are allowed.
df.dropna(subset=['ACTUAL_ACCOUNT'], inplace=True)
Minimum Reproducible Example:
Input:
STRING_DATETIME STRING_ACCOUNT
0 2022-04-01 08:57:25.148851 123
1 2022-04-01 08:57:25.148851 MY_INVALID_ACCOUNT
2 MY_INVALID_DATETIME 123
3 2022-04-01 08:57:25.148851 123
from datetime import datetime as dt
import pandas as pd

df = pd.DataFrame(
{
'STRING_DATETIME': [dt.today(), dt.today(), 'MY_INVALID_DATETIME', dt.today()],
'STRING_ACCOUNT': [123, 'MY_INVALID_ACCOUNT', 123, 123]
}
)
Desired Output: On the one hand, the clean dataframe:
ACTUAL_DATETIME ACTUAL_ACCOUNT
0 2022-04-01 08:57:26.955440 123
3 2022-04-01 08:57:26.955440 123
pd.DataFrame(
{
# Two rows survive (index 0 and 3), so each column needs two values.
'ACTUAL_DATETIME': [dt.today(), dt.today()],
'ACTUAL_ACCOUNT': [123, 123]
},
index = [0, 3]
)
On the other hand, the rejected entries with the log of why they were rejected:
STRING_DATETIME STRING_ACCOUNT LOG
1 2022-04-01 08:57:28.674936 MY_INVALID_ACCOUNT INVALID_ACCOUNT
2 MY_INVALID_DATETIME 123 INVALID_DATETIME
pd.DataFrame(
{
'STRING_DATETIME': [dt.today(), 'MY_INVALID_DATETIME'],
'STRING_ACCOUNT': ['MY_INVALID_ACCOUNT', 123],
'LOG': ['INVALID_ACCOUNT', 'INVALID_DATETIME']
},
index = [1, 2]
)
CodePudding user response:
IIUC, you are almost there. Build one boolean mask per column and use boolean indexing:
m1 = pd.to_datetime(df['STRING_DATETIME'], errors='coerce').notna()
m2 = pd.to_numeric(df['STRING_ACCOUNT'], errors='coerce').notna()
valid_df = df[m1 & m2]
reject_df = df[~m1 | ~m2]
Output:
>>> valid_df
STRING_DATETIME STRING_ACCOUNT
0 2022-04-01 08:57:25.148851 123
3 2022-04-01 08:57:25.148851 123
>>> reject_df
STRING_DATETIME STRING_ACCOUNT
1 2022-04-01 08:57:25.148851 MY_INVALID_ACCOUNT
2 MY_INVALID_DATETIME 123
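The masks above split the rows but do not yet produce the LOG column or the converted ACTUAL_* columns from the question. A minimal sketch of one way to get both, assuming the datetimes arrive as strings (a fixed sample value is used here instead of dt.today()), that np.select is an acceptable way to label the rows, and that a row failing both checks should get a combined label (that combined label is my assumption, not something the question specifies):

```python
import numpy as np
import pandas as pd

# Sample input mirroring the question; the timestamp string is illustrative.
df = pd.DataFrame(
    {
        'STRING_DATETIME': ['2022-04-01 08:57:25', '2022-04-01 08:57:25',
                            'MY_INVALID_DATETIME', '2022-04-01 08:57:25'],
        'STRING_ACCOUNT': [123, 'MY_INVALID_ACCOUNT', 123, 123],
    }
)

# Convert once and keep the results so they can be reused for the clean frame.
parsed_dt = pd.to_datetime(df['STRING_DATETIME'], errors='coerce')
parsed_acc = pd.to_numeric(df['STRING_ACCOUNT'], errors='coerce')
m1 = parsed_dt.notna()   # True where the datetime parsed
m2 = parsed_acc.notna()  # True where the account parsed

# Label each row; conditions are checked in order, so "both invalid" goes first.
log = pd.Series(
    np.select(
        [~m1 & ~m2, ~m1, ~m2],
        ['INVALID_DATETIME,INVALID_ACCOUNT', 'INVALID_DATETIME', 'INVALID_ACCOUNT'],
        default='',
    ),
    index=df.index,
)

# Clean frame with converted columns; the account is cast back to int
# because the coerced column is float while it still contains NaN.
valid_df = pd.DataFrame(
    {'ACTUAL_DATETIME': parsed_dt, 'ACTUAL_ACCOUNT': parsed_acc}
)[m1 & m2].astype({'ACTUAL_ACCOUNT': 'int64'})

# Rejected rows keep the original strings plus the reason; assign aligns on index.
reject_df = df[~(m1 & m2)].assign(LOG=log)

print(valid_df)
print(reject_df)
```

Keeping the original string columns in reject_df means the log shows exactly what the source sent, while valid_df carries only the converted values.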