I'm looking at a dataframe full of orders, which have timestamps, order is and an operation status for inserting/updating/canceling the order. The dataframe looks as follows:
Index | TIMESTAMP | ORDER_ID | OPERATION |
---|---|---|---|
0 | 2022-09-01 13:49:00.000 | 123 | INSERT |
1 | 2022-09-01 13:49:25.000 | 123 | UPDATE |
2 | 2022-09-01 13:49:37.000 | 123 | CANCEL |
3 | 2022-09-01 13:49:48.000 | 456 | CANCEL |
4 | 2022-09-01 13:49:49.000 | 456 | UPDATE |
5 | 2022-09-01 13:49:50.000 | 456 | INSERT |
As part of my data analysis, I would like to clean up the data to filter out "flawed" orders.
I was able to filter out operations that are missing an insert/update/cancel using df.groupby('ORDER_ID').filter(lambda x: (x.OPERATION == "CANCEL").any())
(repeating the code with the different operations, not efficient).
Next, I would like to filter out those orders in which the "ordering" (i.e., time when they occur) of the operations is wrong, e.g., when a CANCEL (or UPDATE) occurs before an INSERT.
How could I filter out those cases? It should check the TIMESTAMP of an CANCEL operation for that order id group, and check if that timestamp is earlier than the TIMESTAMP of an INSERT operation. Ideally, the solution should be handle cases when there are multiple of the same OPERATION for the same order ID, e.g., in some cases the OPERATIONS are occuring more than once (at different timestamps).
My idea so far:
df.groupby(["ORDER_ID", "OPERATION"])['TIMESTAMP'].first()
gives me a table that shows the first occurences of each different OPERATION for the ORDER_IDs. Now I would just need some way to compare those returned values inside.
CodePudding user response:
Convert column to datetimes, replace not matched datimes to missing values in Series.where
and get minimal datetime per groups, last compare both columns:
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])
df1 = (df.assign(s1 = df['TIMESTAMP'].where(df['OPERATION'].eq('INSERT')),
s2 = df['TIMESTAMP'].where(df['OPERATION'].isin(['UPDATE','CANCEL'])))
.groupby('ORDER_ID')[['s1','s2']]
.transform('min'))
df = df[df1.s1.lt(df1.s2)]
print (df)
TIMESTAMP ORDER_ID OPERATION
0 2022-09-01 13:49:00 123 INSERT
1 2022-09-01 13:49:25 123 UPDATE
2 2022-09-01 13:49:37 123 CANCEL