Pandas group check if timestamp for certain value is earlier than timestamp for other certain value-CodePudding

I'm looking at a dataframe full of orders, which have timestamps, order is and an operation status for inserting/updating/canceling the order. The dataframe looks as follows:

Index	TIMESTAMP	ORDER_ID	OPERATION
0	2022-09-01 13:49:00.000	123	INSERT
1	2022-09-01 13:49:25.000	123	UPDATE
2	2022-09-01 13:49:37.000	123	CANCEL
3	2022-09-01 13:49:48.000	456	CANCEL
4	2022-09-01 13:49:49.000	456	UPDATE
5	2022-09-01 13:49:50.000	456	INSERT

As part of my data analysis, I would like to clean up the data to filter out "flawed" orders. I was able to filter out operations that are missing an insert/update/cancel using df.groupby('ORDER_ID').filter(lambda x: (x.OPERATION == "CANCEL").any()) (repeating the code with the different operations, not efficient).

Next, I would like to filter out those orders in which the "ordering" (i.e., time when they occur) of the operations is wrong, e.g., when a CANCEL (or UPDATE) occurs before an INSERT.

How could I filter out those cases? It should check the TIMESTAMP of an CANCEL operation for that order id group, and check if that timestamp is earlier than the TIMESTAMP of an INSERT operation. Ideally, the solution should be handle cases when there are multiple of the same OPERATION for the same order ID, e.g., in some cases the OPERATIONS are occuring more than once (at different timestamps).

My idea so far: df.groupby(["ORDER_ID", "OPERATION"])['TIMESTAMP'].first() gives me a table that shows the first occurences of each different OPERATION for the ORDER_IDs. Now I would just need some way to compare those returned values inside.

CodePudding user response：

Convert column to datetimes, replace not matched datimes to missing values in Series.where and get minimal datetime per groups, last compare both columns:

df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])

df1 = (df.assign(s1 = df['TIMESTAMP'].where(df['OPERATION'].eq('INSERT')),
                 s2 = df['TIMESTAMP'].where(df['OPERATION'].isin(['UPDATE','CANCEL'])))
          .groupby('ORDER_ID')[['s1','s2']]
          .transform('min'))


df = df[df1.s1.lt(df1.s2)]
print (df)
            TIMESTAMP  ORDER_ID OPERATION
0 2022-09-01 13:49:00       123    INSERT
1 2022-09-01 13:49:25       123    UPDATE
2 2022-09-01 13:49:37       123    CANCEL