I have a dataframe of events by users, and I want to keep any events that have taken place within a given time (e.g. 2 hours) of each other, that are associated to the same user and the same store. Here is an example dataframe:
user_id | timestamp | store_id |
---|---|---|
user_1 | 2021-11-26T13:40:00.000Z | store_1 |
user_1 | 2021-11-26T12:20:00.000Z | store_1 |
user_1 | 2021-11-22T16:10:00.000Z | store_1 |
user_2 | 2021-11-19T22:00:00.000Z | store_2 |
user_2 | 2021-11-19T19:50:00.000Z | store_2 |
user_3 | 2021-11-28T06:10:00.000Z | store_1 |
user_4 | 2021-11-18T16:30:00.000Z | store_3 |
user_4 | 2021-11-18T16:20:00.000Z | store_2 |
Applying the filtering, the ouput dataframe should look like this:
user_id | timestamp | store_id |
---|---|---|
user_1 | 2021-11-26T13:40:00.000Z | store_1 |
user_1 | 2021-11-26T12:20:00.000Z | store_1 |
Because only the first two events by user_1
took place at the same store, by the same user, and within 2 hours of one another. I have been searching through stackoverflow questions, but nothing seems to fit this scenario. Any help would be really appreciated!
EDIT: Following Time difference between two event rows for each user in Pandas df, I am calculating the time difference between rows, grouped by user.
CodePudding user response:
Sort by
timestamp
:df = df.sort_values('timestamp')
groupby
the user/store and check if the timediff
(forward or backward) is within the specifieddelta
:(Note that
keep
is shown here as a column only for explanation purposes. This code doesn't actually add it as a column, though it's also fine to make it a column if preferred.)delta = pd.Timedelta('2H') keep = (df.groupby(['user_id', 'store_id'], sort=False)['timestamp'] .transform(lambda g: g.diff().abs().le(delta) | g.diff(-1).abs().le(delta))) # user_id timestamp store_id keep # 7 user_4 2021-11-18 16:20:00 00:00 store_2 False # 6 user_4 2021-11-18 16:30:00 00:00 store_3 False # 4 user_2 2021-11-19 19:50:00 00:00 store_2 False # 3 user_2 2021-11-19 22:00:00 00:00 store_2 False # 2 user_1 2021-11-22 16:10:00 00:00 store_1 False # 1 user_1 2021-11-26 12:20:00 00:00 store_1 True # 0 user_1 2021-11-26 13:40:00 00:00 store_1 True # 5 user_3 2021-11-28 06:10:00 00:00 store_1 False
Filter with
loc
or slicing:df.loc[keep] # or df[keep] # user_id timestamp store_id # 1 user_1 2021-11-26 12:20:00 00:00 store_1 # 0 user_1 2021-11-26 13:40:00 00:00 store_1