Home > Net >  Pandas filter out events that do not take place within x amount of time of each other
Pandas filter out events that do not take place within x amount of time of each other

Time:11-30

I have a dataframe of events by users, and I want to keep any events that have taken place within a given time (e.g. 2 hours) of each other, that are associated to the same user and the same store. Here is an example dataframe:

user_id timestamp store_id
user_1 2021-11-26T13:40:00.000Z store_1
user_1 2021-11-26T12:20:00.000Z store_1
user_1 2021-11-22T16:10:00.000Z store_1
user_2 2021-11-19T22:00:00.000Z store_2
user_2 2021-11-19T19:50:00.000Z store_2
user_3 2021-11-28T06:10:00.000Z store_1
user_4 2021-11-18T16:30:00.000Z store_3
user_4 2021-11-18T16:20:00.000Z store_2

Applying the filtering, the ouput dataframe should look like this:

user_id timestamp store_id
user_1 2021-11-26T13:40:00.000Z store_1
user_1 2021-11-26T12:20:00.000Z store_1

Because only the first two events by user_1 took place at the same store, by the same user, and within 2 hours of one another. I have been searching through stackoverflow questions, but nothing seems to fit this scenario. Any help would be really appreciated!

EDIT: Following Time difference between two event rows for each user in Pandas df, I am calculating the time difference between rows, grouped by user.

CodePudding user response:

  1. Sort by timestamp:

    df = df.sort_values('timestamp')
    
  2. groupby the user/store and check if the time diff (forward or backward) is within the specified delta:

    (Note that keep is shown here as a column only for explanation purposes. This code doesn't actually add it as a column, though it's also fine to make it a column if preferred.)

    delta = pd.Timedelta('2H')
    
    keep = (df.groupby(['user_id', 'store_id'], sort=False)['timestamp']
        .transform(lambda g: g.diff().abs().le(delta) | g.diff(-1).abs().le(delta)))
    
    #   user_id                  timestamp  store_id   keep
    # 7  user_4  2021-11-18 16:20:00 00:00   store_2  False
    # 6  user_4  2021-11-18 16:30:00 00:00   store_3  False
    # 4  user_2  2021-11-19 19:50:00 00:00   store_2  False
    # 3  user_2  2021-11-19 22:00:00 00:00   store_2  False
    # 2  user_1  2021-11-22 16:10:00 00:00   store_1  False
    # 1  user_1  2021-11-26 12:20:00 00:00   store_1   True
    # 0  user_1  2021-11-26 13:40:00 00:00   store_1   True
    # 5  user_3  2021-11-28 06:10:00 00:00   store_1  False
    
  3. Filter with loc or slicing:

    df.loc[keep]  # or df[keep]
    
    #    user_id                  timestamp  store_id
    # 1   user_1  2021-11-26 12:20:00 00:00   store_1
    # 0   user_1  2021-11-26 13:40:00 00:00   store_1
    
  • Related