Home > Net >  Pandas: Group by contiguous time blocks
Pandas: Group by contiguous time blocks

Time:02-01

I have API access logs with a timestamp user ID and request payload. I want to group by user and contiguous requests within 1 minute of each other and aggregate a count within each block. So if I had:

@timestamp               @data      @id                                                                                                               
2023-01-21 09:46:33.478   ...      Gh8Z4
2023-01-21 09:46:33.690   ...      Gh8Z4
2023-01-21 09:46:34.189   ...      Gh8Z4
2023-01-21 09:48:28.282   ...      Gh8Z4
2023-01-21 09:51:27.652   ...      HVtpG
2023-01-21 09:51:28.682   ...      Gh8Z4
2023-01-21 09:52:17.412   ...      HVtpG

I would like to see something like:

@id           start                    end             count                                                                                                                  
Gh8Z4 2023-01-21 09:46:33.478  2023-01-21 09:46:34.189   3
Gh8Z4 2023-01-21 09:48:28.282  2023-01-21 09:48:28.282   1  
HVtpG 2023-01-21 09:51:27.652  2023-01-21 09:52:17.412   2
Gh8Z4 2023-01-21 09:51:28.682  2023-01-21 09:51:28.682   1    

CodePudding user response:

Use a groupby.agg with a custom grouper (generated with diff and cumsum):

df['@timestamp'] = pd.to_datetime(df['@timestamp'])

group = df.groupby('@id')['@timestamp'].apply(lambda s: s.diff().gt('1min').cumsum())

(df.groupby(['@id', group], as_index=False, sort=False)
   .agg(**{'start': ('@timestamp', 'min'),
           'end': ('@timestamp', 'max'),
           'count': ('@timestamp', 'count'),
          })
)

Output:

     @id                   start                     end  count
0  Gh8Z4 2023-01-21 09:46:33.478 2023-01-21 09:46:34.189      3
1  Gh8Z4 2023-01-21 09:48:28.282 2023-01-21 09:48:28.282      1
2  HVtpG 2023-01-21 09:51:27.652 2023-01-21 09:52:17.412      2
3  Gh8Z4 2023-01-21 09:51:28.682 2023-01-21 09:51:28.682      1
  • Related