I have API access logs with a timestamp, user ID, and request payload. I want to group by user and by blocks of contiguous requests that fall within 1 minute of each other, then aggregate a count within each block. So if I had:
@timestamp @data @id
2023-01-21 09:46:33.478 ... Gh8Z4
2023-01-21 09:46:33.690 ... Gh8Z4
2023-01-21 09:46:34.189 ... Gh8Z4
2023-01-21 09:48:28.282 ... Gh8Z4
2023-01-21 09:51:27.652 ... HVtpG
2023-01-21 09:51:28.682 ... Gh8Z4
2023-01-21 09:52:17.412 ... HVtpG
I would like to see something like:
@id start end count
Gh8Z4 2023-01-21 09:46:33.478 2023-01-21 09:46:34.189 3
Gh8Z4 2023-01-21 09:48:28.282 2023-01-21 09:48:28.282 1
HVtpG 2023-01-21 09:51:27.652 2023-01-21 09:52:17.412 2
Gh8Z4 2023-01-21 09:51:28.682 2023-01-21 09:51:28.682 1
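For anyone who wants to reproduce this, the sample above can be built as a DataFrame roughly like so (the @data payloads are elided in the log excerpt, so a placeholder stands in for them):

import pandas as pd

df = pd.DataFrame({
    '@timestamp': ['2023-01-21 09:46:33.478', '2023-01-21 09:46:33.690',
                   '2023-01-21 09:46:34.189', '2023-01-21 09:48:28.282',
                   '2023-01-21 09:51:27.652', '2023-01-21 09:51:28.682',
                   '2023-01-21 09:52:17.412'],
    '@data': ['...'] * 7,   # payload elided in the sample
    '@id': ['Gh8Z4', 'Gh8Z4', 'Gh8Z4', 'Gh8Z4', 'HVtpG', 'Gh8Z4', 'HVtpG'],
})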
CodePudding user response:
Use a groupby.agg with a custom grouper (generated with diff and cumsum):
import pandas as pd

# make sure the timestamps are real datetimes
df['@timestamp'] = pd.to_datetime(df['@timestamp'])

# per id, flag gaps over 1 minute and turn them into incrementing block labels
group = df.groupby('@id', group_keys=False)['@timestamp'].apply(lambda s: s.diff().gt('1min').cumsum())

# group by id and block, then aggregate the start, end and count of each block
(df.groupby(['@id', group], as_index=False, sort=False)
   .agg(**{'start': ('@timestamp', 'min'),
           'end': ('@timestamp', 'max'),
           'count': ('@timestamp', 'count'),
          })
)
Output:
@id start end count
0 Gh8Z4 2023-01-21 09:46:33.478 2023-01-21 09:46:34.189 3
1 Gh8Z4 2023-01-21 09:48:28.282 2023-01-21 09:48:28.282 1
2 HVtpG 2023-01-21 09:51:27.652 2023-01-21 09:52:17.412 2
3 Gh8Z4 2023-01-21 09:51:28.682 2023-01-21 09:51:28.682 1
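The custom grouper works because the block labels only need to be unique within each @id: diff gives the gap to the previous request of the same user, gt('1min') flags the rows that open a new block, and cumsum turns those flags into incrementing labels (0, 1, 2, ... per id). If you want to inspect the labels before aggregating, something along these lines works (the block column name here is only for illustration):

df.assign(block=group)

For the sample data the Gh8Z4 rows get labels 0, 0, 0, 1, 2 and the two HVtpG rows both get 0, which matches the blocking shown in the output above.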