I want to create groups of 3 consecutive records within another group (ID) that I already have in the dataset. The data is sorted by that existing group and then by datetime.
I also want to get the maximum datetime within each group of 3 records, as described below.
The example below shows what I need:
Datetime | ID | Group of 3 | Max Time in group of 3 |
---|---|---|---|
2022-08-10 12:12:11 | 0 | 1 | 2022-08-10 12:13:11 |
2022-08-10 12:12:14 | 0 | 1 | 2022-08-10 12:13:11 |
2022-08-10 12:13:11 | 0 | 1 | 2022-08-10 12:13:11 |
2022-08-10 12:15:11 | 0 | 2 | 2022-08-10 12:15:11 |
2022-08-10 12:18:11 | 1 | 3 | 2022-08-10 12:21:11 |
2022-08-10 12:20:11 | 1 | 3 | 2022-08-10 12:21:11 |
2022-08-10 12:21:11 | 1 | 3 | 2022-08-10 12:21:11 |
2022-08-10 12:24:11 | 2 | 4 | 2022-08-10 12:29:11 |
2022-08-10 12:29:11 | 2 | 4 | 2022-08-10 12:29:11 |
2022-08-10 12:34:11 | 3 | 5 | 2022-08-10 12:34:11 |
Any SQL or Pandas solution will be appreciated. I tried using SQL window functions but couldn't get the logic right.
CodePudding user response:
The way I like to approach this is to put the result of the aggregate into a separate dataframe, then merge it back with the original. Note that this assumes the Group of 3 column already exists:
import pandas as pd
df = pd.DataFrame([
['2022-08-10 12:12:11',0,1],
['2022-08-10 12:12:14',0,1],
['2022-08-10 12:13:11',0,1],
['2022-08-10 12:15:11',0,2],
['2022-08-10 12:18:11',1,3],
['2022-08-10 12:20:11',1,3],
['2022-08-10 12:21:11',1,3],
['2022-08-10 12:24:11',2,4],
['2022-08-10 12:29:11',2,4],
['2022-08-10 12:34:11',3,5],
], columns=['Datetime', 'ID', 'Group of 3'])
# aggregate per (ID, Group of 3) so that each block of three gets its own maximum
maxdf = df.groupby(['ID', 'Group of 3'])['Datetime'].max().reset_index().rename(columns={'Datetime':'Max Time in group of 3'})
df = df.merge(maxdf, on=['ID', 'Group of 3'])
df
Datetime ID Group of 3 Max Time in group of 3
0 2022-08-10 12:12:11 0 1 2022-08-10 12:13:11
1 2022-08-10 12:12:14 0 1 2022-08-10 12:13:11
2 2022-08-10 12:13:11 0 1 2022-08-10 12:13:11
3 2022-08-10 12:15:11 0 2 2022-08-10 12:15:11
4 2022-08-10 12:18:11 1 3 2022-08-10 12:21:11
5 2022-08-10 12:20:11 1 3 2022-08-10 12:21:11
6 2022-08-10 12:21:11 1 3 2022-08-10 12:21:11
7 2022-08-10 12:24:11 2 4 2022-08-10 12:29:11
8 2022-08-10 12:29:11 2 4 2022-08-10 12:29:11
9 2022-08-10 12:34:11 3 5 2022-08-10 12:34:11
CodePudding user response:
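Here's a SQL solution that uses row_number() and integer division to bucket each ID's rows into threes, then takes the max value within each bucket. It relies on / performing integer division on integers (as in SQL Server or PostgreSQL); in MySQL, use DIV instead.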
select Datetime
,ID
,max(Datetime) over(partition by id, grp3) as mx_grp3
from (
select *
,(row_number() over(partition by id order by Datetime)-1)/3 as grp3
from t
) t
Datetime | ID | mx_grp3 |
---|---|---|
2022-08-10 12:12:11.000 | 0 | 2022-08-10 12:13:11.000 |
2022-08-10 12:12:14.000 | 0 | 2022-08-10 12:13:11.000 |
2022-08-10 12:13:11.000 | 0 | 2022-08-10 12:13:11.000 |
2022-08-10 12:15:11.000 | 0 | 2022-08-10 12:15:11.000 |
2022-08-10 12:18:11.000 | 1 | 2022-08-10 12:21:11.000 |
2022-08-10 12:20:11.000 | 1 | 2022-08-10 12:21:11.000 |
2022-08-10 12:21:11.000 | 1 | 2022-08-10 12:21:11.000 |
2022-08-10 12:24:11.000 | 2 | 2022-08-10 12:29:11.000 |
2022-08-10 12:29:11.000 | 2 | 2022-08-10 12:29:11.000 |
2022-08-10 12:34:11.000 | 3 | 2022-08-10 12:34:11.000 |
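If you want to try the query without a database server, here is a minimal sketch that runs it through Python's built-in sqlite3 module. It assumes SQLite 3.25 or newer (needed for window functions) and a table named t with just the two columns from the question. The grp_no column, built with dense_rank(), is my own addition to number the groups of 3 globally; it is not part of the query above.

```python
import sqlite3

# Sample rows from the question: (Datetime, ID)
rows = [
    ("2022-08-10 12:12:11", 0),
    ("2022-08-10 12:12:14", 0),
    ("2022-08-10 12:13:11", 0),
    ("2022-08-10 12:15:11", 0),
    ("2022-08-10 12:18:11", 1),
    ("2022-08-10 12:20:11", 1),
    ("2022-08-10 12:21:11", 1),
    ("2022-08-10 12:24:11", 2),
    ("2022-08-10 12:29:11", 2),
    ("2022-08-10 12:34:11", 3),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table t (Datetime text, ID integer)")
conn.executemany("insert into t values (?, ?)", rows)

query = """
select Datetime
      ,ID
      ,dense_rank() over(order by ID, grp3) as grp_no  -- assumed extra column
      ,max(Datetime) over(partition by ID, grp3) as mx_grp3
from (
    select *
          -- 0-based row position within each ID, integer-divided into blocks of 3
          ,(row_number() over(partition by ID order by Datetime) - 1) / 3 as grp3
    from t
) s
order by ID, Datetime
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The timestamps are stored as ISO-formatted text here, so max() compares them as strings, which happens to sort the same way as the actual times.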
CodePudding user response:
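You can create the Max Time column as follows, using each row's position within its ID to form the blocks of three:
# cumcount() gives the 0-based row position within each ID
df['Max Time'] = df.groupby(['ID', df.groupby('ID').cumcount() // 3])['Datetime'].\
transform('max')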
and the Group of 3 column as follows:
# a new group starts at the first row of each distinct (ID, Max Time) pair
df['Group of 3'] = (~df.duplicated(['ID', 'Max Time'])).cumsum()
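For reference, here is a self-contained sketch that builds both columns from only Datetime and ID, using the sample data from the question. It is one possible way to do it, not the only one: the helper name chunk is my own, and I number the groups with ngroup() instead of counting distinct max times, so the numbering does not depend on the max times being unique.

```python
import pandas as pd

df = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2022-08-10 12:12:11", "2022-08-10 12:12:14", "2022-08-10 12:13:11",
        "2022-08-10 12:15:11", "2022-08-10 12:18:11", "2022-08-10 12:20:11",
        "2022-08-10 12:21:11", "2022-08-10 12:24:11", "2022-08-10 12:29:11",
        "2022-08-10 12:34:11",
    ]),
    "ID": [0, 0, 0, 0, 1, 1, 1, 2, 2, 3],
})

# Make sure rows are ordered by ID, then by time, as the question describes
df = df.sort_values(["ID", "Datetime"]).reset_index(drop=True)

# 0-based position of each row within its ID, split into blocks of three
chunk = df.groupby("ID").cumcount() // 3

# Number each (ID, chunk) pair in order of appearance, starting at 1
df["Group of 3"] = df.groupby(["ID", chunk], sort=False).ngroup() + 1

# Broadcast the latest timestamp of each block back onto its rows
df["Max Time in group of 3"] = (
    df.groupby("Group of 3")["Datetime"].transform("max")
)

print(df)
```

Because cumcount() restarts at every ID, a block of three never spans two IDs, and an ID with fewer than three remaining rows simply gets a shorter final block.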