Create groups of 3 records within ID column grouping

I want to create groups of 3 consecutive records within a grouping that already exists in the dataset (the ID column). The data is sorted by that existing group and by datetime. When an ID has fewer than 3 records left, the last group is simply shorter.

I also want the maximum datetime within each group of 3 records.

The example below shows what I need:

Datetime	ID	Group of 3	Max Time in group of 3
2022-08-10 12:12:11	0	1	2022-08-10 12:13:11
2022-08-10 12:12:14	0	1	2022-08-10 12:13:11
2022-08-10 12:13:11	0	1	2022-08-10 12:13:11
2022-08-10 12:15:11	0	2	2022-08-10 12:15:11
2022-08-10 12:18:11	1	3	2022-08-10 12:21:11
2022-08-10 12:20:11	1	3	2022-08-10 12:21:11
2022-08-10 12:21:11	1	3	2022-08-10 12:21:11
2022-08-10 12:24:11	2	4	2022-08-10 12:29:11
2022-08-10 12:29:11	2	4	2022-08-10 12:29:11
2022-08-10 12:34:11	3	5	2022-08-10 12:34:11

Any SQL or Pandas solution will be appreciated. I tried using SQL window functions but couldn't get the logic right.

CodePudding user response：

The way I like to approach this is to compute the aggregate into a separate DataFrame, then merge it back into the original:

import pandas as pd
df = pd.DataFrame([
    ['2022-08-10 12:12:11',0,1],
    ['2022-08-10 12:12:14',0,1],
    ['2022-08-10 12:13:11',0,1],
    ['2022-08-10 12:15:11',0,2],
    ['2022-08-10 12:18:11',1,3],
    ['2022-08-10 12:20:11',1,3],
    ['2022-08-10 12:21:11',1,3],
    ['2022-08-10 12:24:11',2,4],
    ['2022-08-10 12:29:11',2,4],
    ['2022-08-10 12:34:11',3,5],
], columns=['Datetime', 'ID', 'Group of 3'])

# Max per group of 3 (grouping on 'ID' alone would return the max of the whole ID).
maxdf = df.groupby('Group of 3')['Datetime'].max().reset_index().rename(columns={'Datetime':'Max Time in group of 3'})

df = df.merge(maxdf, on=['Group of 3'])
df

              Datetime  ID  Group of 3  Max Time in group of 3
0  2022-08-10 12:12:11   0           1     2022-08-10 12:13:11
1  2022-08-10 12:12:14   0           1     2022-08-10 12:13:11
2  2022-08-10 12:13:11   0           1     2022-08-10 12:13:11
3  2022-08-10 12:15:11   0           2     2022-08-10 12:15:11
4  2022-08-10 12:18:11   1           3     2022-08-10 12:21:11
5  2022-08-10 12:20:11   1           3     2022-08-10 12:21:11
6  2022-08-10 12:21:11   1           3     2022-08-10 12:21:11
7  2022-08-10 12:24:11   2           4     2022-08-10 12:29:11
8  2022-08-10 12:29:11   2           4     2022-08-10 12:29:11
9  2022-08-10 12:34:11   3           5     2022-08-10 12:34:11
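
The Datetime column in the sample frame holds plain strings, so max() is a text comparison that only works because this format happens to sort chronologically. Here is a minimal sketch, assuming you would rather work with real timestamps and make the sort order explicit before grouping. The three-row frame is made up for illustration.

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample for a single ID.
df = pd.DataFrame({
    'Datetime': ['2022-08-10 12:13:11', '2022-08-10 12:12:11', '2022-08-10 12:12:14'],
    'ID': [0, 0, 0],
})

# Parse the strings into timestamps so max() compares times, not text.
df['Datetime'] = pd.to_datetime(df['Datetime'])

# Sort within each ID so that "consecutive" means consecutive in time.
df = df.sort_values(['ID', 'Datetime']).reset_index(drop=True)

latest = df['Datetime'].max()
print(latest)
```

After this step, any of the grouping approaches in this thread can be applied to the sorted frame unchanged.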

CodePudding user response：

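Here's a SQL solution that uses row_number() and integer division to split each ID into groups of three, then takes the max value within each group. It relies on / truncating when both operands are integers, as in SQL Server and PostgreSQL; in dialects where / returns a fraction (MySQL, Oracle), wrap the expression in floor().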

select  Datetime    
       ,ID
       ,max(Datetime) over(partition by id, grp3) as mx_grp3
from   (
       select *
              ,(row_number() over(partition by id order by Datetime)-1)/3 as grp3
       from   t 
       ) t

Datetime	ID	mx_grp3
2022-08-10 12:12:11.000	0	2022-08-10 12:13:11.000
2022-08-10 12:12:14.000	0	2022-08-10 12:13:11.000
2022-08-10 12:13:11.000	0	2022-08-10 12:13:11.000
2022-08-10 12:15:11.000	0	2022-08-10 12:15:11.000
2022-08-10 12:18:11.000	1	2022-08-10 12:21:11.000
2022-08-10 12:20:11.000	1	2022-08-10 12:21:11.000
2022-08-10 12:21:11.000	1	2022-08-10 12:21:11.000
2022-08-10 12:24:11.000	2	2022-08-10 12:29:11.000
2022-08-10 12:29:11.000	2	2022-08-10 12:29:11.000
2022-08-10 12:34:11.000	3	2022-08-10 12:34:11.000
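
To see why the query subtracts 1 before dividing, here is a small sketch of the same arithmetic in plain Python. The row numbers are illustrative and stand in for what row_number() produces inside one ID.

```python
# row_number() starts at 1, so shift to 0-based before integer-dividing by 3.
row_numbers = range(1, 8)
chunks = [(n - 1) // 3 for n in row_numbers]
print(chunks)  # [0, 0, 0, 1, 1, 1, 2]
```

Without the subtraction, rows 1 and 2 would land in chunk 0 and row 3 would already start chunk 1, so the first group would hold only two records.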

CodePudding user response：

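You can create the Max Time column as follows:

# Position of each row within its ID, in chunks of three
# (df.index // 3 would count across IDs and split groups in the wrong place).
chunk = df.groupby('ID').cumcount() // 3
df['Max Time'] = df.groupby(['ID', chunk])['Datetime'].transform('max')

and Group of 3 as follows:

# A new group starts at each first occurrence of a Max Time (assumes maxima are unique per group).
df['Group of 3'] = (~df['Max Time'].duplicated()).cumsum()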
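
Numbering groups from duplicated Max Time values works only while every group has a distinct maximum. Here is a sketch of an alternative that numbers the groups from the grouping keys themselves, using cumcount() and ngroup(). The four-row frame is hypothetical, chosen so that two IDs share the same latest timestamp.

```python
import pandas as pd

# Hypothetical data: IDs 0 and 1 share the same latest timestamp.
df = pd.DataFrame({
    'Datetime': ['2022-08-10 12:12:11', '2022-08-10 12:13:11',
                 '2022-08-10 12:12:30', '2022-08-10 12:13:11'],
    'ID': [0, 0, 1, 1],
})

# Position of each row within its ID, split into chunks of three.
chunk = df.groupby('ID').cumcount() // 3

# Number each (ID, chunk) pair in sorted key order, starting from 1.
df['Group of 3'] = df.groupby(['ID', chunk]).ngroup() + 1
df['Max Time'] = df.groupby('Group of 3')['Datetime'].transform('max')
print(df)
```

Here the two IDs get groups 1 and 2 even though their Max Time values are identical; counting first occurrences of Max Time would put all four rows in group 1.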