Create groups of 3 records within ID column grouping

Time:10-14

I want to split the rows of each existing group (the ID column) into groups of 3 consecutive records. The data is sorted by ID and then by datetime.

I also want the maximum datetime within each group of 3.

The example below shows what I need:

Datetime             ID  Group of 3  Max Time in group of 3
2022-08-10 12:12:11  0   1           2022-08-10 12:13:11
2022-08-10 12:12:14  0   1           2022-08-10 12:13:11
2022-08-10 12:13:11  0   1           2022-08-10 12:13:11
2022-08-10 12:15:11  0   2           2022-08-10 12:15:11
2022-08-10 12:18:11  1   3           2022-08-10 12:21:11
2022-08-10 12:20:11  1   3           2022-08-10 12:21:11
2022-08-10 12:21:11  1   3           2022-08-10 12:21:11
2022-08-10 12:24:11  2   4           2022-08-10 12:29:11
2022-08-10 12:29:11  2   4           2022-08-10 12:29:11
2022-08-10 12:34:11  3   5           2022-08-10 12:34:11

Any SQL or Pandas solution will be appreciated. I tried using SQL window functions but couldn't get the logic right.

CodePudding user response:

The way I like to approach this is to put the result of the aggregate into a separate dataframe, then merge it back into the original:

import pandas as pd
df = pd.DataFrame([
    ['2022-08-10 12:12:11',0,1],
    ['2022-08-10 12:12:14',0,1],
    ['2022-08-10 12:13:11',0,1],
    ['2022-08-10 12:15:11',0,2],
    ['2022-08-10 12:18:11',1,3],
    ['2022-08-10 12:20:11',1,3],
    ['2022-08-10 12:21:11',1,3],
    ['2022-08-10 12:24:11',2,4],
    ['2022-08-10 12:29:11',2,4],
    ['2022-08-10 12:34:11',3,5],
], columns=['Datetime', 'ID', 'Group of 3'])

# Maximum Datetime per 'Group of 3' (the strings are ISO-formatted, so max() orders them correctly)
maxdf = (df.groupby('Group of 3')['Datetime'].max()
           .reset_index()
           .rename(columns={'Datetime': 'Max Time in group of 3'}))

# Merge the per-group maximum back onto every row of its group
df = df.merge(maxdf, on=['Group of 3'])
df

              Datetime  ID  Group of 3 Max Time in group of 3
0  2022-08-10 12:12:11   0           1    2022-08-10 12:13:11
1  2022-08-10 12:12:14   0           1    2022-08-10 12:13:11
2  2022-08-10 12:13:11   0           1    2022-08-10 12:13:11
3  2022-08-10 12:15:11   0           2    2022-08-10 12:15:11
4  2022-08-10 12:18:11   1           3    2022-08-10 12:21:11
5  2022-08-10 12:20:11   1           3    2022-08-10 12:21:11
6  2022-08-10 12:21:11   1           3    2022-08-10 12:21:11
7  2022-08-10 12:24:11   2           4    2022-08-10 12:29:11
8  2022-08-10 12:29:11   2           4    2022-08-10 12:29:11
9  2022-08-10 12:34:11   3           5    2022-08-10 12:34:11

CodePudding user response:

Here's a SQL solution that uses row_number() and integer division to group the rows in threes within each ID, then takes the max value for each group.

select Datetime
      ,ID
      ,max(Datetime) over(partition by id, grp3) as mx_grp3
from   (
       select *
              -- rows 1-3 of each id -> 0, rows 4-6 -> 1, ...
              -- relies on integer division (in MySQL use DIV instead of /)
              ,(row_number() over(partition by id order by Datetime)-1)/3 as grp3
       from   t
       ) t

Datetime                 ID  mx_grp3
2022-08-10 12:12:11.000  0   2022-08-10 12:13:11.000
2022-08-10 12:12:14.000  0   2022-08-10 12:13:11.000
2022-08-10 12:13:11.000  0   2022-08-10 12:13:11.000
2022-08-10 12:15:11.000  0   2022-08-10 12:15:11.000
2022-08-10 12:18:11.000  1   2022-08-10 12:21:11.000
2022-08-10 12:20:11.000  1   2022-08-10 12:21:11.000
2022-08-10 12:21:11.000  1   2022-08-10 12:21:11.000
2022-08-10 12:24:11.000  2   2022-08-10 12:29:11.000
2022-08-10 12:29:11.000  2   2022-08-10 12:29:11.000
2022-08-10 12:34:11.000  3   2022-08-10 12:34:11.000
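
The query returns the per-group maximum but not the running "Group of 3" number from the question. As a rough sketch, the same query can be tried from Python against an in-memory SQLite database, with a dense_rank() column added for that number. The group_of_3 column name and the s alias are my own, the datetimes are stored as ISO text, and it assumes a SQLite build with window functions (3.25 or later).

```python
import sqlite3

rows = [
    ('2022-08-10 12:12:11', 0), ('2022-08-10 12:12:14', 0),
    ('2022-08-10 12:13:11', 0), ('2022-08-10 12:15:11', 0),
    ('2022-08-10 12:18:11', 1), ('2022-08-10 12:20:11', 1),
    ('2022-08-10 12:21:11', 1), ('2022-08-10 12:24:11', 2),
    ('2022-08-10 12:29:11', 2), ('2022-08-10 12:34:11', 3),
]

con = sqlite3.connect(':memory:')
con.execute('create table t (Datetime text, ID integer)')
con.executemany('insert into t values (?, ?)', rows)

query = """
select Datetime
      ,ID
      -- one running number per (ID, grp3) pair, in ID order
      ,dense_rank() over(order by ID, grp3) as group_of_3
      ,max(Datetime) over(partition by ID, grp3) as mx_grp3
from   (
       select *
              -- integer division: rows 1-3 of each ID -> 0, rows 4-6 -> 1, ...
              ,(row_number() over(partition by ID order by Datetime) - 1) / 3 as grp3
       from   t
       ) s
order by ID, Datetime
"""

result = con.execute(query).fetchall()
for row in result:
    print(row)
```

Each printed tuple is (Datetime, ID, group_of_3, mx_grp3). Other engines may need small changes, for example where / is not integer division.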


CodePudding user response:

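You can create the Max Time column as follows, using cumcount() to number the rows within each ID so that every ID starts a fresh group of 3:

df['Max Time'] = df.groupby(['ID', df.groupby('ID').cumcount() // 3])['Datetime'].\
                     transform('max')

and Group of 3 as follows (this assumes no two groups share the same maximum time):

df['Group of 3'] = (~df['Max Time'].duplicated(keep='first')).cumsum()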
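
For a self-contained check, here is a minimal sketch that rebuilds the sample data and derives both columns from ID and Datetime alone. It numbers the rows within each ID with cumcount() and labels the groups with ngroup(), so it does not depend on the row index or on the maximum times being unique. The helper column name chunk is my own and is dropped at the end.

```python
import pandas as pd

df = pd.DataFrame({
    'Datetime': pd.to_datetime([
        '2022-08-10 12:12:11', '2022-08-10 12:12:14', '2022-08-10 12:13:11',
        '2022-08-10 12:15:11', '2022-08-10 12:18:11', '2022-08-10 12:20:11',
        '2022-08-10 12:21:11', '2022-08-10 12:24:11', '2022-08-10 12:29:11',
        '2022-08-10 12:34:11',
    ]),
    'ID': [0, 0, 0, 0, 1, 1, 1, 2, 2, 3],
})

# Make sure rows are ordered by ID, then by time, as the question assumes
df = df.sort_values(['ID', 'Datetime']).reset_index(drop=True)

# 0-based position of each row inside its ID, integer-divided into chunks of 3
df['chunk'] = df.groupby('ID').cumcount() // 3

# One running label per (ID, chunk) pair, starting at 1
df['Group of 3'] = df.groupby(['ID', 'chunk'], sort=False).ngroup() + 1

# Broadcast each group's latest Datetime back onto its rows
df['Max Time in group of 3'] = (
    df.groupby(['ID', 'chunk'])['Datetime'].transform('max')
)

df = df.drop(columns='chunk')
print(df)
```

Because the grouping key is built per ID, an ID with 4 rows yields a full group of 3 followed by a group of 1, as in the expected output from the question.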