pandas group by time intervall with dynamic intervall start-CodePudding

I have a dataframe defined as:

datas = [['A', 51, 'id1', '2020-05-27 05:50:43.346'], ['A', 51, 'id2',
'2020-05-27 05:51:08.347'], ['B', 45, 'id3', '2020-05-24 17:24:05.142'],['B', 45, 'id4', '2020-05-24 17:23:30.141'], ['C', 34,
'id5', '2020-05-23 17:31:10.341']]
df = pd.DataFrame(datas, columns = ['col1', 'col2', 'cold_id',
'dates'])
df['dates'] = pd.to_datetime(df.dates)

looking like this

    col1    col2    cold_id dates
0   A   51  id1 2020-05-27 05:50:43.346
1   A   51  id2 2020-05-27 05:51:08.347
2   B   45  id3 2020-05-24 17:24:05.142
3   B   45  id4 2020-05-24 17:23:30.141
4   C   34  id5 2020-05-23 17:31:10.341

I want to group its rows such that all rows whose date is less than 2 minutes apart form part of one group. I tried the following approach:

df.groupby([pd.Grouper(key='dates', freq='2 min'), 'col1']).agg(','.join).reset_index().sort_values('col1').reset_index(drop=True)

Which yields:

dates   col1    cold_id
0   2020-05-27 05:50:00 A   id1,id2
1   2020-05-24 17:22:00 B   id4
2   2020-05-24 17:24:00 B   id3
3   2020-05-23 17:30:00 C   id5

This is not what I am looking for since rows with id3 and id4 should be within the same group as they are only 30s apart from each other.

My preferred output looks like this:

dates   col1    cold_id
0   2020-05-27 05:50:43.346 A   id1,id2
1   2020-05-24 17:23:30.141 B   id3, id4
3   2020-05-23 17:31:10.341 C   id5

How can it be achieved?

CodePudding user response：

I would use:

funcs = {c: 'first' for c in df}
funcs['cold_id'] = ','.join

out = (df.groupby(df['dates'].sort_values().diff().gt('2min').cumsum(),
                  as_index=False, sort=False)
         .agg(funcs)
      )

Note that this doesn't necessarily group by col1/col2, if you need to add this as grouper:

out = (df.groupby([df['dates'].sort_values().diff().gt('2min').cumsum(),
                  'col1', 'col2'], as_index=False, sort=False)
         .agg(funcs)
      )

Output:

  col1  col2  cold_id                   dates
0    A    51  id1,id2 2020-05-27 05:50:43.346
1    B    45  id3,id4 2020-05-24 17:24:05.142
2    C    34      id5 2020-05-23 17:31:10.341