I have a dataframe defined as:
import pandas as pd

datas = [['A', 51, 'id1', '2020-05-27 05:50:43.346'],
         ['A', 51, 'id2', '2020-05-27 05:51:08.347'],
         ['B', 45, 'id3', '2020-05-24 17:24:05.142'],
         ['B', 45, 'id4', '2020-05-24 17:23:30.141'],
         ['C', 34, 'id5', '2020-05-23 17:31:10.341']]
df = pd.DataFrame(datas, columns=['col1', 'col2', 'cold_id', 'dates'])
df['dates'] = pd.to_datetime(df.dates)
which looks like this:
  col1  col2 cold_id                   dates
0    A    51     id1 2020-05-27 05:50:43.346
1    A    51     id2 2020-05-27 05:51:08.347
2    B    45     id3 2020-05-24 17:24:05.142
3    B    45     id4 2020-05-24 17:23:30.141
4    C    34     id5 2020-05-23 17:31:10.341
I want to group its rows so that all rows whose dates are less than 2 minutes apart fall into the same group. I tried the following approach:
(df.groupby([pd.Grouper(key='dates', freq='2 min'), 'col1'])
   .agg(','.join)
   .reset_index()
   .sort_values('col1')
   .reset_index(drop=True))
Which yields:
                dates col1  cold_id
0 2020-05-27 05:50:00    A  id1,id2
1 2020-05-24 17:22:00    B      id4
2 2020-05-24 17:24:00    B      id3
3 2020-05-23 17:30:00    C      id5
This is not what I am looking for, since the rows with id3 and id4 should fall into the same group: their timestamps are only about 35 seconds apart.
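As far as I can tell, pd.Grouper(freq='2 min') resamples into fixed, clock-aligned 2-minute bins rather than comparing consecutive rows, so timestamps 35 seconds apart can still straddle a bin boundary. Flooring the dates to the bin edges shows this:

# dt.floor('2min') shows the fixed bin each timestamp falls into:
# id4 (17:23:30) floors to 17:22:00, id3 (17:24:05) to 17:24:00,
# hence they land in different bins.
print(df.assign(bin=df['dates'].dt.floor('2min'))[['cold_id', 'dates', 'bin']])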
My preferred output looks like this:
                    dates col1  cold_id
0 2020-05-27 05:50:43.346    A  id1,id2
1 2020-05-24 17:23:30.141    B  id3,id4
2 2020-05-23 17:31:10.341    C      id5
How can it be achieved?
CodePudding user response:
I would use:
# aggregate every column with 'first', except cold_id, which gets joined
funcs = {c: 'first' for c in df}
funcs['cold_id'] = ','.join

# sort the dates, flag gaps greater than 2 minutes, and take the
# cumulative sum of the flags to obtain a group label per row
out = (df.groupby(df['dates'].sort_values().diff().gt('2min').cumsum(),
                  as_index=False, sort=False)
         .agg(funcs)
      )
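To see what that grouper looks like on the example data: the diffs are computed in sorted order, but the resulting Series keeps the original index, which is how groupby aligns it back to df.

key = df['dates'].sort_values().diff().gt('2min').cumsum()
print(key)
# 4    0   <- id5, earliest timestamp, starts group 0
# 3    1   <- id4, more than 2 minutes after id5: new group
# 2    1   <- id3, only ~35s after id4: same group
# 0    2   <- id1, days later: new group
# 1    2   <- id2, ~25s after id1: same group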
Note that this doesn't necessarily group by col1/col2; if you need that, add them as groupers:
out = (df.groupby([df['dates'].sort_values().diff().gt('2min').cumsum(),
                   'col1', 'col2'], as_index=False, sort=False)
         .agg(funcs)
      )
Output:
  col1  col2  cold_id                   dates
0    A    51  id1,id2 2020-05-27 05:50:43.346
1    B    45  id3,id4 2020-05-24 17:24:05.142
2    C    34      id5 2020-05-23 17:31:10.341
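To illustrate why the extra groupers can matter, here is a hypothetical row (id6, not part of the original data) whose timestamp is about 10 seconds after id2 but whose col1/col2 differ:

# hypothetical extra row, close in time to id2 but with different col1/col2
extra = pd.DataFrame([['D', 99, 'id6', '2020-05-27 05:51:18.347']],
                     columns=df.columns)
df2 = pd.concat([df, extra], ignore_index=True)
df2['dates'] = pd.to_datetime(df2.dates)

key2 = df2['dates'].sort_values().diff().gt('2min').cumsum()

# time gaps only: id6 is folded into the A group (cold_id 'id1,id2,id6')
print(df2.groupby(key2, as_index=False, sort=False).agg(funcs))

# time gaps plus col1/col2: id6 stays in its own group
print(df2.groupby([key2, 'col1', 'col2'], as_index=False, sort=False).agg(funcs))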