I have a pandas dataframe (df) that looks like this:
date NTU
450 2021-12-02 13:30:00 3.09
1167 2021-12-05 01:15:00 3.67
1298 2021-12-05 12:10:00 2.12
1317 2021-12-05 13:45:00 2.18
1318 2021-12-05 13:50:00 2.08
1371 2021-12-05 18:15:00 2.36
1372 2021-12-05 18:20:00 2.36
1373 2021-12-05 18:25:00 2.36
1393 2021-12-05 20:05:00 2.12
1394 2021-12-05 20:10:00 2.29
1400 2021-12-05 20:40:00 2.17
1553 2021-12-06 09:25:00 9.97
6440 2021-12-23 08:35:00 5.42
Each row in this dataframe represents a 5-minute interval whose NTU value is > 2.0.
When index values are sequential, the rows were consecutive in the original data, so the event lasted longer than 5 minutes. I want to find the events that lasted longer than 5 minutes and compute each event's total duration.
The result would look like this:
date NTU duration
1318 2021-12-05 13:50:00 2.08 10
1373 2021-12-05 18:25:00 2.36 15
1394 2021-12-05 20:10:00 2.29 10
(the results could be the index, start time of the event and duration too, it really doesn't matter.)
The results are that the event (index 1318) that ended on 2021-12-05 at 13:50 lasted 10 minutes;
the event (index 1373) that ended on 2021-12-05 at 18:25 lasted 15 minutes; and
the event (index 1394) that ended on 2021-12-05 at 20:10 lasted 10 minutes.
Any help in creating the desired output is appreciated.
CodePudding user response:
IIUC, you can use:
# a new group starts wherever the index jump is not exactly 1
group = df.index.to_series().diff().ne(1).cumsum()
(df.groupby(group)
.agg(**{'date': ('date', 'last'),
'NTU': ('NTU', 'last'),
'count': ('NTU', 'size'),
        'duration': ('date', lambda x: (x.max()-x.min()).total_seconds()//60 + 5)
})
.loc[lambda d: d.pop('count').gt(1)]
)
output:
date NTU duration
4 2021-12-05 13:50:00 2.08 10.0
5 2021-12-05 18:25:00 2.36 15.0
6 2021-12-05 20:10:00 2.29 10.0
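For anyone who wants to verify, here is a self-contained sketch that rebuilds the sample frame from the question and runs the same grouping and aggregation (the data values are copied from the question):

```python
import pandas as pd

# Rebuild the example frame; index values copied from the question
df = pd.DataFrame(
    {"date": pd.to_datetime([
        "2021-12-02 13:30:00", "2021-12-05 01:15:00", "2021-12-05 12:10:00",
        "2021-12-05 13:45:00", "2021-12-05 13:50:00", "2021-12-05 18:15:00",
        "2021-12-05 18:20:00", "2021-12-05 18:25:00", "2021-12-05 20:05:00",
        "2021-12-05 20:10:00", "2021-12-05 20:40:00", "2021-12-06 09:25:00",
        "2021-12-23 08:35:00"]),
     "NTU": [3.09, 3.67, 2.12, 2.18, 2.08, 2.36, 2.36, 2.36,
             2.12, 2.29, 2.17, 9.97, 5.42]},
    index=[450, 1167, 1298, 1317, 1318, 1371, 1372, 1373,
           1393, 1394, 1400, 1553, 6440],
)

# A new group starts wherever the index jump is not exactly 1
group = df.index.to_series().diff().ne(1).cumsum()

# Keep the last row of each group; duration = span in minutes + the 5-minute
# interval covered by the first row of the event
out = (df.groupby(group)
         .agg(**{'date': ('date', 'last'),
                 'NTU': ('NTU', 'last'),
                 'count': ('NTU', 'size'),
                 'duration': ('date',
                              lambda x: (x.max() - x.min()).total_seconds() // 60 + 5)})
         .loc[lambda d: d.pop('count').gt(1)])
print(out)
```

Since every row is a 5-minute sample, the grouping could equally be done on the timestamps (`df['date'].diff().gt(pd.Timedelta('5min')).cumsum()`), which would also survive gaps in the index numbering.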
CodePudding user response:
all_sequential = pd.Series(df.index).diff().eq(1)
last_sequential = all_sequential & ~all_sequential.shift(-1, fill_value=False)
df.index[last_sequential]
should produce the index values: [1318, 1373, 1394]
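This only returns the end-of-event indices. A sketch of one way to also recover each event's duration from the same boolean machinery, assuming (as in the question) that every row covers exactly 5 minutes — count the run length of consecutive `True` values and add the event's first row:

```python
import pandas as pd

# Only the index matters for this approach; values copied from the question
idx = pd.Index([450, 1167, 1298, 1317, 1318, 1371, 1372, 1373,
                1393, 1394, 1400, 1553, 6440])

all_sequential = pd.Series(idx).diff().eq(1)
last_sequential = all_sequential & ~all_sequential.shift(-1, fill_value=False)

# Label each run of True values, then count its length with a grouped cumsum
run_id = (~all_sequential).cumsum()
run_len = all_sequential.groupby(run_id).cumsum()

# run_len counts the sequential rows; +1 includes the event's first row,
# and each row stands for a 5-minute sample
duration = (run_len[last_sequential] + 1) * 5

print(list(idx[last_sequential]))  # [1318, 1373, 1394]
print(list(duration))              # [10, 15, 10]
```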