Adding time of sequential events in pandas dataframe-CodePudding

I have a pandas dataframe (df) that looks like this:

                      date   NTU
450     2021-12-02 13:30:00 3.09
1167    2021-12-05 01:15:00 3.67
1298    2021-12-05 12:10:00 2.12
1317    2021-12-05 13:45:00 2.18
1318    2021-12-05 13:50:00 2.08
1371    2021-12-05 18:15:00 2.36
1372    2021-12-05 18:20:00 2.36
1373    2021-12-05 18:25:00 2.36
1393    2021-12-05 20:05:00 2.12
1394    2021-12-05 20:10:00 2.29
1400    2021-12-05 20:40:00 2.17
1553    2021-12-06 09:25:00 9.97
6440    2021-12-23 08:35:00 5.42

Every row in this dataframe contains NTU values > 2.0 and lasts for 5 minutes.

When the index values are sequential, the event lasted for more than 5 minutes. I want to find the events that lasted longer than 5 minutes and sum the length of the event.

The result would look like this:

                      date   NTU  duration
1318    2021-12-05 13:50:00 2.08        10
1373    2021-12-05 18:25:00 2.36        15
1394    2021-12-05 20:10:00 2.29        10

(the results could be the index, start time of the event and duration too, it really doesn't matter.)

The results are that the event (index 1318) that ended on 2021-12-05 at 13:50 lasted 10 minutes;

the event (index 1373) that ended on 2021-12-05 at 18:25 lasted 15 minutes; and

the event (index 1394) that ended on 2021-12-05 at 20:10 lasted 10 minutes.

Any help in creating the desired output is appreciated.

CodePudding user response：

IIUC, you can use:

group = df.index.to_series().diff().ne(1).cumsum()
(df.groupby(group)
   .agg(**{'date': ('date', 'last'),
           'NTU': ('NTU', 'last'),
           'count': ('NTU', 'size'),
           'duration': ('date', lambda x: (x.max()-x.min()).total_seconds()//60 5)
           })
   .loc[lambda d: d.pop('count').gt(1)]
)

output:

                 date   NTU  duration
4 2021-12-05 13:50:00  2.08      10.0
5 2021-12-05 18:25:00  2.36      15.0
6 2021-12-05 20:10:00  2.29      10.0

CodePudding user response：

all_sequential = pd.Series(df.index).diff().eq(1)
last_sequential = all_sequential & ~all_sequential.shift(-1, fill_value=False)

df.index[last_sequential]

should produce the index values: [1318, 1373, 1394]