Home > Enterprise >  pandas consecutive Boolean event rollup time series
pandas consecutive Boolean event rollup time series

Time:11-16

Here's some made up time series data on 1 minute intervals:

import pandas as pd
import numpy as np
import random

random.seed(5)


rows,cols = 8760,3
data = np.random.rand(rows,cols) 
tidx = pd.date_range('2019-01-01', periods=rows, freq='1T') 
df = pd.DataFrame(data, columns=['condition1','condition2','condition3'], index=tidx)

This is just some code to create some Boolean columns

df['condition1_bool'] = df['condition1'].lt(.1)
df['condition2_bool'] = df['condition2'].lt(df['condition1']) & df['condition2'].gt(df['condition3'])
df['condition3_bool'] = df['condition3'].gt(.9)

df = df[['condition1_bool','condition2_bool','condition3_bool']]
df = df.astype(int)

On my screen this prints:

condition1_bool condition2_bool condition3_bool
2019-01-01 00:00:00 0   0   0
2019-01-01 00:01:00 0   0   1   <---- Count as same event!
2019-01-01 00:02:00 0   0   1   <---- Count as same event!
2019-01-01 00:03:00 1   0   0
2019-01-01 00:04:00 0   0   0

What I am trying to figure out is how to rollup per hour cumulative events (True or 1) but if there is no 0 between events, its the same event! Hopefully that makes sense what I was describing above on the <---- Count as same event!

If I do:

df = df.resample('H').sum()

This will just resample and count all events, right regardless of the time series commitment I was trying to highlight with the <---- Count as same event!

Thanks for any tips!!

CodePudding user response:

Check if the current row ("2019-01-01 00:02:00") equals to 1 and check if the previous row ("2019-01-01 00:01:00") is not equal to 1. This removes consecutive 1 of the sum.

>>> df.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())

                     condition1_bool  condition2_bool  condition3_bool
2019-01-01 00:00:00                4                8                4
2019-01-01 01:00:00                9                7                6
2019-01-01 02:00:00                7               14                4
2019-01-01 03:00:00                2                8                7
2019-01-01 04:00:00                4                9                5
...                              ...              ...              ...
2019-01-06 21:00:00                4                8                2
2019-01-06 22:00:00                3               11                4
2019-01-06 23:00:00                6               11                4
2019-01-07 00:00:00                8                7                8
2019-01-07 01:00:00                4                9                6

[146 rows x 3 columns]

Using your code:

>>> df.resample('H').sum()

                     condition1_bool  condition2_bool  condition3_bool
2019-01-01 00:00:00                5                8                5
2019-01-01 01:00:00                9                8                6
2019-01-01 02:00:00                7               14                5
2019-01-01 03:00:00                2                9                7
2019-01-01 04:00:00                4               11                5
...                              ...              ...              ...
2019-01-06 21:00:00                5               11                3
2019-01-06 22:00:00                3               15                4
2019-01-06 23:00:00                6               12                4
2019-01-07 00:00:00                8                7               10
2019-01-07 01:00:00                4                9                7

[146 rows x 3 columns]

Check:

dti = pd.date_range('2021-11-15 21:00:00', '2021-11-15 22:00:00', 
                    closed='left', freq='T')

df1 = pd.DataFrame({'c1': 1}, index=dti)
>>> df1.resample('H').apply(lambda x: (x.eq(1) & x.shift().ne(1)).sum())
                     c1
2021-11-15 21:00:00   1

>>> df1.resample('H').sum()
                     c1
2021-11-15 21:00:00  60
  • Related