Home > Enterprise >  Count of Contiguously Preceding Rows with Specific Value within Time Window in Pandas
Count of Contiguously Preceding Rows with Specific Value within Time Window in Pandas

Time:10-16

really struggling with this pandas task for a research project.

I have a dataframe df that has two columns: time (a datetime column) and result (a boolean column). I want to count the streak of TRUE rows immediately preceding the current row with a look-back window of 7 days.

For example:
If the preceding row is false, then the count is 0
If the preceding row is true, then I want to know what the streak of true rows was within the 7 day period preceding that row.

Example of expected output below.

time result DESIRED OUTPUT
5/1/21 TRUE 0 (no preceding rows)
5/6/21 TRUE 1
5/8/21 FALSE 2 (immediately preceded by streak of 2 TRUE rows in past 7 days)
5/10/21 FALSE 0
5/11/21 TRUE 0
5/14/21 TRUE 1 (preceding row is TRUE)
5/20/21 TRUE 1 (immediately preceded by streak of one TRUE rows in 1 week window)
5/21/21 TRUE 2 (immediately preceded by streak of two TRUE rows in 1 week window)
5/22/21 TRUE 2 (immediately preceded by streak of two TRUE rows in 1 week window)
5/23/21 FALSE 3 (immediately preceded by streak of three TRUE rows in 1 week window)
5/24/21 TRUE 0 (preceded by FALSE row)
5/26/21 TRUE 1 (immediately preceded by streak of 1 TRUE row)

I have been scouring Stack Overflow and racking my brain for days but just can't figure out a way to do this well. The trick with shift and groupby, e.g. df * (df.groupby((df != df.shift()).cumsum()).cumcount()) would work perfectly except it doesn't take into account the 7-day lookback window and the data is sampled irregularly, so I can't make assumptions on how many rows appear there will be in a 7-day period.

Thank you all so much for your time and help!

CodePudding user response:

I believe you were on the right track. This results of runnning example code

Please notice I converted your date column into a datetime index, which I think is required for this to work. You could always make a small, temporary dataframe to do this, if you don't want to convert the whole thing.

Using sum() works because summing a boolean column gives you the number of True values in it.

  • Related