Given the following dataframe:
id_ | col |
---|---|
1 | 0 |
2 | 1 |
3 | 0 |
4 | 0 |
5 | 1 |
6 | 0 |
7 | 0 |
I'm looking to create a column that counts the number of rows since the most recent instance of col == 1. The final dataframe should look like:
id_ | col | count |
---|---|---|
1 | 0 | |
2 | 1 | |
3 | 0 | 1 |
4 | 0 | 2 |
5 | 1 | 3 |
6 | 0 | 1 |
7 | 0 | 2 |
I can build a simple looping function but I'm sure there's a clever vectorised way to do this. Any ideas would be most welcome!
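For reference, the simple looping version mentioned above might look something like the sketch below. The function name `rows_since_last_one` is made up for illustration, and it assumes the column holds only 0s and 1s; rows before the first 1 (and the row of the first 1 itself) get `None`.

```python
def rows_since_last_one(values):
    """Return, for each row, the number of rows since the most recent 1.

    Hypothetical helper for illustration only. Rows that have no earlier 1
    get None. A row that itself holds a 1 still reports the distance to the
    previous 1, and the counter restarts on the row after it.
    """
    out = []
    since = None  # None until the first 1 has been seen
    for v in values:
        out.append(since)
        if v == 1:
            since = 1          # the next row is one row after this 1
        elif since is not None:
            since += 1         # one more row since the last 1
    return out


counts = rows_since_last_one([0, 1, 0, 0, 1, 0, 0])
print(counts)
```

This is the baseline that a vectorised version has to reproduce.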
Here's the code to create the dataframe:
from pandas import DataFrame
# Assign the frame to df so it can be used in later code
df = DataFrame(
    {
        'id_': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7},
        'col': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0, 6: 0},
    }
)
Answer:
Here is one way to do it:
# Mask out the zeros, leaving the 1s in place, then number the 1s with cumsum
# (the masked rows stay NaN). Forward-fill so every row carries the number of
# the most recent 1, and shift by one row so each group starts on the row after a 1.
df['count']=df['col'].mask(df['col'].eq(0)).cumsum().ffill().shift(1)
# Number the rows within each of the groups formed above; rows before the first 1 are NaN and are skipped.
# cumcount starts at 0, so add 1 to get the expected result.
df['count']=df[df['count'].notna()].groupby('count')['count'].transform('cumcount') + 1
df
   id_  col  count
0    1    0    NaN
1    2    1    NaN
2    3    0    1.0
3    4    0    2.0
4    5    1    3.0
5    6    0    1.0
6    7    0    2.0
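To make the chain above easier to follow, here is a minimal sketch that splits it into named steps on the same data. The intermediate names `block` and `valid` are only for illustration, and it uses `groupby(...).cumcount()` directly rather than `transform('cumcount')`, which should be equivalent here.

```python
import pandas as pd

df = pd.DataFrame({
    "id_": [1, 2, 3, 4, 5, 6, 7],
    "col": [0, 1, 0, 0, 1, 0, 0],
})

# Step 1: hide the zeros (they become NaN) and number the remaining 1s.
block = df["col"].mask(df["col"].eq(0)).cumsum()

# Step 2: carry each number forward, then shift down one row so that a row
# holding a 1 still belongs to the block opened by the previous 1.
block = block.ffill().shift(1)

# Step 3: rows with no earlier 1 have no block; leave them out explicitly.
valid = block.notna()

# Step 4: number the rows inside each block. cumcount starts at 0, so add 1.
# Assignment aligns on the index, so the left-out rows end up as NaN.
df["count"] = df[valid].groupby(block[valid]).cumcount() + 1

print(df)
```

Filtering with `valid` before grouping avoids relying on how `groupby` treats NaN keys, which is why the rows before the first 1 are dropped first and restored as NaN by index alignment.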