I have a table like this
Unit | status | date |
---|---|---|
One | 1 | 1 |
One | 1 | 2 |
One | 1 | 3 |
One | 0 | 4 |
One | 0 | 5 |
One | 1 | 6 |
One | 1 | 7 |
and I want to create a new column containing the length of each consecutive run of zeros in the status
column. So for that example, the output would be
Unit | status | date | gap |
---|---|---|---|
One | 1 | 1 | 0 |
One | 1 | 2 | 0 |
One | 1 | 3 | 0 |
One | 0 | 4 | 2 |
One | 0 | 5 | 2 |
One | 1 | 6 | 0 |
One | 1 | 7 | 0 |
This would have to be done for all the units in the DataFrame. I was basing my approach on this question, but I'm stuck at the part where I assign the total run length to all the rows that are part of the gap.
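For reference, a minimal DataFrame matching the example can be built like this (just a sketch of the sample data, assuming plain integer columns):

import pandas as pd

# sample data from the question; 'gap' is the column to be computed
df = pd.DataFrame({'Unit': ['One'] * 7,
                   'status': [1, 1, 1, 0, 0, 1, 1],
                   'date': [1, 2, 3, 4, 5, 6, 7]})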
CodePudding user response:
The usual way to group blocks of one value is to take the cumulative sum of the other values: here the cumsum of status only increases on the 1s, so every row in a run of 0s shares the same key. Given that your data is sorted by Unit:
# group rows by Unit, status and the cumsum block key, take each group's size,
# and keep that size only where status is 0
df['gap'] = (df.groupby(['Unit', 'status', df['status'].cumsum()])
               ['status'].transform('size')
               .where(df['status'].eq(0), other=0)
             )
Output:
Unit status date gap
0 One 1 1 0
1 One 1 2 0
2 One 1 3 0
3 One 0 4 2
4 One 0 5 2
5 One 1 6 0
6 One 1 7 0
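To see why the cumsum acts as a block key, inspect it on the sample data (illustrative only):

df['status'].cumsum()
# 0    1
# 1    2
# 2    3
# 3    3
# 4    3
# 5    4
# 6    5
# the sum stays flat across the 0s, so rows 3 and 4 share the key 3;
# including 'status' in the groupby keys keeps row 2 (status 1, also key 3)
# in a separate group, so rows 3-4 form their own group of size 2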
CodePudding user response:
Another approach could be to use run-length encoding via the python-rle package:
import rle

# rle.encode returns (values, counts); keep the run length for 0-runs, 0 otherwise
r = rle.encode(df.status)
df['gap'] = rle.decode([r[1][x] if r[0][x] == 0 else 0 for x in range(len(r[0]))],
                       r[1])
Output:
Unit status date gap
0 One 1 1 0
1 One 1 2 0
2 One 1 3 0
3 One 0 4 2
4 One 0 5 2
5 One 1 6 0
6 One 1 7 0
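For reference, the intermediate run-length encoding of the status column looks roughly like this (assuming the package returns a (values, counts) pair, which is what the code above relies on):

rle.encode(df.status)  # -> ([1, 0, 1], [3, 2, 2])
# the list comprehension keeps the run length only for the 0-runs: [0, 2, 0]
# rle.decode([0, 2, 0], [3, 2, 2]) then expands it to [0, 0, 0, 2, 2, 0, 0]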