I have a pandas data frame with a column where a condition is met based on other elements in the data frame (not shown). Additionally, I have a column that extends the validness one row further with the following rule:
If a valid row is followed directly by ExtendsValid, that row is also valid, even if the underlying valid condition doesnt apply. Continue filling valid forward as long as ExtendsValid is 1
I have illustrated the result in column "FinalValid" (desired result. Doesnt have to be a new column, can also fill Valid forward). Note that rows 8 and 9 in the example also become valid. Also note that row 13 does NOT result in FinalValid, because you need a preceding valid row. Preceding valid row can be Valid or an extended valid row.
So far if I had a problem like that I did a cumbersome multi-step process:
- Create a new column for when "Valid" or "ExtendsValid" is true
- Create a new column marking the start point for each "sub-series" (a consecutive set of ones)
- Number each sub-series
- fillna using "group by" for each sub series
I can provide sample code but I am really looking for a totally different, more efficient approach, which of course must be non-iterating as well.
Any ideas would be welcome.
# | Valid | ExtendsValid | FinalValid |
---|---|---|---|
1 | 0 | 0 | 0 |
2 | 1 | 0 | 1 |
3 | 0 | 0 | 0 |
4 | 0 | 0 | 0 |
5 | 1 | 0 | 1 |
6 | 0 | 0 | 0 |
7 | 1 | 0 | 1 |
8 | 0 | 1 | 1 |
9 | 0 | 1 | 1 |
10 | 0 | 0 | 0 |
11 | 1 | 0 | 1 |
12 | 0 | 0 | 0 |
13 | 0 | 1 | 0 |
14 | 0 | 0 | 0 |
CodePudding user response:
IIUC, you want to ffill the 1s only if there is an uninterrupted series of 1s starting on Valid and eventually continuing on ExtendsValid.
For this you can use a groupby.cummin
:
df['FinalValid'] = (
(df['Valid']|df['ExtendsValid'])
.groupby(df['Valid'].cumsum())
.cummin()
)
Output:
NB. I slightly modified the input on row 3 to better illustrate the logic.
# Valid ExtendsValid FinalValid
0 1 0 0 0
1 2 1 0 1
2 3 0 0 0
3 4 0 1 0
4 5 1 0 1
5 6 0 0 0
6 7 1 0 1
7 8 0 1 1
8 9 0 1 1
9 10 0 0 0
10 11 1 0 1
11 12 0 0 0
CodePudding user response:
Try:
state, data = 0, []
for v, e in zip(df.Valid, df.ExtendsValid):
state = v or (not v and e and state)
data.append(int(state))
df["FinalValid_new"] = data
print(df)
Prints:
# Valid ExtendsValid FinalValid FinalValid_new
0 1 0 0 0 0
1 2 1 0 1 1
2 3 0 0 0 0
3 4 0 0 0 0
4 5 1 0 1 1
5 6 0 0 0 0
6 7 1 0 1 1
7 8 0 1 1 1
8 9 0 1 1 1
9 10 0 0 0 0
10 11 1 0 1 1
11 12 0 0 0 0
12 13 0 1 0 0
13 14 0 0 0 0
Note: For more performance I recommend to look at numba package.
EDIT: Example using numba:
from numba import njit
@njit
def compute_valid(valid, extends_valid, out):
state = 0
for i, (v, e) in enumerate(zip(valid, extends_valid)):
state = v or (not v and e and state)
out[i] = int(state)
df["FinalValid_new"] = 0
compute_valid(
df.Valid.to_numpy(), df.ExtendsValid.to_numpy(), df.FinalValid_new.to_numpy()
)
print(df)