Fill pandas column forward iteratively, but without using iteration?


I have a pandas data frame with a column "Valid" where a condition is met based on other elements in the data frame (not shown). Additionally, I have a column "ExtendsValid" that extends the validity one row further according to the following rule:

If a valid row is followed directly by ExtendsValid, that row is also valid, even if the underlying valid condition doesn't apply. Continue filling Valid forward as long as ExtendsValid is 1.

I have illustrated the desired result in column "FinalValid" (it doesn't have to be a new column; filling Valid forward would also work). Note that rows 8 and 9 in the example also become valid. Also note that row 13 does NOT result in FinalValid, because a preceding valid row is required; that preceding valid row can itself be either Valid or an extended valid row.

So far, when I had a problem like this, I used a cumbersome multi-step process:

  1. Create a new column for when "Valid" or "ExtendsValid" is true
  2. Create a new column marking the start point of each "sub-series" (a consecutive run of ones)
  3. Number each sub-series
  4. fillna using "group by" for each sub-series
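A rough sketch of those four steps (column names are illustrative; I use a group-wise cummax in place of the fillna step, which amounts to the same forward fill):

```python
import pandas as pd

df = pd.DataFrame({
    "Valid":        [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    "ExtendsValid": [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0],
})

# 1. Mark rows where either flag is set
either = (df["Valid"] | df["ExtendsValid"]).astype(bool)

# 2. Mark the start point of each consecutive run of ones
start = either & ~either.shift(fill_value=False)

# 3. Number each sub-series (rows between runs inherit the previous
#    number, but they are masked out by `either` in step 4 anyway)
group = start.cumsum()

# 4. Within each run, a row is valid only if a Valid row occurred
#    at or before it in the same run
df["FinalValid"] = (
    either & df["Valid"].groupby(group).cummax().astype(bool)
).astype(int)
```

This is the kind of multi-step pipeline I would like to replace with something simpler.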

I can provide sample code but I am really looking for a totally different, more efficient approach, which of course must be non-iterating as well.

Any ideas would be welcome.

#   Valid  ExtendsValid  FinalValid
1       0             0           0
2       1             0           1
3       0             0           0
4       0             0           0
5       1             0           1
6       0             0           0
7       1             0           1
8       0             1           1
9       0             1           1
10      0             0           0
11      1             0           1
12      0             0           0
13      0             1           0
14      0             0           0

CodePudding user response:

IIUC, you want to ffill the 1s only if there is an uninterrupted series of 1s starting on Valid and eventually continuing on ExtendsValid.

For this you can use a groupby.cummin:

df['FinalValid'] = (
    (df['Valid'] | df['ExtendsValid'])
    .groupby(df['Valid'].cumsum())
    .cummin()
)

Output:

NB. I slightly modified the input on row #4 (index 3) to better illustrate the logic.

     #  Valid  ExtendsValid  FinalValid
0    1      0             0           0
1    2      1             0           1
2    3      0             0           0
3    4      0             1           0
4    5      1             0           1
5    6      0             0           0
6    7      1             0           1
7    8      0             1           1
8    9      0             1           1
9   10      0             0           0
10  11      1             0           1
11  12      0             0           0
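The trick is the grouping key: df['Valid'].cumsum() starts a new group at every Valid row, so within each group cummin stays at 1 only while the run of 1s is unbroken and anchored by a Valid row. A quick check against the original (unmodified) input:

```python
import pandas as pd

df = pd.DataFrame({
    "Valid":        [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    "ExtendsValid": [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0],
})

# Each group starts at a Valid row; the lone ExtendsValid on row 13
# belongs to a group whose earlier rows contain a 0, so cummin kills it
group = df["Valid"].cumsum()

df["FinalValid"] = (
    (df["Valid"] | df["ExtendsValid"])
    .groupby(group)
    .cummin()
)
```

Row 13 ends up 0 because its group (started by the Valid on row 11) already hit a 0 on row 12, so cummin never recovers.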

CodePudding user response:

Try:

state, data = 0, []
for v, e in zip(df.Valid, df.ExtendsValid):
    # a row is valid if Valid is set, or if ExtendsValid is set
    # and the previous row was valid
    state = v or (not v and e and state)
    data.append(int(state))

df["FinalValid_new"] = data
print(df)

Prints:

     #  Valid  ExtendsValid  FinalValid  FinalValid_new
0    1      0             0           0               0
1    2      1             0           1               1
2    3      0             0           0               0
3    4      0             0           0               0
4    5      1             0           1               1
5    6      0             0           0               0
6    7      1             0           1               1
7    8      0             1           1               1
8    9      0             1           1               1
9   10      0             0           0               0
10  11      1             0           1               1
11  12      0             0           0               0
12  13      0             1           0               0
13  14      0             0           0               0

Note: For more performance I recommend looking at the numba package.


EDIT: Example using numba:

from numba import njit


@njit
def compute_valid(valid, extends_valid, out):
    state = 0
    for i, (v, e) in enumerate(zip(valid, extends_valid)):
        state = v or (not v and e and state)
        out[i] = int(state)


df["FinalValid_new"] = 0
compute_valid(
    df.Valid.to_numpy(), df.ExtendsValid.to_numpy(), df.FinalValid_new.to_numpy()
)
print(df)
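One caveat with writing into df.FinalValid_new.to_numpy(): pandas does not guarantee that to_numpy() returns a writable view of the column (for extension dtypes it can return a copy, and the in-place writes would then be lost). A safer pattern is to allocate the output array explicitly and assign it back afterwards. A plain-Python sketch of that pattern (in practice the function would be @njit-compiled as above):

```python
import numpy as np
import pandas as pd


def compute_valid(valid, extends_valid, out):
    # same state machine as the numba version above
    state = 0
    for i, (v, e) in enumerate(zip(valid, extends_valid)):
        state = v or (not v and e and state)
        out[i] = int(state)


df = pd.DataFrame({
    "Valid":        [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    "ExtendsValid": [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0],
})

# allocate the output buffer ourselves, then assign it to the frame
out = np.empty(len(df), dtype=np.int64)
compute_valid(df["Valid"].to_numpy(), df["ExtendsValid"].to_numpy(), out)
df["FinalValid_new"] = out
```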