Pandas DataFrame conditional forward filling based on first row values


I have the following DataFrame:

import pandas as pd
df = pd.DataFrame({
    'col1':['A',pd.NA,pd.NA,pd.NA,pd.NA, 'B', pd.NA, pd.NA],
    'col2':[9.5, 6,24,8, 30, 7, 6, 8],
    })
print(df)

Giving:

   col1  col2
0     A   9.5
1  <NA>   6.0
2  <NA>  24.0
3  <NA>   8.0
4  <NA>  30.0
5     B   7.0
6  <NA>   6.0
7  <NA>   8.0

What I'd like to achieve is to forward fill col1, but not in rows where the value in col2 is greater than the col2 value of the row I'm forward filling from. Like so:

   col1  col2
0     A   9.5
1     A   6.0
2  <NA>  24.0
3     A   8.0
4  <NA>  30.0
5     B   7.0
6     B   6.0
7  <NA>   8.0

Here, index 1 and 3 are forward filled, but index 2 and 4 are not, as their col2 values (24, 30) are greater than the initial row's col2 value (9.5). Likewise, index 7 is not filled because its col2 value (8) is greater than B's col2 value (7).

I can achieve the desired result by iterating through the df, like so:

val2 = -1
for i, r in df.iterrows():
    if not pd.isnull(r['col1']):
        # Remember the most recent non-null col1 label and its col2 value
        val1 = r['col1']
        val2 = r['col2']
    else:
        # Fill only if this row's col2 value is below the remembered col2 value
        if r['col2'] < val2:
            df.loc[i, 'col1'] = val1
print(df)

Is there a way to achieve this without using iterrows()?

CodePudding user response:

You can forward fill the missing values into a helper Series s, then group col2 by this Series and compare each group's first value against all of its values; where the first value is greater, take the filled value from s via numpy.where:

import numpy as np

s = df['col1'].ffill()

df['col1'] = np.where(df.groupby(s)['col2'].transform('first').gt(df['col2']), s, df['col1'])
print(df)
   col1  col2
0     A   9.5
1     A   6.0
2  <NA>  24.0
3     A   8.0
4  <NA>  30.0
5     B   7.0
6     B   6.0
7  <NA>   8.0
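
For reference, this is roughly what the intermediates look like when computed on the original df, before col1 is overwritten (an illustrative sketch; firsts and mask are just labels used here):

s = df['col1'].ffill()                             # A A A A A B B B
firsts = df.groupby(s)['col2'].transform('first')  # 9.5 9.5 9.5 9.5 9.5 7.0 7.0 7.0
mask = firsts.gt(df['col2'])
print(mask.tolist())
# [False, True, False, True, False, False, True, False]

At index 0 and 5 the mask is False, but np.where then simply keeps the existing non-null values A and B, so those rows are unchanged.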

CodePudding user response:

You can also group "col2" by the non-NaN positions of "col1" (using a cumulative sum of notna as the group key), take each group's first "col2" value and transform it back onto the rows; comparing it with "col2" gives a boolean mask marking the rows whose "col2" value does not exceed the filler row's "col2" value. Then ffill the rows caught by this mask and concatenate them with the rows that are not.

msk = df.groupby(df['col1'].notna().cumsum())['col2'].transform('first') >= df['col2']
df = pd.concat((df[msk].ffill(), df[~msk])).sort_index()

Output:

  col1  col2
0    A   9.5
1    A   6.0
2  NaN  24.0
3    A   8.0
4  NaN  30.0
5    B   7.0
6    B   6.0
7  NaN   8.0
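
To make the grouping explicit, the group key and the mask come out as follows for the sample data (an illustrative sketch; compute it on the original df before it is overwritten):

key = df['col1'].notna().cumsum()   # 1 1 1 1 1 2 2 2 - one group per non-null col1 marker
msk = df.groupby(key)['col2'].transform('first') >= df['col2']
print(msk.tolist())
# [True, True, False, True, False, True, True, False]

df[msk].ffill() then fills col1 only within the kept rows, and concatenating with df[~msk] followed by sort_index restores the original row order.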