Pandas fill two columns row by row recursively

Time:10-30

I have a column (P0) with missing values that tracks the initial value of a metric, and a column (CHG) that tracks the percentage change. The goal is to create a new column (P1) that holds the metric after applying CHG, and to fill P0 with P1 wherever P0 is missing.

import numpy as np
import pandas as pd
df = pd.DataFrame()
df['P0'] = [50, np.nan, 60, np.nan]
df['CHG'] = [0, 0.5, -0.1, 0.2]

So in this example, the recursive calculation would be:

df.loc[1, 'P1'] = df.loc[0, 'P0'] * (1 + df.loc[1, 'CHG']) = 50 * 1.5 = 75

Then because df.loc[1, 'P0'] is missing, we fill it with df.loc[1, 'P0'] = df.loc[1, 'P1']

Next,

df.loc[2, 'P1'] = df.loc[1, 'P0'] * (1 + df.loc[2, 'CHG']) = 75 * 0.9 = 67.5

Since df.loc[2, 'P0'] is not missing, we do not need to fill it.

Finally, df.loc[3, 'P1'] = df.loc[2, 'P0'] * (1 + df.loc[3, 'CHG']) = 60 * 1.2 = 72

Once again, we need to fill df.loc[3, 'P0'] = df.loc[3, 'P1'] = 72
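For reference, the steps above can be written as a plain row-by-row loop. This is only a minimal sketch of the logic described in the question: the function name fill_recursive is made up for illustration, and it works on plain Python lists of floats rather than on the DataFrame.

```python
import math

def fill_recursive(p0, chg):
    """Row-by-row reference (hypothetical helper): returns (filled P0, P1)."""
    p0 = list(p0)                  # copy so the caller's data is untouched
    p1 = [math.nan] * len(p0)      # row 0 has no previous row, so P1 stays NaN
    for i in range(1, len(p0)):
        # new metric = previous (possibly filled) P0 times (1 + CHG)
        p1[i] = p0[i - 1] * (1 + chg[i])
        # fill P0 from P1 only where P0 is missing
        if math.isnan(p0[i]):
            p0[i] = p1[i]
    return p0, p1

p0_filled, p1 = fill_recursive([50.0, math.nan, 60.0, math.nan],
                               [0.0, 0.5, -0.1, 0.2])
print(p0_filled)
print(p1)
```

With the example data this should reproduce the values worked out by hand: 75, 67.5 and 72 for P1, with P0 filled at rows 1 and 3.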

I can probably write this logic into a loop that goes through rows one by one. But I am wondering if there is anything faster given the data has 1 million rows. Or maybe I can use numba to speed up the loop?
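On the numba idea, here is a rough sketch of what a compiled loop could look like. It assumes numba is installed and that both columns are float64 NumPy arrays; the name fill_recursive_nb is hypothetical. If numba is not available, the snippet falls back to running the same loop as plain Python, so it still works, just without the speed-up.

```python
import numpy as np

try:
    from numba import njit          # optional; assumes numba is installed
except ImportError:
    def njit(func):                 # fallback: run the loop as plain Python
        return func

@njit
def fill_recursive_nb(p0, chg):
    p0 = p0.copy()                          # do not modify the input array
    p1 = np.full(p0.shape[0], np.nan)       # row 0 stays NaN
    for i in range(1, p0.shape[0]):
        p1[i] = p0[i - 1] * (1.0 + chg[i])  # uses the already-filled previous P0
        if np.isnan(p0[i]):
            p0[i] = p1[i]
    return p0, p1

p0 = np.array([50.0, np.nan, 60.0, np.nan])
chg = np.array([0.0, 0.5, -0.1, 0.2])
p0_filled, p1 = fill_recursive_nb(p0, chg)
print(p0_filled, p1)
```

With a DataFrame, the arrays could be taken from df['P0'].to_numpy(dtype=float) and df['CHG'].to_numpy(dtype=float), and the two returned arrays assigned back to the P0 and P1 columns.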

CodePudding user response:

A vectorized approach would be:

df['P1'] = (df
   .assign(val=df['P0'].ffill()) # propagate start values
   # group by stretches starting on non-NA values
   .groupby(df['P0'].notna().cumsum().shift().bfill(),
            group_keys=False)
   # compute the cumulated product with the first (propagated) value
   # as starting point
   .apply(lambda g: g['CHG'].add(1).cumprod().mul(g['val'].iloc[0]))
)

# restore initial P1 as NaN (use .loc to avoid chained assignment)
df.loc[df.index[0], 'P1'] = np.nan

# fill P0 with P1 where NaN
df['P0'] = df['P0'].fillna(df['P1'])

output:

     P0  CHG    P1
0  50.0  0.0   NaN
1  75.0  0.5  75.0
2  60.0 -0.1  67.5
3  72.0  0.2  72.0
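To see why the grouping works, it may help to look at the group key and the per-group start value on their own. The snippet below is just an illustration using the example data from the question; the variable names key and start are made up here.

```python
import numpy as np
import pandas as pd

p0 = pd.Series([50, np.nan, 60, np.nan])

# a new stretch begins on the row AFTER each known P0, hence the shift
key = p0.notna().cumsum().shift().bfill()
print(key.tolist())    # [1.0, 1.0, 1.0, 2.0]

# each stretch starts from the last known P0 (forward-filled)
start = p0.ffill().groupby(key).transform('first')
print(start.tolist())  # [50.0, 50.0, 50.0, 60.0]
```

Rows 0 to 2 share the start value 50 and row 3 starts from 60, which is what the cumulative product is multiplied by. Note that the cumulative product of the first group also includes row 0's CHG, so this approach appears to rely on CHG being 0 in the first row, as it is in the example.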