I have a column (P0) with missing values that tracks the initial value of a metric, and a column that tracks its percentage change (CHG). The goal is to create a new column (P1) that calculates the new metric value after applying CHG, and to fill P0 with P1 wherever P0 is missing.
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['P0'] = [50, np.nan, 60, np.nan]
df['CHG'] = [0, 0.5, -0.1, 0.2]
So in this example, the recursive calculation would be:
df.loc[1, 'P1'] = df.loc[0, 'P0'] * (1 + df.loc[1, 'CHG']) = 50 * 1.5 = 75
Then, because df.loc[1, 'P0'] is missing, we fill it: df.loc[1, 'P0'] = df.loc[1, 'P1'] = 75.
Next,
df.loc[2, 'P1'] = df.loc[1, 'P0'] * (1 + df.loc[2, 'CHG']) = 75 * 0.9 = 67.5
Now df.loc[2, 'P0'] is not missing, so we do not need to fill it.
Finally, df.loc[3, 'P1'] = df.loc[2, 'P0'] * (1 + df.loc[3, 'CHG']) = 60 * 1.2 = 72
Once again, we fill df.loc[3, 'P0'] = df.loc[3, 'P1'] = 72.
I could write this logic as a loop that goes through the rows one by one (sketched below), but I am wondering if there is anything faster, given that the data has 1 million rows. Or maybe I can use numba to speed up the loop?
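Continuing from the df above, the naive row-by-row version I have in mind would be something like:
df['P1'] = np.nan
for i in range(1, len(df)):
    # P1 is the previous (possibly already filled) P0 grown by the current CHG
    df.loc[i, 'P1'] = df.loc[i - 1, 'P0'] * (1 + df.loc[i, 'CHG'])
    # fill P0 from P1 where P0 is missing
    if pd.isna(df.loc[i, 'P0']):
        df.loc[i, 'P0'] = df.loc[i, 'P1']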
CodePudding user response:
A vectorized approach would be:
df['P1'] = (df
            .assign(val=df['P0'].ffill())  # propagate the last known start value forward
            # group by stretches: a new group starts on the row right after a non-NA P0
            .groupby(df['P0'].notna().cumsum().shift().bfill(),
                     group_keys=False)
            # within each stretch, compute the cumulative product of (1 + CHG),
            # anchored on the first (propagated) start value
            .apply(lambda g: g['CHG'].add(1).cumprod().mul(g['val'].iloc[0]))
           )
# restore the initial P1 as NaN (row 0 has no previous P0 to grow from);
# use .loc rather than chained .iloc assignment so the write hits df itself
df.loc[0, 'P1'] = np.nan
# fill P0 with P1 where NaN
df['P0'] = df['P0'].fillna(df['P1'])
output:
P0 CHG P1
0 50.0 0.0 NaN
1 75.0 0.5 75.0
2 60.0 -0.1 67.5
3 72.0 0.2 72.0
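If you still prefer the explicit loop, numba is a good fit here, since the recurrence is inherently sequential. A minimal sketch, assuming numba is installed (the helper name fill_and_chain is mine, not a library function):
import numba as nb
import numpy as np

@nb.njit
def fill_and_chain(p0, chg):
    # p0 is modified in place; the computed P1 array is returned
    p1 = np.full(p0.shape, np.nan)
    for i in range(1, p0.shape[0]):
        p1[i] = p0[i - 1] * (1.0 + chg[i])
        if np.isnan(p0[i]):
            p0[i] = p1[i]
    return p1

p0 = df['P0'].to_numpy(dtype=np.float64, copy=True)
df['P1'] = fill_and_chain(p0, df['CHG'].to_numpy(dtype=np.float64))
df['P0'] = p0
After the one-time JIT compilation on the first call, this should handle a million rows very quickly, as it is a single pass over two float arrays.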