Fillna based on month over month percent change and previous value in Pandas-CodePudding

Given a time series data as follow, I need to fill NaNs in column value based on previous value and MoM_pct:

          date    value  MoM_pct
0    2012-1-31  17222.1   0.0019
1    2012-2-29      NaN   0.0101  --> calculated by (1 0.0101) * 17222.1
2    2012-3-31  15650.2   0.0121
3    2012-4-30  15603.1   0.0093
4    2012-5-31      NaN   0.0111  --> calculated by (1 0.0111) * 15603.1
5    2012-6-30      NaN   0.0112  --> calculated by (1 0.0111) * 15603.1 * (1 0.0112)
6    2012-7-31  16314.9   0.0103
7    2012-8-31  16658.9   0.0111
8    2012-9-30      NaN   0.0132  --> calculated by (1 0.0132) * 16658.9
9   2012-10-31      NaN   0.0118  --> calculated by (1 0.0132) * 16658.9 * (1 0.0118)
10  2012-11-30  18476.7   0.0121
11  2012-12-31  20334.2   0.0123
12   2013-1-31      NaN   0.0014  --> calculated by (1 0.0014) * 20334.2
13   2013-2-28      NaN   0.0087  --> calculated by (1 0.0014) * 20334.2 * (1 0.0087)

The expected result will like this:

          date        value  MoM_pct
0    2012-1-31  17222.10000   0.0019
1    2012-2-29  17396.04321   0.0101
2    2012-3-31  15650.20000   0.0121
3    2012-4-30  15603.10000   0.0093
4    2012-5-31  15776.29441   0.0111
5    2012-6-30  15952.98891   0.0112
6    2012-7-31  16314.90000   0.0103
7    2012-8-31  16658.90000   0.0111
8    2012-9-30  16878.79748   0.0132
9   2012-10-31  17077.96729   0.0118
10  2012-11-30  18476.70000   0.0121
11  2012-12-31  20334.20000   0.0123
12   2013-1-31  20362.66788   0.0014
13   2013-2-28  20539.82309   0.0087
14   2013-3-31  17641.20000   0.0150

The example data:

import pandas as pd
from numpy import nan

df = pd.DataFrame({
    'date': {0: '2012-1-31', 1: '2012-2-29', 2: '2012-3-31', 3: '2012-4-30',
             4: '2012-5-31', 5: '2012-6-30', 6: '2012-7-31', 7: '2012-8-31',
             8: '2012-9-30', 9: '2012-10-31', 10: '2012-11-30',
             11: '2012-12-31', 12: '2013-1-31', 13: '2013-2-28',
             14: '2013-3-31'},
    'value': {0: 17222.1, 1: nan, 2: 15650.2, 3: 15603.1, 4: nan, 5: nan,
              6: 16314.9, 7: 16658.9, 8: nan, 9: nan, 10: 18476.7, 11: 20334.2,
              12: nan, 13: nan, 14: 17641.2},
    'MoM_pct': {0: 0.0019, 1: 0.0101, 2: 0.0121, 3: 0.0093, 4: 0.0111,
                5: 0.0112, 6: 0.0103, 7: 0.0111, 8: 0.0132, 9: 0.0118,
                10: 0.0121, 11: 0.0123, 12: 0.0014, 13: 0.0087, 14: 0.015}
})

My code below works for the example data, but it's not suitable if sample has more than two consecutive values are NaNs, also not concise:

df['value1'] = df['value'].shift(1)*df['MoM_pct'].add(1)
df['value2'] = df['value1'].shift(1)*df['MoM_pct'].add(1)
df['value'].fillna(df['value1'], inplace=True)
df['value'].fillna(df['value2'], inplace=True)

How can I fix my code above?

CodePudding user response：

We can create the multipliers from the MoM_pct column with groupby cumprod then fillna the valid values with the multiplicative identity 1 and multiply with the ffilled value column:

m = df['value'].isna()
df['value'] = (
        df['value'].ffill() *
        (1   df['MoM_pct']).groupby((~m).cumsum()[m]).cumprod().fillna(1)
)

Or equivalently with multiply and a fill_value instead of fillna:

m = df['value'].isna()
df['value'] = (
    df['value'].ffill().multiply(
        df['MoM_pct'].add(1).groupby((~m).cumsum()[m]).cumprod(),
        fill_value=1
    )
)

df:

          date         value  MoM_pct
0    2012-1-31  17222.100000   0.0019
1    2012-2-29  17396.043210   0.0101
2    2012-3-31  15650.200000   0.0121
3    2012-4-30  15603.100000   0.0093
4    2012-5-31  15776.294410   0.0111
5    2012-6-30  15952.988907   0.0112
6    2012-7-31  16314.900000   0.0103
7    2012-8-31  16658.900000   0.0111
8    2012-9-30  16878.797480   0.0132
9   2012-10-31  17077.967290   0.0118
10  2012-11-30  18476.700000   0.0121
11  2012-12-31  20334.200000   0.0123
12   2013-1-31  20362.667880   0.0014
13   2013-2-28  20539.823091   0.0087
14   2013-3-31  17641.200000   0.0150

Here is the breakdown of steps as a DataFrame:

m = df['value'].isna()
print(pd.DataFrame({
    'm': m,
    'groups': (~m).cumsum(),
    'filtered groups': (~m).cumsum()[m],
    'result of cumprod': (1   df['MoM_pct']).groupby(
        (~m).cumsum()[m]
    ).cumprod(),
    'fill missing with 1': (1   df['MoM_pct']).groupby(
        (~m).cumsum()[m]
    ).cumprod().fillna(1)
}))

        m  groups  filtered groups  result of cumprod  fill missing with 1
0   False       1              NaN                NaN             1.000000
1    True       1              1.0           1.010100             1.010100
2   False       2              NaN                NaN             1.000000
3   False       3              NaN                NaN             1.000000
4    True       3              3.0           1.011100             1.011100
5    True       3              3.0           1.022424             1.022424
6   False       4              NaN                NaN             1.000000
7   False       5              NaN                NaN             1.000000
8    True       5              5.0           1.013200             1.013200
9    True       5              5.0           1.025156             1.025156
10  False       6              NaN                NaN             1.000000
11  False       7              NaN                NaN             1.000000
12   True       7              7.0           1.001400             1.001400
13   True       7              7.0           1.010112             1.010112
14  False       8              NaN                NaN             1.000000