Given a time series data as follow, I need to fill NaN
s in column value
based on previous value and MoM_pct
:
date value MoM_pct
0 2012-1-31 17222.1 0.0019
1 2012-2-29 NaN 0.0101 --> calculated by (1 0.0101) * 17222.1
2 2012-3-31 15650.2 0.0121
3 2012-4-30 15603.1 0.0093
4 2012-5-31 NaN 0.0111 --> calculated by (1 0.0111) * 15603.1
5 2012-6-30 NaN 0.0112 --> calculated by (1 0.0111) * 15603.1 * (1 0.0112)
6 2012-7-31 16314.9 0.0103
7 2012-8-31 16658.9 0.0111
8 2012-9-30 NaN 0.0132 --> calculated by (1 0.0132) * 16658.9
9 2012-10-31 NaN 0.0118 --> calculated by (1 0.0132) * 16658.9 * (1 0.0118)
10 2012-11-30 18476.7 0.0121
11 2012-12-31 20334.2 0.0123
12 2013-1-31 NaN 0.0014 --> calculated by (1 0.0014) * 20334.2
13 2013-2-28 NaN 0.0087 --> calculated by (1 0.0014) * 20334.2 * (1 0.0087)
The expected result will like this:
date value MoM_pct
0 2012-1-31 17222.10000 0.0019
1 2012-2-29 17396.04321 0.0101
2 2012-3-31 15650.20000 0.0121
3 2012-4-30 15603.10000 0.0093
4 2012-5-31 15776.29441 0.0111
5 2012-6-30 15952.98891 0.0112
6 2012-7-31 16314.90000 0.0103
7 2012-8-31 16658.90000 0.0111
8 2012-9-30 16878.79748 0.0132
9 2012-10-31 17077.96729 0.0118
10 2012-11-30 18476.70000 0.0121
11 2012-12-31 20334.20000 0.0123
12 2013-1-31 20362.66788 0.0014
13 2013-2-28 20539.82309 0.0087
14 2013-3-31 17641.20000 0.0150
The example data:
import pandas as pd
from numpy import nan
df = pd.DataFrame({
'date': {0: '2012-1-31', 1: '2012-2-29', 2: '2012-3-31', 3: '2012-4-30',
4: '2012-5-31', 5: '2012-6-30', 6: '2012-7-31', 7: '2012-8-31',
8: '2012-9-30', 9: '2012-10-31', 10: '2012-11-30',
11: '2012-12-31', 12: '2013-1-31', 13: '2013-2-28',
14: '2013-3-31'},
'value': {0: 17222.1, 1: nan, 2: 15650.2, 3: 15603.1, 4: nan, 5: nan,
6: 16314.9, 7: 16658.9, 8: nan, 9: nan, 10: 18476.7, 11: 20334.2,
12: nan, 13: nan, 14: 17641.2},
'MoM_pct': {0: 0.0019, 1: 0.0101, 2: 0.0121, 3: 0.0093, 4: 0.0111,
5: 0.0112, 6: 0.0103, 7: 0.0111, 8: 0.0132, 9: 0.0118,
10: 0.0121, 11: 0.0123, 12: 0.0014, 13: 0.0087, 14: 0.015}
})
My code below works for the example data, but it's not suitable if sample has more than two consecutive values are NaN
s, also not concise:
df['value1'] = df['value'].shift(1)*df['MoM_pct'].add(1)
df['value2'] = df['value1'].shift(1)*df['MoM_pct'].add(1)
df['value'].fillna(df['value1'], inplace=True)
df['value'].fillna(df['value2'], inplace=True)
How can I fix my code above?
CodePudding user response:
We can create the multipliers from the MoM_pct
column with groupby cumprod
then fillna
the valid values with the multiplicative identity 1
and multiply with the ffill
ed value
column:
m = df['value'].isna()
df['value'] = (
df['value'].ffill() *
(1 df['MoM_pct']).groupby((~m).cumsum()[m]).cumprod().fillna(1)
)
Or equivalently with multiply
and a fill_value
instead of fillna
:
m = df['value'].isna()
df['value'] = (
df['value'].ffill().multiply(
df['MoM_pct'].add(1).groupby((~m).cumsum()[m]).cumprod(),
fill_value=1
)
)
df
:
date value MoM_pct
0 2012-1-31 17222.100000 0.0019
1 2012-2-29 17396.043210 0.0101
2 2012-3-31 15650.200000 0.0121
3 2012-4-30 15603.100000 0.0093
4 2012-5-31 15776.294410 0.0111
5 2012-6-30 15952.988907 0.0112
6 2012-7-31 16314.900000 0.0103
7 2012-8-31 16658.900000 0.0111
8 2012-9-30 16878.797480 0.0132
9 2012-10-31 17077.967290 0.0118
10 2012-11-30 18476.700000 0.0121
11 2012-12-31 20334.200000 0.0123
12 2013-1-31 20362.667880 0.0014
13 2013-2-28 20539.823091 0.0087
14 2013-3-31 17641.200000 0.0150
Here is the breakdown of steps as a DataFrame:
m = df['value'].isna()
print(pd.DataFrame({
'm': m,
'groups': (~m).cumsum(),
'filtered groups': (~m).cumsum()[m],
'result of cumprod': (1 df['MoM_pct']).groupby(
(~m).cumsum()[m]
).cumprod(),
'fill missing with 1': (1 df['MoM_pct']).groupby(
(~m).cumsum()[m]
).cumprod().fillna(1)
}))
m groups filtered groups result of cumprod fill missing with 1
0 False 1 NaN NaN 1.000000
1 True 1 1.0 1.010100 1.010100
2 False 2 NaN NaN 1.000000
3 False 3 NaN NaN 1.000000
4 True 3 3.0 1.011100 1.011100
5 True 3 3.0 1.022424 1.022424
6 False 4 NaN NaN 1.000000
7 False 5 NaN NaN 1.000000
8 True 5 5.0 1.013200 1.013200
9 True 5 5.0 1.025156 1.025156
10 False 6 NaN NaN 1.000000
11 False 7 NaN NaN 1.000000
12 True 7 7.0 1.001400 1.001400
13 True 7 7.0 1.010112 1.010112
14 False 8 NaN NaN 1.000000