I have a df with index extending past the last data point:
df
2022-01-31 96.210 21649.6
2022-02-28 96.390 21708.4
2022-03-31 97.410 21739.7
2022-04-30 98.630 21644.3
2022-05-31 103.744 21649.2
2022-06-30 102.498 21607.4
2022-07-31 105.138 21636.1
2022-08-31 105.450 21631.8
2022-09-30 109.691 21503.1
2022-10-31 111.745 21414.8
2022-11-30 111.481 21351.6
2022-12-31 104.728 NaN
2023-01-31 103.522 NaN
2023-02-28 NaN
2023-03-31 NaN
2023-04-30 NaN
2023-05-31 NaN
2023-06-30 NaN
2023-07-31 NaN
2023-08-31 NaN
and when I compute pct_change, pandas treats NaNs as values and extends the pct_change calculation past the last actual data point. So instead stopping on 2023-01-31 for the first column, pandas continues computing values for 2023-02-28 and so on:
df.pct_change(12)
2022-01-31 0.069713 0.117543
2022-02-28 0.059464 0.106713
2022-03-31 0.069969 0.094989
2022-04-30 0.061336 0.076258
2022-05-31 0.140671 0.060263
2022-06-30 0.141022 0.056075
2022-07-31 0.135400 0.049517
2022-08-31 0.145573 0.038228
2022-09-30 0.186490 0.025632
2022-10-31 0.188271 0.012812
2022-11-30 0.187484 0.000112
2022-12-31 0.090576 -0.006440
2023-01-31 0.076000 -0.013765
2023-02-28 0.073991 -0.016436
2023-03-31 0.062745 -0.017852
2023-04-30 0.049600 -0.013523
2023-05-31 -0.002140 -0.013746
2023-06-30 0.009990 -0.011839
2023-07-31 -0.015370 -0.013149
2023-08-31 -0.018284 -0.012953
How can I tell pandas to compute pct_change only upto NaN values? So the output is:
2022-01-31 0.069713 0.117543
2022-02-28 0.059464 0.106713
2022-03-31 0.069969 0.094989
2022-04-30 0.061336 0.076258
2022-05-31 0.140671 0.060263
2022-06-30 0.141022 0.056075
2022-07-31 0.135400 0.049517
2022-08-31 0.145573 0.038228
2022-09-30 0.186490 0.025632
2022-10-31 0.188271 0.012812
2022-11-30 0.187484 0.000112
2022-12-31 0.090576 NaN
2023-01-31 0.076000 NaN
2023-02-28 NaN NaN
2023-03-31 NaN NaN
2023-04-30 NaN NaN
2023-05-31 NaN NaN
2023-06-30 NaN NaN
2023-07-31 NaN NaN
2023-08-31 NaN NaN
The following options won't work for me:
- dropna
- specifying a range a-la df=df[:'2023-01-31'].pct_change(12)
I need to keep the index past the last data point and I am doing a lot of different pct_changes so specifying a range everytime is too time consuming, I am looking for a nicer solution.
Also, specifying a range won't work because when when the df gets new data points, I will have to manually adjust all of the ranges, which is no bueno.
CodePudding user response:
If you look at the documentation for df.pct_change
, you'll find that it has a parameter fill_method
that uses pad
as the default for handling NaN values before computing the changes. pad
(or ffill
) means that the function is propagating the last valid observation forward. E.g. in your first column, with periods=12
, when you reach the index value 2023-02-28
, instead of using the NaN value, the function picks the last valid value (103.522
at 2023-01-31
) to do the calculation:
103.522/96.390-1
# 0.0739910779126467 (96.390 being the value at `2022-02-28`, one year earlier)
To avoid this, you want to set fill_method
to bfill
(cf. the documentation for df.fillna
for these methods). Result examplified below with periods=3
(I trust you are only showing the last rows of a longer df
, because otherwise all the values before 2023-01-31
should be NaN values):
df.pct_change(periods=3, fill_method='bfill')
col1 col2
2022-01-31 NaN NaN
2022-02-28 NaN NaN
2022-03-31 NaN NaN
2022-04-30 0.025153 -0.000245
2022-05-31 0.076294 -0.002727
2022-06-30 0.052233 -0.006086
2022-07-31 0.065984 -0.000379
2022-08-31 0.016444 -0.000804
2022-09-30 0.070177 -0.004827
2022-10-31 0.062841 -0.010228
2022-11-30 0.057193 -0.012953
2022-12-31 -0.045245 NaN
2023-01-31 -0.073587 NaN
2023-02-28 NaN NaN
2023-03-31 NaN NaN
2023-04-30 NaN NaN
2023-05-31 NaN NaN
2023-06-30 NaN NaN
2023-07-31 NaN NaN
2023-08-31 NaN NaN
CodePudding user response:
You can put ptc_change(limit=1)
, which will stop comparison further after 1 null comparison.