Pandas seems to use previous rows instead of the next rows when it creates rolling windows. To illustrate, I will use this data:
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2022-01-01', '2022-01-07'),
    'num': [20, 27, 19, 23, 24, 26, 21],
}).set_index('date')
            num
date
2022-01-01   20
2022-01-02   27
2022-01-03   19
2022-01-04   23
2022-01-05   24
2022-01-06   26
2022-01-07   21
I am trying to create a new column that, for each day, calculates the maximum percentage increase over the next two days relative to that day. For example, on 2022-01-01, within a window of the next two days, the number increases from 20 to 27, which is a 35% increase. This is the code:
df['max_increase'] = df['num'].rolling(3).apply(
    lambda xs: max(xs[1:] - xs[0]) / xs[0]
)
            num  max_increase
date
2022-01-01   20           NaN
2022-01-02   27           NaN
2022-01-03   19      0.350000
2022-01-04   23     -0.148148
2022-01-05   24      0.263158
2022-01-06   26      0.130435
2022-01-07   21      0.083333
However, that is not what I want. I do not want the NaN values. I want the 35% increase (i.e. 0.350000) to be associated with 2022-01-01, not with 2022-01-03. Note that df['max_increase'] = df['max_increase'].shift(-2) will not fix the problem, since it will produce NaN values for the last two rows. This is what I want:
            num  max_increase
date
2022-01-01   20      0.350000  # Increase from 20 to 27.
2022-01-02   27     -0.148148  # Decrease from 27 to 23.
2022-01-03   19      0.263158  # Increase from 19 to 24.
2022-01-04   23      0.130435  # Increase from 23 to 26.
2022-01-05   24      0.083333  # Increase from 24 to 26.
2022-01-06   26     -0.192308  # Decrease from 26 to 21.
2022-01-07   21      0.000000
It appears that the problem stems from the rolling windows that Pandas creates. Pandas seems to build these 3-day windows from previous rows instead of future rows:
2022-01-01
2022-01-01, 2022-01-02
2022-01-01, 2022-01-02, 2022-01-03
2022-01-02, 2022-01-03, 2022-01-04
2022-01-03, 2022-01-04, 2022-01-05
2022-01-04, 2022-01-05, 2022-01-06
2022-01-05, 2022-01-06, 2022-01-07
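The backward-looking windows can be checked directly. This is only a sketch, assuming pandas 1.1 or later, where a Rolling object can be iterated; the `windows` variable is an illustrative name, not part of the original code:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2022-01-01', '2022-01-07'),
    'num': [20, 27, 19, 23, 24, 26, 21],
}).set_index('date')

# Iterating over a Rolling object yields one window per row,
# including the partial windows at the start of the series.
windows = [list(w.index.strftime('%Y-%m-%d')) for w in df['num'].rolling(3)]

for dates in windows:
    print(', '.join(dates))
```

Each printed line lists the dates that fall into the window ending on that row.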
How do I make Pandas create a forward rolling window instead? i.e.
2022-01-01, 2022-01-02, 2022-01-03
2022-01-02, 2022-01-03, 2022-01-04
2022-01-03, 2022-01-04, 2022-01-05
2022-01-04, 2022-01-05, 2022-01-06
2022-01-05, 2022-01-06, 2022-01-07
2022-01-06, 2022-01-07
2022-01-07
CodePudding user response:
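Reverse the series so that each backward-looking window covers the current day and the next two, then use a shifted rolling max:
df['max_increase'] = (
    df.loc[::-1, 'num']                  # reversed, so "previous" rows are future days
      .rolling(3, min_periods=1)
      .apply(lambda x: x.shift().max())  # max of the window, excluding the current day
      .sub(df['num'])
      .div(df['num'])
      .fillna(0)                         # the last day has no future rows
)
Output: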
            num  max_increase
date
2022-01-01   20      0.350000
2022-01-02   27     -0.148148
2022-01-03   19      0.263158
2022-01-04   23      0.130435
2022-01-05   24      0.083333
2022-01-06   26     -0.192308
2022-01-07   21      0.000000
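The same numbers can also be reached without `rolling` at all, by comparing each day with explicitly shifted copies of the column. This is a different technique from the reversed rolling max above, sketched here under the assumption that the window is small enough to build one shifted column per future day; the names `n`, `future` and `future_max` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2022-01-01', '2022-01-07'),
    'num': [20, 27, 19, 23, 24, 26, 21],
}).set_index('date')

n = 3  # window size: the current day plus the next n-1 days
num = df['num']

# One column per look-ahead distance (1 day ahead, 2 days ahead, ...).
# Rows near the end get NaN where no future value exists.
future = pd.concat({k: num.shift(-k) for k in range(1, n)}, axis=1)

# Row-wise max skips NaN; the last row is all-NaN and stays NaN.
future_max = future.max(axis=1)

df['max_increase'] = ((future_max - num) / num).fillna(0)
print(df)
```

This avoids a Python-level `apply`, at the cost of holding n-1 shifted copies of the column in memory.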
CodePudding user response:
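Another option is to pad the dataframe with n-1 copies of its last row, apply rolling, and then shift the results back:
n = 3  # window size
df['max_increase'] = (
    pd.concat([df, pd.concat([df[-1:]] * (n - 1))])['num']  # pad with n-1 copies of the last row
      # raw=True passes NumPy arrays, so xs[0] is a positional lookup
      .rolling(n).apply(lambda xs: max(xs[1:] - xs[0]) / xs[0], raw=True)
      .shift(-(n - 1))  # realign each result with the first day of its window
      .dropna()         # drop the padding rows
)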
Output:
            num  max_increase
date
2022-01-01   20      0.350000
2022-01-02   27     -0.148148
2022-01-03   19      0.263158
2022-01-04   23      0.130435
2022-01-05   24      0.083333
2022-01-06   26     -0.192308
2022-01-07   21      0.000000
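For completeness, pandas also ships a forward-looking window indexer, which matches the windows the question asks for more directly than padding or reversing. The following is a sketch rather than a definitive solution; it assumes a pandas version that provides `pandas.api.indexers.FixedForwardWindowIndexer` (1.1 or later, as far as I know), and the `len(xs) > 1` guard for the final one-row window is my own choice to reproduce the 0.0 in the desired output:

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

df = pd.DataFrame({
    'date': pd.date_range('2022-01-01', '2022-01-07'),
    'num': [20, 27, 19, 23, 24, 26, 21],
}).set_index('date')

# Each window starts at the current row and extends forward;
# windows at the end of the series are truncated.
indexer = FixedForwardWindowIndexer(window_size=3)

df['max_increase'] = df['num'].rolling(indexer, min_periods=1).apply(
    # xs is a NumPy array (raw=True): xs[0] is the current day,
    # xs[1:] are the following days. The last row has no future
    # days, so it is assigned 0.0 by assumption.
    lambda xs: (xs[1:].max() - xs[0]) / xs[0] if len(xs) > 1 else 0.0,
    raw=True,
)
print(df)
```

Setting `min_periods=1` is what keeps the truncated windows at the end from producing NaN.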