I am looking at creating a rolling sum of the past n
results for a given id
. The index of the DataFrame is the id
and date
.
The code below works for non-time-based rolling windows, i.e. integers. However, does not work for time-based intervals, such as '10D'
(10 days).
1.
df2['rolling_sum'] = df2.groupby(['id'])['a_column_to_rolling_sum'].apply(lambda x: x.rolling(2, 1).sum().shift())
>> Rolling Result
id date_dt
-2143487296 2019-07-08 NaN
2019-07-15 0.104478
2019-07-19 0.217260
-2143477291 2019-07-05 NaN
2019-07-10 0.238764
2019-07-16 0.391669
2019-07-22 0.255469
2019-07-29 0.244011
The code below is able to almost get what I want, however, when a new group is reached, it should be NaN
as there cannot be a previous sum.
2.
rolling_result = (
df2
.reset_index(level=0)
.groupby('id')['a_column_to_rolling_sum']
.rolling('10D', min_periods=1)
.sum()
.shift(1)
)
# Add to df
df2['rolling_sum'] = rolling_result
>> Rolling Result
id date_dt
-2143487296 2019-07-08 NaN
2019-07-15 0.104478
2019-07-19 0.217260
-2143477291 2019-07-05 0.229506 <- Why is it not NaN!
2019-07-10 0.238764
2019-07-16 0.391669
2019-07-22 0.255469
2019-07-29 0.244011
Image of the two columns where the left is what I want and the right is what I get from 2.
To recap: I want to group by multiple columns, including an id
and date
. For each of these groups, I want to create a rolling sum of the previous n
days ('10D'
) and m
occurrences (integer value), such that the start of each group is NaN
.
Thank you very much!
CodePudding user response:
You should remove the min_periods=1
argument from rolling
and this will give you what you seek (for reference, the min_periods
argument is documented as "Minimum number of observations in window required to have a value; otherwise, result is np.nan.")
CodePudding user response:
The problem is occurring during the shift. The rolling result needs to be grouped again, i.e.
rolling_result = (
df2
.reset_index(level=0)
.groupby('id')['a_column_to_rolling_sum']
.rolling('10D', min_periods=1)
.sum()
.groupby('id')
.shift(1)
)
# Add to df
df2['rolling_sum'] = rolling_result
This works with multiple groupby
arguments too. If you have other indices, remove all bar the date, i.e., .reset_index(level=(0,1,...))
.