Pandas Rolling With MultiIndex and GroupBy-CodePudding

I am looking at creating a rolling sum of the past n results for a given id. The index of the DataFrame is the id and date.

The code below works for non-time-based rolling windows, i.e. integers. However, does not work for time-based intervals, such as '10D' (10 days).

df2['rolling_sum'] = df2.groupby(['id'])['a_column_to_rolling_sum'].apply(lambda x: x.rolling(2, 1).sum().shift())

>> Rolling Result
id               date_dt   
-2143487296      2019-07-08         NaN
                 2019-07-15    0.104478
                 2019-07-19    0.217260
-2143477291      2019-07-05         NaN
                 2019-07-10    0.238764
                 2019-07-16    0.391669
                 2019-07-22    0.255469
                 2019-07-29    0.244011

The code below is able to almost get what I want, however, when a new group is reached, it should be NaN as there cannot be a previous sum.

rolling_result = (
    df2
    .reset_index(level=0)
    .groupby('id')['a_column_to_rolling_sum']
    .rolling('10D', min_periods=1)
    .sum()
    .shift(1)
)

# Add to df
df2['rolling_sum'] = rolling_result

>> Rolling Result
id               date_dt   
-2143487296      2019-07-08         NaN
                 2019-07-15    0.104478
                 2019-07-19    0.217260
-2143477291      2019-07-05    0.229506  <- Why is it not NaN!
                 2019-07-10    0.238764
                 2019-07-16    0.391669
                 2019-07-22    0.255469
                 2019-07-29    0.244011

Image of the two columns where the left is what I want and the right is what I get from 2.

To recap: I want to group by multiple columns, including an id and date. For each of these groups, I want to create a rolling sum of the previous n days ('10D') and m occurrences (integer value), such that the start of each group is NaN.

Thank you very much!

CodePudding user response：

You should remove the min_periods=1 argument from rolling and this will give you what you seek (for reference, the min_periods argument is documented as "Minimum number of observations in window required to have a value; otherwise, result is np.nan.")

CodePudding user response：

The problem is occurring during the shift. The rolling result needs to be grouped again, i.e.

rolling_result = (
    df2
    .reset_index(level=0)
    .groupby('id')['a_column_to_rolling_sum']
    .rolling('10D', min_periods=1)
    .sum()
    .groupby('id')
    .shift(1)
)

# Add to df
df2['rolling_sum'] = rolling_result

This works with multiple groupby arguments too. If you have other indices, remove all bar the date, i.e., .reset_index(level=(0,1,...)).