Home > Software design >  Rolling sum across subsets
Rolling sum across subsets

Time:05-27

I have a data set with values ('ee') of interest by quarter, and multiple fields which indicate the amount of ee in various subsets of the population, as follows:

test=pd.DataFrame(data={'cyq':['2018Q1']*3 ['2018Q2']*3 ['2018Q3']*3 ['2018Q4']*3,
                   'species':['canine','canine','feline']*4,
                   'group':['a','b','a']*4,
                    'ee':range(12)})

I'm trying to get a rolling sum of ee over the quarters, specific to each unique subset of the other fields, in this case species, and group. In my actual data set, I have six identifying fields in total.

The following:

test.groupby(['cyq','species','group']).ee.rolling(window=2).sum()

is producing all NaNs. Other solutions which I've found end up rolling the sum within the each quarter, or only on one identifying field. My goal is to take a rolling sum that recognizes canine a, canine b, and feline a as distinct and rolls their ee values by quarter:

enter image description here

Thank you so much for the help. I feel like there's a simple solution, but variations on this question I've found here aren't working on my dataset.

CodePudding user response:

I don't think you got anything from your first code because there is nothing that would come back based on that grouping. However, I was able to get pretty close to what you were looking for by removing the grouping.

test=pd.DataFrame(data={'cyq':['2018Q1']*3 ['2018Q2']*3 ['2018Q3']*3 ['2018Q4']*3,
                   'species':['canine','canine','feline']*4,
                   'group':['a','b','a']*4,
                    'ee':range(12)})
test['ee'] = test['ee'].rolling(window = 2).sum()
test

CodePudding user response:

You don't get any results because in your provided example after the groupby each group is exactly 1 row. Your rolling window is 2. Look at the documentation for pd.rolling, quote from min_periods.

min_periods : int, default None

Minimum number of observations in window required to have a value; otherwise, result is np.nan. For a window that is specified by an integer, min_periods will default to the size of the window.

Since you never have a window of two all values return NaN. If your real data is bigger and each group has more values, your code will work. You could set min_periods to 1 so you will get a return value if there is at least one value.

  • Related