Home > Back-end >  how to calculate mean values of the past n years grouped by two variables
how to calculate mean values of the past n years grouped by two variables

Time:03-18

First off I would like to say I checked out [this answer][1] and I couldn't progress with the information there.

So I have a dataset as such

df = pd.DataFrame({'ID': [10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013],
'Type': ['Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue'],
'Year': [2018,2018,2019,2019,2020,2020,2021,2021,2022,2022,2018,2018,2019,2019,2021,2021,2018,2018,2019,2019,2020,2020,2021,2021,2022,2022,2018,2018,2019,2019,2021,2021],
'Score': [0,0,0,0,0,0,0,0,0,0,14,24,16,5,87,33,0,0,0,0,0,0,0,0,0,0,11,13,3,16,37,49]})

I don't know how to type in NaN so I make it like this:

df.replace(0, np.nan, inplace=True)
df = df.dropna(axis=0, subset=['Score'])

I want to take the rolling N (In this case 3) Year average of score by ID and Subject.

I can get the 3 year rolling average score by ID:

df['average_past_3_years'] = df.groupby(['ID'], as_index = False).rolling(3).agg( {'Score':'mean', 'Year': 'max'}).reset_index(level=0).groupby( 'Year').transform('shift')['Score'] df = df.sort_values(['ID', 'Year'])

That gives me the rolling average but only by year on ID, not subject as well:

ID     Type   Year  Score   average_past_3_years
10010   Red   2018  14.0    NaN
10010   Blue  2018  24.0    NaN
10010   Red   2019  16.0    NaN
10010   Blue  2019  5.0     18.000000
10010   Red   2021  87.0    NaN
10010   Blue  2021  33.0    36.000000
10013   Red   2018  11.0    NaN
10013   Blue  2018  13.0    NaN
10013   Red   2019  3.0     15.000000
10013   Blue  2019  16.0    9.000000
10013   Red   2021  37.0    41.666667
10013   Blue  2021  49.0    18.666667

I am trying to out put say 10013 for red the 2021 would be 17, not 41 as it would only be calculating the red scores.

I tried:

df['average_past_3_years'] = df.groupby([['ID', 'Type']], as_index = False).rolling(3).agg(
                      {'Score':'mean', 'Year': 'max'}).reset_index(level=0).groupby(
                      'Year').transform('shift')['Score']

but got this error:

Grouper and axis must be same length

And kinda got stuck there

I'm also unsure if I need to sort before hand

  [1]: https://stackoverflow.com/questions/61837481/pandas-how-to-calculate-mean-values-of-the-past-n-years-for-every-month

CodePudding user response:

IIUC use:

df1 = df.groupby(['ID','Type']).rolling(3).agg( {'Score':'mean', 'Year': 'max'}).droplevel([0,1])

df = df.join(df1.add_suffix('_r'))
df['Score_r'] = df.groupby('Year')['Score_r'].shift()
  • Related