Home > Software design >  How do I use the Pandas groupby function to calculate the mean for the previous year?
How do I use the Pandas groupby function to calculate the mean for the previous year?

Time:03-04

I am trying to look for a method to find a Player's mean score for the "Last Season" (Previous Year) and add it in a new column in the original dataframe df.

I have coded a formula to get a Player's mean score for the current year, excluding the current row, which is as follows:

df['Season Avg'] = df.groupby([df['Player'], df['DateTime'].dt.year])['Score']
                   .apply(lambda x: x.shift(1).expanding().mean())

However, despite my best attempt at using the shift function, I can not quite work out how to calculate the previous years mean ("Last Season Avg") directly into a new column.

The dataframe is set out as follows:

Player DateTime Score Season Avg
PlayerB 2020-MM-DD HH:MM:SS 40 NaN
PlayerA 2020-MM-DD HH:MM:SS 50 NaN
PlayerA 2021-MM-DD HH:MM:SS 100 NaN
PlayerB 2021-MM-DD HH:MM:SS 200 NaN
PlayerA 2021-MM-DD HH:MM:SS 160 100
PlayerB 2021-MM-DD HH:MM:SS 140 200
PlayerB 2021-MM-DD HH:MM:SS 160 170
PlayerA 2021-MM-DD HH:MM:SS 200 130

The new ideal dataframe that I would like:

Player DateTime Score Season Avg Last Season Avg
PlayerB 2020-MM-DD HH:MM:SS 40 NaN NaN
PlayerA 2020-MM-DD HH:MM:SS 50 NaN NaN
PlayerA 2021-MM-DD HH:MM:SS 100 NaN 50
PlayerB 2021-MM-DD HH:MM:SS 200 NaN 40
PlayerA 2021-MM-DD HH:MM:SS 160 100 50
PlayerB 2021-MM-DD HH:MM:SS 140 200 40
PlayerB 2021-MM-DD HH:MM:SS 160 170 40
PlayerA 2021-MM-DD HH:MM:SS 200 130 50

CodePudding user response:

You can groupby once by the "Player" and the year to find the yearly average for each player; then groupby "Player" shift to get the previous year's previous year's averages.

out = df.groupby(['Player', df['DateTime'].dt.year])['Score'].mean().reset_index(name='Season Avg')
out['Last Season Avg'] = out['Season Avg'].groupby('Player').shift()

If you're looking for career averages until a particular season, you could use expanding().mean():

out = df.groupby(['Player', df['DateTime'].dt.year])['Score'].expanding().mean().reset_index(name='Season Avg')
df['Career Avg by Last Season'] = df['Career Avg by Season'].groupby('Player').shift()

CodePudding user response:

Create a sample data set

import pandas
import numpy as np
df = pandas.DataFrame(
    {'player': ['B', 'A', 'A', 'B', 'A', 'B', 'B', 'A'],
     'datetime': ['2020-01-01', '2020-01-01', '2021-01-01', '2021-01-01',
                  '2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01'],
     'score': [40, 50, 100, 200, 160, 140, 160, 200],
    }
)
df["datetime"] = pandas.to_datetime(df["datetime"])
df["year"] = df["datetime"].dt.year

Use transform to add the current season average to the data frame

df["season_avg"] = df.groupby(["datetime", "player"])["score"].transform("mean")
df

  player   datetime  score  year  season_avg
0      B 2020-01-01     40  2020   40.000000
1      A 2020-01-01     50  2020   50.000000
2      A 2021-01-01    100  2021  153.333333
3      B 2021-01-01    200  2021  166.666667
4      A 2021-01-01    160  2021  153.333333
5      B 2021-01-01    140  2021  166.666667
6      B 2021-01-01    160  2021  166.666667
7      A 2021-01-01    200  2021  153.333333

Shift cannot be applied here because years are repeated

df.sort_values(["year"], ascending=True).groupby(["player"])["season_avg"].transform("shift")

0           NaN
1           NaN
2     50.000000
3     40.000000
4    153.333333
5    166.666667
6    166.666667
7    153.333333
Name: season_avg, dtype: float64

Compute the average from the previous year and join them to the original dataframe

savg = (df.groupby(["year", "player"])
        .agg(last_season_avg = ("score", "mean"))
        .reset_index())
savg["year"] = savg["year"]   1
savg

   year player  last_season_avg
0  2021      A        50.000000
1  2021      B        40.000000
2  2022      A       153.333333
3  2022      B       166.666667

df.merge(savg, on=["player", "year"], how="left" )

  player   datetime  score  year  season_avg  last_season_avg
0      B 2020-01-01     40  2020   40.000000              NaN
1      A 2020-01-01     50  2020   50.000000              NaN
2      A 2021-01-01    100  2021  153.333333             50.0
3      B 2021-01-01    200  2021  166.666667             40.0
4      A 2021-01-01    160  2021  153.333333             50.0
5      B 2021-01-01    140  2021  166.666667             40.0
6      B 2021-01-01    160  2021  166.666667             40.0
7      A 2021-01-01    200  2021  153.333333             50.0

Another way to compute the average from the previous year, using shift is maybe more elegant than doing year 1.

savg = (df.groupby(["year", "player"])
        .agg(season_avg = ("score", "mean"))
        .reset_index()
        .sort_values(["year"])
       )
savg["last_season_avg"] = savg.groupby(["player"])["season_avg"].transform("shift")
  • Related