I am trying to look for a method to find a Player's mean score for the "Last Season"
(Previous Year) and add it in a new column in the original dataframe df
.
I have coded a formula to get a Player's mean score for the current year, excluding the current row, which is as follows:
df['Season Avg'] = df.groupby([df['Player'], df['DateTime'].dt.year])['Score']
.apply(lambda x: x.shift(1).expanding().mean())
However, despite my best attempt at using the shift
function, I can not quite work out how to calculate the previous years mean ("Last Season Avg"
) directly into a new column.
The dataframe is set out as follows:
Player | DateTime | Score | Season Avg |
---|---|---|---|
PlayerB | 2020-MM-DD HH:MM:SS | 40 | NaN |
PlayerA | 2020-MM-DD HH:MM:SS | 50 | NaN |
PlayerA | 2021-MM-DD HH:MM:SS | 100 | NaN |
PlayerB | 2021-MM-DD HH:MM:SS | 200 | NaN |
PlayerA | 2021-MM-DD HH:MM:SS | 160 | 100 |
PlayerB | 2021-MM-DD HH:MM:SS | 140 | 200 |
PlayerB | 2021-MM-DD HH:MM:SS | 160 | 170 |
PlayerA | 2021-MM-DD HH:MM:SS | 200 | 130 |
The new ideal dataframe that I would like:
Player | DateTime | Score | Season Avg | Last Season Avg |
---|---|---|---|---|
PlayerB | 2020-MM-DD HH:MM:SS | 40 | NaN | NaN |
PlayerA | 2020-MM-DD HH:MM:SS | 50 | NaN | NaN |
PlayerA | 2021-MM-DD HH:MM:SS | 100 | NaN | 50 |
PlayerB | 2021-MM-DD HH:MM:SS | 200 | NaN | 40 |
PlayerA | 2021-MM-DD HH:MM:SS | 160 | 100 | 50 |
PlayerB | 2021-MM-DD HH:MM:SS | 140 | 200 | 40 |
PlayerB | 2021-MM-DD HH:MM:SS | 160 | 170 | 40 |
PlayerA | 2021-MM-DD HH:MM:SS | 200 | 130 | 50 |
CodePudding user response:
You can groupby
once by the "Player" and the year to find the yearly average for each player; then groupby
"Player" shift
to get the previous year's previous year's averages.
out = df.groupby(['Player', df['DateTime'].dt.year])['Score'].mean().reset_index(name='Season Avg')
out['Last Season Avg'] = out['Season Avg'].groupby('Player').shift()
If you're looking for career averages until a particular season, you could use expanding().mean()
:
out = df.groupby(['Player', df['DateTime'].dt.year])['Score'].expanding().mean().reset_index(name='Season Avg')
df['Career Avg by Last Season'] = df['Career Avg by Season'].groupby('Player').shift()
CodePudding user response:
Create a sample data set
import pandas
import numpy as np
df = pandas.DataFrame(
{'player': ['B', 'A', 'A', 'B', 'A', 'B', 'B', 'A'],
'datetime': ['2020-01-01', '2020-01-01', '2021-01-01', '2021-01-01',
'2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01'],
'score': [40, 50, 100, 200, 160, 140, 160, 200],
}
)
df["datetime"] = pandas.to_datetime(df["datetime"])
df["year"] = df["datetime"].dt.year
Use transform to add the current season average to the data frame
df["season_avg"] = df.groupby(["datetime", "player"])["score"].transform("mean")
df
player datetime score year season_avg
0 B 2020-01-01 40 2020 40.000000
1 A 2020-01-01 50 2020 50.000000
2 A 2021-01-01 100 2021 153.333333
3 B 2021-01-01 200 2021 166.666667
4 A 2021-01-01 160 2021 153.333333
5 B 2021-01-01 140 2021 166.666667
6 B 2021-01-01 160 2021 166.666667
7 A 2021-01-01 200 2021 153.333333
Shift cannot be applied here because years are repeated
df.sort_values(["year"], ascending=True).groupby(["player"])["season_avg"].transform("shift")
0 NaN
1 NaN
2 50.000000
3 40.000000
4 153.333333
5 166.666667
6 166.666667
7 153.333333
Name: season_avg, dtype: float64
Compute the average from the previous year and join them to the original dataframe
savg = (df.groupby(["year", "player"])
.agg(last_season_avg = ("score", "mean"))
.reset_index())
savg["year"] = savg["year"] 1
savg
year player last_season_avg
0 2021 A 50.000000
1 2021 B 40.000000
2 2022 A 153.333333
3 2022 B 166.666667
df.merge(savg, on=["player", "year"], how="left" )
player datetime score year season_avg last_season_avg
0 B 2020-01-01 40 2020 40.000000 NaN
1 A 2020-01-01 50 2020 50.000000 NaN
2 A 2021-01-01 100 2021 153.333333 50.0
3 B 2021-01-01 200 2021 166.666667 40.0
4 A 2021-01-01 160 2021 153.333333 50.0
5 B 2021-01-01 140 2021 166.666667 40.0
6 B 2021-01-01 160 2021 166.666667 40.0
7 A 2021-01-01 200 2021 153.333333 50.0
Another way to compute the average from the previous year, using shift
is maybe more elegant than doing year 1
.
savg = (df.groupby(["year", "player"])
.agg(season_avg = ("score", "mean"))
.reset_index()
.sort_values(["year"])
)
savg["last_season_avg"] = savg.groupby(["player"])["season_avg"].transform("shift")