I have the following dataframe df
of athletes (indexed by Athlete_ID
) and their Rank (indexed by Rank
), here is a slide of the dataframe with a particular athlete:
Rank Athlete_ID Date
13 143 25/4/2021
1 143 5/4/2021
6 143 24/2/2021
11 143 24/1/2021
4 143 1/1/2021
9 143 13/12/2020
8 143 22/11/2020
1 143 23/9/2020
9 143 6/9/2020
10 143 20/5/2020
1 143 18/3/2020
7 143 26/2/2020
1 143 29/1/2020
1 143 18/12/2019
1 143 20/11/2019
7 143 2/3/2019
4 143 10/2/2019
7 143 27/6/2018
9 143 6/5/2018
2 143 7/1/2018
2 143 17/12/2017
1 143 5/11/2017
3 143 8/10/2017
I want to count the total number of wins (rank number 1) and recent number of wins (number of wins in this year), here is my code:
df['Athle_total_wins']=df.sort_values(['Athlete_ID','Date'],ascending=[True,True])['Rank'].shift(1).eq(1).groupby(df['Athlete_ID']).cumsum()
df['Athle_recent_wins']=df.sort_values(['Athlete_ID','Date'],ascending=[True,True])['Rank'].shift(1).eq(1).groupby([df['Athlete_ID'],df['Date'].dt.year]).cumsum()
and the output is
Rank Athlete_ID Date Athle_total_wins Athle_recent_wins
13 143 25/4/2021 8 1
1 143 5/4/2021 7 0
6 143 24/2/2021 7 0
11 143 24/1/2021 7 0
4 143 1/1/2021 7 0
9 143 13/12/2020 7 4
8 143 22/11/2020 7 4
1 143 23/9/2020 6 3
9 143 6/9/2020 6 3
10 143 20/5/2020 6 3
1 143 18/3/2020 5 2
7 143 26/2/2020 5 2
1 143 29/1/2020 4 1
1 143 18/12/2019 3 1
1 143 20/11/2019 2 0
7 143 2/3/2019 2 0
4 143 10/2/2019 2 0
7 143 27/6/2018 2 0
9 143 6/5/2018 2 0
2 143 7/1/2018 2 0
2 143 17/12/2017 2 2
1 143 5/11/2017 1 1
3 143 8/10/2017 1 1
which is almost what I want but not exactly because the last row doesn't start at 0 (i.e. something is wrong for the last year in 2017, or the last few rows). The desired output should be
Rank Athlete_ID Date Athle_total_wins Athle_recent_wins
13 143 25/4/2021 7 1
1 143 5/4/2021 6 0
6 143 24/2/2021 6 0
11 143 24/1/2021 6 0
4 143 1/1/2021 6 0
9 143 13/12/2020 6 4
8 143 22/11/2020 6 4
1 143 23/9/2020 5 3
9 143 6/9/2020 5 3
10 143 20/5/2020 5 3
1 143 18/3/2020 4 2
7 143 26/2/2020 4 2
1 143 29/1/2020 3 1
1 143 18/12/2019 2 1
1 143 20/11/2019 1 0
7 143 2/3/2019 1 0
4 143 10/2/2019 1 0
7 143 27/6/2018 1 0
9 143 6/5/2018 1 0
2 143 7/1/2018 1 0
2 143 17/12/2017 1 1
1 143 5/11/2017 0 1
3 143 8/10/2017 0 0
CodePudding user response:
You need to perform the shift
inside the groupby
. Also if wasn't clear whether the Date was datetime so I refactored the code a bit:
date = pd.to_datetime(df['Date'], dayfirst=True)
win = (df
.assign(date=date)
.sort_values(['Athlete_ID','date'],ascending=[True,True])
['Rank'].eq(1)
)
df['Athle_total_wins'] = (win
.groupby(df['Athlete_ID']).apply(lambda g: g.shift(1, fill_value=0).cumsum())
)
df['Athle_recent_wins'] = (win
.groupby([df['Athlete_ID'], date.dt.year]).apply(lambda g: g.shift(1, fill_value=0).cumsum())
)
output:
Rank Athlete_ID Date Athle_total_wins Athle_recent_wins
0 13 143 25/4/2021 7 1
1 1 143 5/4/2021 6 0
2 6 143 24/2/2021 6 0
3 11 143 24/1/2021 6 0
4 4 143 1/1/2021 6 0
5 9 143 13/12/2020 6 3
6 8 143 22/11/2020 6 3
7 1 143 23/9/2020 5 2
8 9 143 6/9/2020 5 2
9 10 143 20/5/2020 5 2
10 1 143 18/3/2020 4 1
11 7 143 26/2/2020 4 1
12 1 143 29/1/2020 3 0
13 1 143 18/12/2019 2 1
14 1 143 20/11/2019 1 0
15 7 143 2/3/2019 1 0
16 4 143 10/2/2019 1 0
17 7 143 27/6/2018 1 0
18 9 143 6/5/2018 1 0
19 2 143 7/1/2018 1 0
20 2 143 17/12/2017 1 1
21 1 143 5/11/2017 0 0
22 3 143 8/10/2017 0 0