Home > Net >  Cumsum starting value not 0
Cumsum starting value not 0

Time:09-21

I have the following dataframe df of athletes (indexed by Athlete_ID) and their Rank (indexed by Rank), here is a slide of the dataframe with a particular athlete:

Rank    Athlete_ID  Date
13      143         25/4/2021
1       143         5/4/2021
6       143         24/2/2021
11      143         24/1/2021
4       143         1/1/2021
9       143         13/12/2020
8       143         22/11/2020
1       143         23/9/2020
9       143         6/9/2020
10      143         20/5/2020
1       143         18/3/2020
7       143         26/2/2020
1       143         29/1/2020
1       143         18/12/2019
1       143         20/11/2019
7       143         2/3/2019
4       143         10/2/2019
7       143         27/6/2018
9       143         6/5/2018
2       143         7/1/2018
2       143         17/12/2017
1       143         5/11/2017
3       143         8/10/2017

I want to count the total number of wins (rank number 1) and recent number of wins (number of wins in this year), here is my code:

df['Athle_total_wins']=df.sort_values(['Athlete_ID','Date'],ascending=[True,True])['Rank'].shift(1).eq(1).groupby(df['Athlete_ID']).cumsum()
 
df['Athle_recent_wins']=df.sort_values(['Athlete_ID','Date'],ascending=[True,True])['Rank'].shift(1).eq(1).groupby([df['Athlete_ID'],df['Date'].dt.year]).cumsum()
 

and the output is

Rank    Athlete_ID  Date        Athle_total_wins    Athle_recent_wins
13      143         25/4/2021   8                   1
1       143         5/4/2021    7                   0
6       143         24/2/2021   7                   0
11      143         24/1/2021   7                   0
4       143         1/1/2021    7                   0
9       143         13/12/2020  7                   4
8       143         22/11/2020  7                   4
1       143         23/9/2020   6                   3
9       143         6/9/2020    6                   3
10      143         20/5/2020   6                   3
1       143         18/3/2020   5                   2
7       143         26/2/2020   5                   2
1       143         29/1/2020   4                   1
1       143         18/12/2019  3                   1
1       143         20/11/2019  2                   0
7       143         2/3/2019    2                   0
4       143         10/2/2019   2                   0
7       143         27/6/2018   2                   0
9       143         6/5/2018    2                   0
2       143         7/1/2018    2                   0
2       143         17/12/2017  2                   2
1       143         5/11/2017   1                   1
3       143         8/10/2017   1                   1

which is almost what I want but not exactly because the last row doesn't start at 0 (i.e. something is wrong for the last year in 2017, or the last few rows). The desired output should be

Rank    Athlete_ID  Date        Athle_total_wins    Athle_recent_wins
13      143         25/4/2021   7                   1
1       143         5/4/2021    6                   0
6       143         24/2/2021   6                   0
11      143         24/1/2021   6                   0
4       143         1/1/2021    6                   0
9       143         13/12/2020  6                   4
8       143         22/11/2020  6                   4
1       143         23/9/2020   5                   3
9       143         6/9/2020    5                   3
10      143         20/5/2020   5                   3
1       143         18/3/2020   4                   2
7       143         26/2/2020   4                   2
1       143         29/1/2020   3                   1
1       143         18/12/2019  2                   1
1       143         20/11/2019  1                   0
7       143         2/3/2019    1                   0
4       143         10/2/2019   1                   0
7       143         27/6/2018   1                   0
9       143         6/5/2018    1                   0
2       143         7/1/2018    1                   0
2       143         17/12/2017  1                   1
1       143         5/11/2017   0                   1
3       143         8/10/2017   0                   0

CodePudding user response:

You need to perform the shift inside the groupby. Also if wasn't clear whether the Date was datetime so I refactored the code a bit:

date = pd.to_datetime(df['Date'], dayfirst=True)
win = (df
 .assign(date=date)                          
 .sort_values(['Athlete_ID','date'],ascending=[True,True])
 ['Rank'].eq(1)
)

df['Athle_total_wins'] = (win
 .groupby(df['Athlete_ID']).apply(lambda g: g.shift(1, fill_value=0).cumsum())
)

df['Athle_recent_wins'] = (win
 .groupby([df['Athlete_ID'], date.dt.year]).apply(lambda g: g.shift(1, fill_value=0).cumsum())
)

output:

    Rank  Athlete_ID        Date Athle_total_wins Athle_recent_wins
0     13         143   25/4/2021                7                 1
1      1         143    5/4/2021                6                 0
2      6         143   24/2/2021                6                 0
3     11         143   24/1/2021                6                 0
4      4         143    1/1/2021                6                 0
5      9         143  13/12/2020                6                 3
6      8         143  22/11/2020                6                 3
7      1         143   23/9/2020                5                 2
8      9         143    6/9/2020                5                 2
9     10         143   20/5/2020                5                 2
10     1         143   18/3/2020                4                 1
11     7         143   26/2/2020                4                 1
12     1         143   29/1/2020                3                 0
13     1         143  18/12/2019                2                 1
14     1         143  20/11/2019                1                 0
15     7         143    2/3/2019                1                 0
16     4         143   10/2/2019                1                 0
17     7         143   27/6/2018                1                 0
18     9         143    6/5/2018                1                 0
19     2         143    7/1/2018                1                 0
20     2         143  17/12/2017                1                 1
21     1         143   5/11/2017                0                 0
22     3         143   8/10/2017                0                 0
  • Related