I have a dataframe of performance of athletes in different races that looks like
Race_ID Date Athlete_ID Time Rank
1 2022-10-12 1 55 3
1 2022-10-12 2 52 2
1 2022-10-12 3 48 1
1 2022-10-12 4 58 5
1 2022-10-12 5 59 6
1 2022-10-12 6 57 4
2 2022-09-17 1 43 2
2 2022-09-17 2 48 4
2 2022-09-17 3 42 1
2 2022-09-17 4 50 5
2 2022-09-17 5 44 3
3 2022-08-11 1 56 4
3 2022-08-11 2 55 3
3 2022-08-11 3 51 2
3 2022-08-11 4 50 1
4 2022-05-30 1 43 2
4 2022-05-30 2 44 3
4 2022-05-30 3 40 1
4 2022-05-30 4 49 6
4 2022-05-30 5 48 5
4 2022-05-30 6 47 4
and I sort it according to Athlete_ID
and Date
:
df.sort_values(by=['Athlete_ID','Date], ascending=[True,True], inplace=True)
and I get
Race_ID Date Athlete_ID Time Rank
4 2022-05-30 1 43 2
3 2022-08-11 1 56 4
2 2022-09-17 1 43 2
1 2022-10-12 1 55 3
4 2022-05-30 2 44 3
3 2022-08-11 2 55 3
2 2022-09-17 2 48 4
1 2022-10-12 2 52 2
4 2022-05-30 3 40 1
3 2022-08-11 3 51 2
2 2022-09-17 3 42 1
1 2022-10-12 3 48 1
4 2022-05-30 4 49 6
3 2022-08-11 4 50 1
2 2022-09-17 4 50 5
1 2022-10-12 4 58 5
4 2022-05-30 5 48 5
2 2022-09-17 5 44 3
1 2022-10-12 5 59 6
4 2022-05-30 6 47 4
1 2022-10-12 6 57 4
For each Athlete_ID
, I want to generate a new column Minimum_time@t-1
whose value is the minimum time of the LAST race that the athlete ran, and 0 otherwise, so the desired output looks like:
Race_ID Date Athlete_ID Time Rank Minimum_time@t-1
4 2022-05-30 1 43 2 0 #since that's the first race athlete1 ran
3 2022-08-11 1 56 4 40 #the last race athlete1 ran is race 4 and the fastest time is 40
2 2022-09-17 1 43 2 50 #the last race athlete1 ran is race 3 and the fastest time is 50
1 2022-10-12 1 55 3 42
4 2022-05-30 2 44 3 0
3 2022-08-11 2 55 3 40
2 2022-09-17 2 48 4 50
1 2022-10-12 2 52 2 42
4 2022-05-30 3 40 1 0
3 2022-08-11 3 51 2 40
2 2022-09-17 3 42 1 50
1 2022-10-12 3 48 1 42
4 2022-05-30 4 49 6 0
3 2022-08-11 4 50 1 40
2 2022-09-17 4 50 5 50
1 2022-10-12 4 58 5 42
4 2022-05-30 5 48 5 0
2 2022-09-17 5 44 3 40
1 2022-10-12 5 59 6 42
4 2022-05-30 6 47 4 0 #since that's the first race athlete6 ran
1 2022-10-12 6 57 4 40 #the last race athlete6 ran is race 4 and the fastest time is 40
The way I did this is to first define a funtion:
def minimum_time(Race_ID):
return df.loc[df['Race_ID] == Race_ID]['Time'].min()
and then use shift
to get the Race_ID
for the last race of the athletes and then apply minimum_time
to it:
df.sort_values(by=['Athlete_ID','Date'], ascending=[True,False], inplace=True)
df['Race_ID@t-1'] = df.groupby('Athlete_ID')['Race_ID'].shift(-1).replace(np.nan, 0)
df['Minimum_time@t-1'] = df['Race_ID@t-1'].map(minimum_time).replace(np.nan, 0)
So it works but it's very slow for large datasets. I wanna ask is there a more computationally efficient way to do this? Thank you.
CodePudding user response:
I would use a different method to get the minimum time, and fillna
instead of replace:
# get min time per race
best = df.groupby('Race_ID')['Time'].min()
# shift to get the previous race
# map best time for this race, then fill NaNs with 0
df['Rank Minimum_time@t-1'] = (df.groupby('Athlete_ID')['Race_ID']
.shift(1).map(best)
.fillna(0, downcast='infer')
)
output:
Race_ID Date Athlete_ID Time Rank Rank Minimum_time@t-1
15 4 2022-05-30 1 43 2 0
11 3 2022-08-11 1 56 4 40
6 2 2022-09-17 1 43 2 50
0 1 2022-10-12 1 55 3 42
16 4 2022-05-30 2 44 3 0
12 3 2022-08-11 2 55 3 40
7 2 2022-09-17 2 48 4 50
1 1 2022-10-12 2 52 2 42
17 4 2022-05-30 3 40 1 0
13 3 2022-08-11 3 51 2 40
8 2 2022-09-17 3 42 1 50
2 1 2022-10-12 3 48 1 42
18 4 2022-05-30 4 49 6 0
14 3 2022-08-11 4 50 1 40
9 2 2022-09-17 4 50 5 50
3 1 2022-10-12 4 58 5 42
19 4 2022-05-30 5 48 5 0
10 2 2022-09-17 5 44 3 40
4 1 2022-10-12 5 59 6 42
20 4 2022-05-30 6 47 4 0
5 1 2022-10-12 6 57 4 40