Efficient way to generate new columns of the minimum value of certain subset of another column in Pa-CodePudding

I have a dataframe of performance of athletes in different races that looks like

Race_ID  Date       Athlete_ID Time  Rank
1        2022-10-12 1          55    3
1        2022-10-12 2          52    2
1        2022-10-12 3          48    1
1        2022-10-12 4          58    5
1        2022-10-12 5          59    6
1        2022-10-12 6          57    4
2        2022-09-17 1          43    2         
2        2022-09-17 2          48    4
2        2022-09-17 3          42    1
2        2022-09-17 4          50    5
2        2022-09-17 5          44    3
3        2022-08-11 1          56    4
3        2022-08-11 2          55    3
3        2022-08-11 3          51    2
3        2022-08-11 4          50    1
4        2022-05-30 1          43    2
4        2022-05-30 2          44    3
4        2022-05-30 3          40    1
4        2022-05-30 4          49    6
4        2022-05-30 5          48    5
4        2022-05-30 6          47    4

and I sort it according to Athlete_ID and Date:

df.sort_values(by=['Athlete_ID','Date], ascending=[True,True], inplace=True) and I get

Race_ID  Date       Athlete_ID Time  Rank
4        2022-05-30 1          43    2
3        2022-08-11 1          56    4
2        2022-09-17 1          43    2  
1        2022-10-12 1          55    3
4        2022-05-30 2          44    3
3        2022-08-11 2          55    3
2        2022-09-17 2          48    4
1        2022-10-12 2          52    2
4        2022-05-30 3          40    1
3        2022-08-11 3          51    2
2        2022-09-17 3          42    1
1        2022-10-12 3          48    1
4        2022-05-30 4          49    6
3        2022-08-11 4          50    1
2        2022-09-17 4          50    5
1        2022-10-12 4          58    5
4        2022-05-30 5          48    5
2        2022-09-17 5          44    3
1        2022-10-12 5          59    6
4        2022-05-30 6          47    4
1        2022-10-12 6          57    4

For each Athlete_ID, I want to generate a new column Minimum_time@t-1 whose value is the minimum time of the LAST race that the athlete ran, and 0 otherwise, so the desired output looks like:

Race_ID  Date       Athlete_ID Time  Rank Minimum_time@t-1
4        2022-05-30 1          43    2    0   #since that's the first race athlete1 ran
3        2022-08-11 1          56    4    40  #the last race athlete1 ran is race 4 and the fastest time is 40
2        2022-09-17 1          43    2    50  #the last race athlete1 ran is race 3 and the fastest time is 50
1        2022-10-12 1          55    3    42
4        2022-05-30 2          44    3    0
3        2022-08-11 2          55    3    40
2        2022-09-17 2          48    4    50
1        2022-10-12 2          52    2    42
4        2022-05-30 3          40    1    0
3        2022-08-11 3          51    2    40
2        2022-09-17 3          42    1    50
1        2022-10-12 3          48    1    42
4        2022-05-30 4          49    6    0  
3        2022-08-11 4          50    1    40
2        2022-09-17 4          50    5    50
1        2022-10-12 4          58    5    42
4        2022-05-30 5          48    5    0     
2        2022-09-17 5          44    3    40
1        2022-10-12 5          59    6    42
4        2022-05-30 6          47    4    0   #since that's the first race athlete6 ran
1        2022-10-12 6          57    4    40  #the last race athlete6 ran is race 4 and the fastest time is 40

The way I did this is to first define a funtion:

def minimum_time(Race_ID):
  return df.loc[df['Race_ID] == Race_ID]['Time'].min()

and then use shift to get the Race_ID for the last race of the athletes and then apply minimum_time to it:

df.sort_values(by=['Athlete_ID','Date'], ascending=[True,False], inplace=True)
df['Race_ID@t-1'] = df.groupby('Athlete_ID')['Race_ID'].shift(-1).replace(np.nan, 0)
df['Minimum_time@t-1'] = df['Race_ID@t-1'].map(minimum_time).replace(np.nan, 0)

So it works but it's very slow for large datasets. I wanna ask is there a more computationally efficient way to do this? Thank you.

CodePudding user response：

I would use a different method to get the minimum time, and fillna instead of replace:

# get min time per race
best = df.groupby('Race_ID')['Time'].min()

# shift to get the previous race
# map best time for this race, then fill NaNs with 0
df['Rank Minimum_time@t-1'] = (df.groupby('Athlete_ID')['Race_ID']
                                .shift(1).map(best)
                                .fillna(0, downcast='infer')
                              )

output:

    Race_ID        Date  Athlete_ID  Time  Rank  Rank Minimum_time@t-1
15        4  2022-05-30           1    43     2                      0
11        3  2022-08-11           1    56     4                     40
6         2  2022-09-17           1    43     2                     50
0         1  2022-10-12           1    55     3                     42
16        4  2022-05-30           2    44     3                      0
12        3  2022-08-11           2    55     3                     40
7         2  2022-09-17           2    48     4                     50
1         1  2022-10-12           2    52     2                     42
17        4  2022-05-30           3    40     1                      0
13        3  2022-08-11           3    51     2                     40
8         2  2022-09-17           3    42     1                     50
2         1  2022-10-12           3    48     1                     42
18        4  2022-05-30           4    49     6                      0
14        3  2022-08-11           4    50     1                     40
9         2  2022-09-17           4    50     5                     50
3         1  2022-10-12           4    58     5                     42
19        4  2022-05-30           5    48     5                      0
10        2  2022-09-17           5    44     3                     40
4         1  2022-10-12           5    59     6                     42
20        4  2022-05-30           6    47     4                      0
5         1  2022-10-12           6    57     4                     40