Home > Back-end >  Efficient way to generate new columns of the minimum value of certain subset of another column in Pa
Efficient way to generate new columns of the minimum value of certain subset of another column in Pa

Time:10-02

I have a dataframe of performance of athletes in different races that looks like

Race_ID  Date       Athlete_ID Time  Rank
1        2022-10-12 1          55    3
1        2022-10-12 2          52    2
1        2022-10-12 3          48    1
1        2022-10-12 4          58    5
1        2022-10-12 5          59    6
1        2022-10-12 6          57    4
2        2022-09-17 1          43    2         
2        2022-09-17 2          48    4
2        2022-09-17 3          42    1
2        2022-09-17 4          50    5
2        2022-09-17 5          44    3
3        2022-08-11 1          56    4
3        2022-08-11 2          55    3
3        2022-08-11 3          51    2
3        2022-08-11 4          50    1
4        2022-05-30 1          43    2
4        2022-05-30 2          44    3
4        2022-05-30 3          40    1
4        2022-05-30 4          49    6
4        2022-05-30 5          48    5
4        2022-05-30 6          47    4

and I sort it according to Athlete_ID and Date:

df.sort_values(by=['Athlete_ID','Date], ascending=[True,True], inplace=True) and I get

Race_ID  Date       Athlete_ID Time  Rank
4        2022-05-30 1          43    2
3        2022-08-11 1          56    4
2        2022-09-17 1          43    2  
1        2022-10-12 1          55    3
4        2022-05-30 2          44    3
3        2022-08-11 2          55    3
2        2022-09-17 2          48    4
1        2022-10-12 2          52    2
4        2022-05-30 3          40    1
3        2022-08-11 3          51    2
2        2022-09-17 3          42    1
1        2022-10-12 3          48    1
4        2022-05-30 4          49    6
3        2022-08-11 4          50    1
2        2022-09-17 4          50    5
1        2022-10-12 4          58    5
4        2022-05-30 5          48    5
2        2022-09-17 5          44    3
1        2022-10-12 5          59    6
4        2022-05-30 6          47    4
1        2022-10-12 6          57    4

For each Athlete_ID, I want to generate a new column Minimum_time@t-1 whose value is the minimum time of the LAST race that the athlete ran, and 0 otherwise, so the desired output looks like:

Race_ID  Date       Athlete_ID Time  Rank Minimum_time@t-1
4        2022-05-30 1          43    2    0   #since that's the first race athlete1 ran
3        2022-08-11 1          56    4    40  #the last race athlete1 ran is race 4 and the fastest time is 40
2        2022-09-17 1          43    2    50  #the last race athlete1 ran is race 3 and the fastest time is 50
1        2022-10-12 1          55    3    42
4        2022-05-30 2          44    3    0
3        2022-08-11 2          55    3    40
2        2022-09-17 2          48    4    50
1        2022-10-12 2          52    2    42
4        2022-05-30 3          40    1    0
3        2022-08-11 3          51    2    40
2        2022-09-17 3          42    1    50
1        2022-10-12 3          48    1    42
4        2022-05-30 4          49    6    0  
3        2022-08-11 4          50    1    40
2        2022-09-17 4          50    5    50
1        2022-10-12 4          58    5    42
4        2022-05-30 5          48    5    0     
2        2022-09-17 5          44    3    40
1        2022-10-12 5          59    6    42
4        2022-05-30 6          47    4    0   #since that's the first race athlete6 ran
1        2022-10-12 6          57    4    40  #the last race athlete6 ran is race 4 and the fastest time is 40

The way I did this is to first define a funtion:

def minimum_time(Race_ID):
  return df.loc[df['Race_ID] == Race_ID]['Time'].min()

and then use shift to get the Race_ID for the last race of the athletes and then apply minimum_time to it:

df.sort_values(by=['Athlete_ID','Date'], ascending=[True,False], inplace=True)
df['Race_ID@t-1'] = df.groupby('Athlete_ID')['Race_ID'].shift(-1).replace(np.nan, 0)
df['Minimum_time@t-1'] = df['Race_ID@t-1'].map(minimum_time).replace(np.nan, 0)

So it works but it's very slow for large datasets. I wanna ask is there a more computationally efficient way to do this? Thank you.

CodePudding user response:

I would use a different method to get the minimum time, and fillna instead of replace:

# get min time per race
best = df.groupby('Race_ID')['Time'].min()

# shift to get the previous race
# map best time for this race, then fill NaNs with 0
df['Rank Minimum_time@t-1'] = (df.groupby('Athlete_ID')['Race_ID']
                                .shift(1).map(best)
                                .fillna(0, downcast='infer')
                              )

output:

    Race_ID        Date  Athlete_ID  Time  Rank  Rank Minimum_time@t-1
15        4  2022-05-30           1    43     2                      0
11        3  2022-08-11           1    56     4                     40
6         2  2022-09-17           1    43     2                     50
0         1  2022-10-12           1    55     3                     42
16        4  2022-05-30           2    44     3                      0
12        3  2022-08-11           2    55     3                     40
7         2  2022-09-17           2    48     4                     50
1         1  2022-10-12           2    52     2                     42
17        4  2022-05-30           3    40     1                      0
13        3  2022-08-11           3    51     2                     40
8         2  2022-09-17           3    42     1                     50
2         1  2022-10-12           3    48     1                     42
18        4  2022-05-30           4    49     6                      0
14        3  2022-08-11           4    50     1                     40
9         2  2022-09-17           4    50     5                     50
3         1  2022-10-12           4    58     5                     42
19        4  2022-05-30           5    48     5                      0
10        2  2022-09-17           5    44     3                     40
4         1  2022-10-12           5    59     6                     42
20        4  2022-05-30           6    47     4                      0
5         1  2022-10-12           6    57     4                     40
  • Related