I have a huge dataframe of race data of athletes that looks like the following:
Race_ID Athlete_ID Distance Rank
1 1 100 3
2 1 400 6
3 1 1500 1
4 1 100 6
5 1 100 1
6 1 1500 1
7 2 100 1
8 2 400 2
9 2 400 1
10 2 1500 6
11 2 1500 4
12 2 100 1
13 2 100 1
and I want to make a new column called Dist_Preference
to see what distance (100 (short), 400 (middle), 1500 (long)) the athlete is more good at by calculating the conditional probability of the athlete of winning that distance. So for example:
P(athlete 1 winning a 100m race | it is a 100m race) = 2/3
P(athlete 1 winning a 400m race | it is a 400m race) = 0/1 = 0
P(athlete 1 winning a 1500m race | it is a 1500m race) = 2/2 = 1
Hence athlete 1 clearly has a preference of running long distance. and so fourth, and the desired column looks like
Race_ID Athlete_ID Distance Rank Dist_Preference
1 1 100 3 1500
2 1 400 6 1500
3 1 1500 1 1500
4 1 100 6 1500
5 1 100 1 1500
6 1 1500 1 1500
7 2 100 1 100
8 2 400 2 100
9 2 400 1 100
10 2 1500 6 100
11 2 1500 4 100
12 2 100 1 100
13 2 100 1 100
I have written a subroutine to do this by defining a method that calculates the conditional probabilities as above and then use .apply
but it is incredibly slow for large dataframe, I wonder if there is any quick way to do this.
Thank you so much.
CodePudding user response:
Create a boolean column: win
to identify the winning races, then pivot the dataframe with aggfunc mean
which will calculate the probabilities, then use idxmax
to find the preferred distance based on winning probability
df['win'] = df['Rank'] == 1
prob = df.pivot_table('win', 'Athlete_ID', 'Distance')
df['preferred_dist'] = df['Athlete_ID'].map(prob.idxmax(1))
Result
Race_ID Athlete_ID Distance Rank win preferred_dist
0 1 1 100 3 False 1500
1 2 1 400 6 False 1500
2 3 1 1500 1 True 1500
3 4 1 100 6 False 1500
4 5 1 100 1 True 1500
5 6 1 1500 1 True 1500
6 7 2 100 1 True 100
7 8 2 400 2 False 100
8 9 2 400 1 True 100
9 10 2 1500 6 False 100
10 11 2 1500 4 False 100
11 12 2 100 1 True 100
12 13 2 100 1 True 100
CodePudding user response:
try:
df
Race_ID Athlete_ID Distance Rank
1 1 100 3
2 1 400 6
3 1 1500 1
4 1 100 6
5 1 100 1
6 1 1500 1
7 2 100 1
8 2 400 2
9 2 400 1
10 2 1500 6
11 2 1500 4
12 2 100 1
13 2 100 1
df1 = df.groupby(['Athlete_ID', 'Distance']).agg(nbr_of_races=('Rank', lambda x: len(x)),
nbr_of_races_won=('Rank', lambda x: len([i for i in x if i == 1]))).reset_index()
df1['performance'] = (df1['nbr_of_races_won']*100/df1['nbr_of_races']).round(2)
df1
Athlete_ID Distance nbr_of_races nbr_of_races_won performance
0 1 100 3 1 33.33
1 1 400 1 0 0.00
2 1 1500 2 2 100.00
3 2 100 3 3 100.00
4 2 400 2 1 50.00
5 2 1500 2 0 0.00
#sort by performance
df1 = df1.sort_values(['Athlete_ID', 'performance'], ascending=[True, False])
Athlete_ID Distance nbr_of_races nbr_of_races_won performance
2 1 1500 2 2 100.00
0 1 100 3 1 33.33
1 1 400 1 0 0.00
3 2 100 3 3 100.00
4 2 400 2 1 50.00
5 2 1500 2 0 0.00
#keep only the best performed Distance for each Athlete
df1 = df1.drop_duplicates('Athlete_ID')
Athlete_ID Distance nbr_of_races nbr_of_races_won performance
2 1 1500 2 2 100.0
3 2 100 3 3 100.0
#merge
df = df.merge(df1.loc[:, ['Athlete_ID', 'Distance']].rename(columns={"Distance": "preferred_dist"}), on='Athlete_ID', how='left')
df
Race_ID Athlete_ID Distance Rank preferred_dist
0 1 1 100 3 1500
1 2 1 400 6 1500
2 3 1 1500 1 1500
3 4 1 100 6 1500
4 5 1 100 1 1500
5 6 1 1500 1 1500
6 7 2 100 1 100
7 8 2 400 2 100
8 9 2 400 1 100
9 10 2 1500 6 100
10 11 2 1500 4 100
11 12 2 100 1 100
12 13 2 100 1 100