Home > Mobile >  New column indicating preference by calculating conditional probability in Pandas Dataframe
New column indicating preference by calculating conditional probability in Pandas Dataframe

Time:10-02

I have a huge dataframe of race data of athletes that looks like the following:

Race_ID  Athlete_ID  Distance  Rank
1        1           100       3
2        1           400       6
3        1           1500      1
4        1           100       6
5        1           100       1
6        1           1500      1
7        2           100       1
8        2           400       2
9        2           400       1
10       2           1500      6
11       2           1500      4
12       2           100       1
13       2           100       1

and I want to make a new column called Dist_Preference to see what distance (100 (short), 400 (middle), 1500 (long)) the athlete is more good at by calculating the conditional probability of the athlete of winning that distance. So for example:

P(athlete 1 winning a 100m race | it is a 100m race) = 2/3

P(athlete 1 winning a 400m race | it is a 400m race) = 0/1 = 0

P(athlete 1 winning a 1500m race | it is a 1500m race) = 2/2 = 1

Hence athlete 1 clearly has a preference of running long distance. and so fourth, and the desired column looks like

Race_ID  Athlete_ID  Distance  Rank  Dist_Preference
1        1           100       3     1500
2        1           400       6     1500     
3        1           1500      1     1500
4        1           100       6     1500
5        1           100       1     1500
6        1           1500      1     1500
7        2           100       1     100
8        2           400       2     100
9        2           400       1     100
10       2           1500      6     100
11       2           1500      4     100
12       2           100       1     100
13       2           100       1     100

I have written a subroutine to do this by defining a method that calculates the conditional probabilities as above and then use .apply but it is incredibly slow for large dataframe, I wonder if there is any quick way to do this.

Thank you so much.

CodePudding user response:

Create a boolean column: win to identify the winning races, then pivot the dataframe with aggfunc mean which will calculate the probabilities, then use idxmax to find the preferred distance based on winning probability

df['win'] = df['Rank'] == 1
prob = df.pivot_table('win', 'Athlete_ID', 'Distance')
df['preferred_dist'] = df['Athlete_ID'].map(prob.idxmax(1))

Result

    Race_ID  Athlete_ID  Distance  Rank    win  preferred_dist
0         1           1       100     3  False            1500
1         2           1       400     6  False            1500
2         3           1      1500     1   True            1500
3         4           1       100     6  False            1500
4         5           1       100     1   True            1500
5         6           1      1500     1   True            1500
6         7           2       100     1   True             100
7         8           2       400     2  False             100
8         9           2       400     1   True             100
9        10           2      1500     6  False             100
10       11           2      1500     4  False             100
11       12           2       100     1   True             100
12       13           2       100     1   True             100

CodePudding user response:

try:

df
Race_ID  Athlete_ID  Distance  Rank
1        1           100       3
2        1           400       6
3        1           1500      1
4        1           100       6
5        1           100       1
6        1           1500      1
7        2           100       1
8        2           400       2
9        2           400       1
10       2           1500      6
11       2           1500      4
12       2           100       1
13       2           100       1

df1 = df.groupby(['Athlete_ID', 'Distance']).agg(nbr_of_races=('Rank', lambda x: len(x)),
     nbr_of_races_won=('Rank', lambda x: len([i for i in x if i == 1]))).reset_index()

df1['performance'] = (df1['nbr_of_races_won']*100/df1['nbr_of_races']).round(2)

df1
    Athlete_ID  Distance    nbr_of_races    nbr_of_races_won    performance
0   1           100         3               1                   33.33
1   1           400         1               0                   0.00
2   1           1500        2               2                   100.00
3   2           100         3               3                   100.00
4   2           400         2               1                   50.00
5   2           1500        2               0                   0.00

#sort by performance
df1 = df1.sort_values(['Athlete_ID', 'performance'], ascending=[True, False])

    Athlete_ID  Distance    nbr_of_races    nbr_of_races_won    performance
2   1           1500        2               2                   100.00
0   1           100         3               1                   33.33
1   1           400         1               0                   0.00
3   2           100         3               3                   100.00
4   2           400         2               1                   50.00
5   2           1500        2               0                   0.00

#keep only the best performed Distance for each Athlete
df1 = df1.drop_duplicates('Athlete_ID') 

    Athlete_ID  Distance    nbr_of_races    nbr_of_races_won    performance
2   1           1500        2               2                   100.0
3   2           100         3               3                   100.0

#merge 
df = df.merge(df1.loc[:, ['Athlete_ID', 'Distance']].rename(columns={"Distance": "preferred_dist"}), on='Athlete_ID', how='left')

df

    Race_ID Athlete_ID  Distance    Rank    preferred_dist
0   1       1           100         3       1500
1   2       1           400         6       1500
2   3       1           1500        1       1500
3   4       1           100         6       1500
4   5       1           100         1       1500
5   6       1           1500        1       1500
6   7       2           100         1       100
7   8       2           400         2       100
8   9       2           400         1       100
9   10      2           1500        6       100
10  11      2           1500        4       100
11  12      2           100         1       100
12  13      2           100         1       100


  • Related