Most efficient way to take max of classifier scores in Python and / or PySpark

Time:08-12

I have a dataframe with the scores of a two-class classification model...

Observation  Class  Probability
          1      0       0.5013
          1      1       0.4987
          2      0       0.5010
          2      1       0.4990
          3      0       0.5128
          3      1       0.4872

I only care about the "winning" class (either 0 or 1) of each observation and its corresponding probability (the maximum of the two). What is the best way to group or modify this dataframe so that it keeps only one row per observation (3 rows in this case), containing the "winning" class and the "winning" probability?

For example, my desired output...

Observation  Class  Probability
          1      0       0.5013
          2      0       0.5010
          3      0       0.5128

CodePudding user response:

To select rows of a DataFrame that satisfy a condition you can use .loc or .query:

timing: 300 µs ± 6 µs per loop

df.loc[df['Class']==0]

timing: 1.17 ms ± 53 µs per loop

df.query('Class == 0')

CodePudding user response:

Pandas allows filtering the dataframe in an argmax()-like fashion with respect to the Probability column: sort with sort_values (Observation ascending, Probability descending) and keep the row indices of the first, i.e. highest-probability, row per Observation. Here's the code

idx = (df.sort_values(by=['Observation', 'Probability'], ascending=[True, False])
         .drop_duplicates(subset='Observation', keep='first')
         .index)
df = df.loc[idx]

and yields

df
>    Observation  Class  Probability
  0            1      0       0.5013
  2            2      0       0.5010
  4            3      0       0.5128