I have a dataframe with the scores of a two-class classification model...
Observation | Class | Probability |
---|---|---|
1 | 0 | 0.5013 |
1 | 1 | 0.4987 |
2 | 0 | 0.5010 |
2 | 1 | 0.4990 |
3 | 0 | 0.5128 |
3 | 1 | 0.4872 |
I only care about the "winning" class (either 0 or 1) and its corresponding probability (the max. probability). What is the best way to group or modify this dataframe to only have 3 observations (in this case) with the "winning" class (0 or 1) and the "winning" probability?
For example, my desired output...
Observation | Class | Probability |
---|---|---|
1 | 0 | 0.5013 |
2 | 0 | 0.5010 |
3 | 0 | 0.5128 |
CodePudding user response:
To select some rows in a df by some rule you can use .loc or .query:
# timing: 300 µs ± 6 µs per loop
df.loc[df['Class'] == 0]
# timing: 1.17 ms ± 53 µs per loop
df.query('Class == 0')
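For anyone who wants to try the two selectors, here is a minimal sketch on the sample data from the question. The names `by_loc` and `by_query` are only illustrative. Note that filtering on `Class == 0` matches the desired output here only because class 0 has the higher probability for every observation in this sample.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Observation": [1, 1, 2, 2, 3, 3],
    "Class": [0, 1, 0, 1, 0, 1],
    "Probability": [0.5013, 0.4987, 0.5010, 0.4990, 0.5128, 0.4872],
})

# Boolean-mask selection with .loc
by_loc = df.loc[df["Class"] == 0]

# The same filter written as a query string
by_query = df.query("Class == 0")

# Both approaches select the same rows (index labels 0, 2 and 4).
print(by_loc.equals(by_query))
```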
CodePudding user response:
Pandas allows filtering the dataframe in an argmax()-like fashion with respect to the Probability column: sort with sort_values so that, within each Observation, the row with the highest (predicted) Probability comes first (Observation ascending, Probability descending), then keep the index of the first row per Observation. Here's the code:
df = df.loc[df.drop('Class', axis=1).sort_values(by = ['Observation', 'Probability'], ascending = [True, False])[['Observation']].drop_duplicates(keep="first").index]
and yields
df
   Observation  Class  Probability
0            1      0       0.5013
2            2      0       0.5010
4            3      0       0.5128
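As an alternative sketch of the same argmax-per-group idea, groupby combined with idxmax picks the winning row for each Observation without sorting. This is not the approach used in the answer above, just another common way to express it; the names `winner_idx` and `winners` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Observation": [1, 1, 2, 2, 3, 3],
    "Class": [0, 1, 0, 1, 0, 1],
    "Probability": [0.5013, 0.4987, 0.5010, 0.4990, 0.5128, 0.4872],
})

# For each Observation, get the index label of the row with the largest Probability.
winner_idx = df.groupby("Observation")["Probability"].idxmax()

# Select those rows; the original index labels are preserved.
winners = df.loc[winner_idx]
print(winners)
```

If two classes tie within an observation, idxmax returns the first occurrence, so only one row per Observation is kept.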