Marking the highest value in a group in Pandas dataframe

Time:12-26

I have the following Pandas dataframe:

Fruit   Color   Version
Lemon   Green   1.4a
Lemon   Yellow  2.5.10
Lemon   Blue    6.7.a
Apple   Green   1.1
Banana  Yellow  2.5.8
Banana  Black   6.8.a

I would like to group by the "Color" column and sort by the "Version" column in descending order. Then I would like to add another column called "highest" that indicates whether the version in each row is the highest within its color group. The desired output is:

Fruit   Color   Version highest
Lemon   Green   1.4a    1
Apple   Green   1.1     0
Lemon   Yellow  2.5.10  1
Banana  Yellow  2.5.8   0
Banana  Black   6.8.a   1
Lemon   Blue    6.7.a   1

My biggest problem is sorting the Version column correctly, since using

df = df.sort_values(by=['Color', 'Version'], ascending=False)

returns the wrong answer: for the color "Yellow", version 2.5.8 is placed above 2.5.10, which is incorrect. The versions are strings, so they are compared character by character rather than numerically.
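The misordering can be reproduced without pandas. The snippet below is only an illustration: `as_ints` is a hypothetical helper, not part of any library, and it handles purely numeric versions only (it would fail on "1.4a").

```python
versions = ["2.5.8", "2.5.10"]

# String comparison goes character by character: at the fifth character
# "8" > "1", so "2.5.8" ranks above "2.5.10".
by_string = sorted(versions, reverse=True)
print(by_string)  # ['2.5.8', '2.5.10']


def as_ints(version):
    """Hypothetical helper: turn "2.5.10" into (2, 5, 10)."""
    return tuple(int(part) for part in version.split("."))


# Comparing the dot-separated parts as integers gives the intended order.
by_number = sorted(versions, key=as_ints, reverse=True)
print(by_number)  # ['2.5.10', '2.5.8']
```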

CodePudding user response:

You can use natsort here:

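from natsort import natsorted  # third-party package: pip install natsort

# Broadcast each color group's naturally sorted maximum version, then flag the rows that match it
df['highest'] = df.groupby('Color')['Version'].transform(lambda x: natsorted(list(x), reverse=True)[0])
df['highest'] = (df['highest'] == df['Version']).astype(int)
df = df.sort_values(by=['Color', 'highest'], ascending=False)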

Output:

    Fruit   Color Version  highest
1   Lemon  Yellow  2.5.10        1
4  Banana  Yellow   2.5.8        0
0   Lemon   Green    1.4a        1
3   Apple   Green     1.1        0
2   Lemon    Blue   6.7.a        1
5  Banana   Black   6.8.a        1
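If installing natsort is not an option, the same idea can be approximated with pandas and the standard library. The following is a minimal sketch, not the method used above: `natural_key` is a hypothetical helper that treats digit runs as integers and ranks them before letter runs, which may differ from natsort's ordering on unusual inputs. It also sorts the rows inside each color group by version, which the question asked for.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Lemon", "Lemon", "Lemon", "Apple", "Banana", "Banana"],
    "Color": ["Green", "Yellow", "Blue", "Green", "Yellow", "Black"],
    "Version": ["1.4a", "2.5.10", "6.7.a", "1.1", "2.5.8", "6.8.a"],
})


def natural_key(version):
    """Hypothetical helper: "2.5.10" -> [(0, 2), (0, 5), (0, 10)]."""
    chunks = re.findall(r"\d+|[A-Za-z]+", version)
    # Tag numbers with 0 and text with 1 so mixed chunks stay comparable.
    return [(0, int(c)) if c.isdigit() else (1, c) for c in chunks]


# Rank every distinct version once in plain Python, so pandas only ever
# has to compare integers.
ordered = sorted(df["Version"].unique(), key=natural_key)
rank = {version: position for position, version in enumerate(ordered)}
df["_rank"] = df["Version"].map(rank)

# A row is "highest" when its rank equals the maximum rank in its color group.
group_max = df.groupby("Color")["_rank"].transform("max")
df["highest"] = (df["_rank"] == group_max).astype(int)

# Sort by color, then by version rank, both descending; drop the helper column.
df = df.sort_values(by=["Color", "_rank"], ascending=False).drop(columns="_rank")
print(df)
```

On the sample data this yields the same row order and the same "highest" flags as the output above. If two rows in a group share the top version, both are marked 1.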