I have the following Pandas dataframe:
Fruit Color Version
Lemon Green 1.4a
Lemon Yellow 2.5.10
Lemon Blue 6.7.a
Apple Green 1.1
Banana Yellow 2.5.8
Banana Black 6.8.a
I would like to group by the "Color" column and sort by the "Version" column in descending order, and then add another column called "highest" that indicates whether the version in that row is the highest within its color group. The desired output is:
Fruit Color Version highest
Lemon Green 1.4a 1
Apple Green 1.1 0
Lemon Yellow 2.5.10 1
Banana Yellow 2.5.8 0
Banana Black 6.8.a 1
Lemon Blue 6.7.a 1
My biggest problem is sorting the Version column correctly, since using
df = df.sort_values(by=['Color', 'Version'], ascending=False)
would return the wrong order: for the color "Yellow", version 2.5.8 would be placed above 2.5.10, which is incorrect.
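A quick check in plain Python seems to show the cause, assuming the Version column holds ordinary strings: strings are compared character by character, so "8" sorts after "1" regardless of the digits that follow. The variable names here are only for illustration.

```python
# Plain string comparison looks at one character at a time,
# so "2.5.8" is considered greater than "2.5.10" ("8" > "1").
versions = ["2.5.10", "2.5.8"]
lexicographic = sorted(versions, reverse=True)
print(lexicographic)  # ['2.5.8', '2.5.10']
```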
CodePudding user response:
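You can use natsorted from the natsort package (pip install natsort):
from natsort import natsorted
# Highest version per color, using natural (numeric-aware) ordering
df['highest'] = df.groupby('Color')['Version'].transform(lambda x: natsorted(list(x), reverse=True)[0])
# 1 where the row holds its group's highest version, else 0
df['highest'] = (df['highest'] == df['Version']).astype(int)
df = df.sort_values(by=['Color', 'highest'], ascending=False)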
Output:
Fruit Color Version highest
1 Lemon Yellow 2.5.10 1
4 Banana Yellow 2.5.8 0
0 Lemon Green 1.4a 1
3 Apple Green 1.1 0
2 Lemon Blue 6.7.a 1
5 Banana Black 6.8.a 1
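If adding a dependency is not an option, the idea can be sketched with the standard library alone. The helper natural_key below is a hypothetical hand-rolled stand-in, not natsort's actual algorithm: it splits each version into text and digit chunks and compares the digit chunks as integers. The rows are the ones from the question, kept as plain tuples so the sketch runs without pandas.

```python
import re

def natural_key(version):
    """Split a version string into alternating text and integer chunks."""
    # re.split with a capturing group always yields text at even positions
    # and digit runs at odd positions, so chunks at the same position are
    # always of the same type and can be compared safely.
    parts = re.split(r"(\d+)", version)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

rows = [
    ("Lemon", "Green", "1.4a"),
    ("Lemon", "Yellow", "2.5.10"),
    ("Lemon", "Blue", "6.7.a"),
    ("Apple", "Green", "1.1"),
    ("Banana", "Yellow", "2.5.8"),
    ("Banana", "Black", "6.8.a"),
]

# Track the highest version seen so far for each color.
best = {}
for _, color, version in rows:
    if color not in best or natural_key(version) > natural_key(best[color]):
        best[color] = version

# 1 where the row holds its color's highest version, else 0.
flags = [int(version == best[color]) for _, color, version in rows]
print(best)
print(flags)
```

With the dataframe from the question, the same key could probably replace natsorted in the transform step, for example transform(lambda x: max(x, key=natural_key)). Version formats beyond the ones shown here (pre-release tags, mixed case, and so on) may need a more careful key.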