My goal is to group the df
by the 'Grouping' column, and then find the max sales and the corresponding SKU. There a lot of sources that explain how to do the groupby
and find max sales, but adding this additional SKU column is where I am unable to find any sources that explain what I am doing wrong.
As you can see below, I am successfully grouping by the 'Grouping' column, and displaying the max Sales per group, but the SKU is incorrect. The SKU with the highest sales for group 1 is 'E2-MKEP', and my result is 'V4-DE5U'.
import pandas as pd
list1 = [
['1', 'V4-DE5U', 956.64],
['1', 'DH-Q9OY', 642.43],
['1', 'E2-MKEP', 1071.6],
['2', 'WL-NOLZ', 389.06],
['2', 'JF-4E3C', 162.69],
['3', 'N9-DABP', 618.96],
['3', 'OO-JBHE', 1451.19],
]
cols = ['Grouping', 'SKU', 'Sales']
df = pd.DataFrame(list1, columns = cols)
df1 = df.groupby(['Grouping']).agg(max)[['Sales', 'SKU']]
print(df1)
result:
Sales SKU
Grouping
1 1071.60 V4-DE5U
2 389.06 WL-NOLZ
3 1451.19 OO-JBHE
CodePudding user response:
You could transform max
and create a boolean mask and filter df
:
msk = df.groupby(['Grouping'])['Sales'].transform('max') == df['Sales']
out = df.loc[msk, ['Sales', 'SKU']]
Output:
Sales SKU
2 1071.60 E2-MKEP
3 389.06 WL-NOLZ
6 1451.19 OO-JBHE
If you want the "Grouping" column as well, you could use msk
on df
without selecting columns:
out = df[msk]
Output:
Grouping SKU Sales
2 1 E2-MKEP 1071.60
3 2 WL-NOLZ 389.06
6 3 OO-JBHE 1451.19
CodePudding user response:
You can sort the "Sales" values and then drop the "Grouping" duplicates
df1 = df1.sort_values('Sales', ascending=False).drop_duplicates(['Grouping']).sort_values('Grouping')
df1.reset_index(drop=True, inplace=True)
>>> df1
Grouping SKU Sales
0 1 E2-MKEP 1071.60
1 2 WL-NOLZ 389.06
2 3 OO-JBHE 1451.19