I want to create dataframe showing sales for a given year. The end goal is to show three products with the highest sales in the entire year, but with figures broken down by quarter.
I created dataframe result
based on another dataframe df
containing relevant data
result = df.groupby(['Product', 'Order Date']).sum()
Quantity Ordered
Product Order Date
Product 1 2019Q1 787
2019Q2 1067
2019Q3 865
2019Q4 1418
... ... ...
Product 90 2019Q1 1356
2019Q2 1884
2019Q3 1338
2019Q4 2287
Now I want to sort it so that products with the highest total 'Quantity Ordered' are on top, and then take only 3 top products.
If it's possible to sort it before groupby that would also work, as long as the end goal is reached and the resulting dataframe looks the same as the one shown above but sorted and limited to the top 3 products.
I'm asking because I couldn't find any solution to this specific case or anything similar enough to be applicable.
CodePudding user response:
IIUC, you need a second groupby after sorting the values:
(df
# group (as columns) and sum
.groupby(['Product', 'Order Date'], as_index=False).sum()
# now sort the values
.sort_values(by='Quantity Ordered')
# keep the top 3 per date
.groupby('Order Date').head(3)
)
NB. untested as no reproducible input was provided
CodePudding user response:
I ended up determining top three products before groupby and simply filtered out irrelevant products before grouping.
totalByProduct = (frame).groupby(['Product']).sum()
totalByProduct.sort_values(by=['Quantity Ordered'], ascending=False, inplace=True)
topProductsList = list(totalByProduct.head(3).index)
result = df[df['Product'].isin(topProductsList)].groupby(['Product', 'Order Date']).sum()