I have a data frame as follows.
df_sample=pd.DataFrame({'ID':['ID1','ID2','ID2','ID2','ID1','ID2','ID1','ID1'],
"quarter":['2016Q1','2016Q1','2016Q1','2017Q1','2017Q1','2018Q1','2018Q2','2018Q3'],
"product":['productA','productB','productA','productD','productA','productA','productD','ProductA'],
"sales":[100,200,100,400,100,500,400,100]})
I want to get the top product based on the cumulative sales amount for each ID. i.e. for ID1 for the 2018Q1 quarter, I want to take the sum of each product sold for all data <=2018Q1 and return the product name for each ID. Thanks in advance.
Expected output:
pd.DataFrame({'ID':['ID1','ID1','ID1','ID1', 'ID2','ID2','ID2'],
"quarter":['2016Q1','2017Q1','2018Q2','2018Q3','2016Q1','2017Q1','2018Q1'],
"product":['productA','productA','productD','productD','productB','ProductD','productA']})
CodePudding user response:
IIUC, you can use a double groupby
:
(df_sample
.groupby(['ID', 'quarter', 'product'])['sales'].sum()
.unstack('product', fill_value=0)
.groupby('ID').cumsum()
.idxmax(1)
)
output:
ID quarter
ID1 2016Q1 productA
2017Q1 productA
2018Q2 productD
2018Q3 productD
ID2 2016Q1 productB
2017Q1 productD
2018Q1 productA
dtype: object