I have a Dataframe with 4 columns: 'ID' (clients), 'item', 'tier' (high/low), 'units' (number). Now for each item and each tier I would like to find the total units and how many clients bough at least one item for each tier. I do this with
df.groupby(['item','tier']).agg(
ID_amount=('ID', 'size'),
total_units=('units', 'sum'))
item tier ID_amount total_units
100010001 high 83 178,871.00
low 153 1,450,986.00
100010002 high 722 10,452,778.00
low 911 5,505,136.00
100020001 high 400 876,490.00
low 402 962,983.00
100020002 high 4933 61,300,403.00
low 13759 1,330,932,723.00
100020003 high 15063 176,846,161.00
low 24905 288,232,057.00
What I would like is to have another column that represents the percentage of the 'total_units' column. When I try
df.groupby(['item','tier']).agg(
ID_amount=('ID', 'size'),
total_units=('units', 'sum'),
percen_units=('units', lambda x: 100*x/x.sum())
it gives the error Must produce aggregated value. How can I modify my code to give me those percentages? Thanks
CodePudding user response:
I think you want this:
dfs = df.groupby(['item','tier']).agg(
ID_amount=('ID', 'size'),
total_units=('units', 'sum'))
dfs['percent_units'] = dfs.groupby('item')['total_units']\
.transform(lambda x: x/x.sum()*100)
dfs