Groupby: Getting 'Count' and 'Percentage' In Same Table-CodePudding

I'm working on the tips dataset:

# Load the data
tips = sns.load_dataset("tips")
tips.head()

total_bill  tip     sex  smoker day  time  size
0   16.99   1.01    Female  No  Sun Dinner  2
1   10.34   1.66    Male    No  Sun Dinner  3
2   21.01   3.50    Male    No  Sun Dinner  3
3   23.68   3.31    Male    No  Sun Dinner  2
4   24.59   3.61    Female  No  Sun Dinner  4

I want to groupby the time and smoker columns and then count the number of smokers in each group (including their percentage). I'm trying to achieve that in 2 steps:

Step 1:

# Count by groups
tips.groupby(['time', 'smoker']).agg({'smoker': 'count'})
               smoker
time    smoker  
Lunch    Yes    23
         No     45
Dinner   Yes    70
         No    106

Step 2:

# Count with percentage
tips.groupby(['time', 'smoker']).agg({'smoker': 'count'}).transform(lambda x: x/x.sum()*100)
                 smoker
time    smoker  
Lunch    Yes    9.426230
          No    18.442623
Dinner   Yes    28.688525
          No    43.442623

I'm getting the results in 2 separate tables. I want to get the count as well as percentage in the same table.

Desired output:

               count   percentage
time    smoker  
Lunch    Yes    23     9.426230
         No     45     18.442623
Dinner   Yes    70     28.688525
         No    106     43.442623

Is there any better way of doing it? Any suggestions would be appreciated. Thanks!

CodePudding user response：

You can use the transform to compute the percentage in another instruction and store the result in another column.

tips_grouped = tips.groupby(['time', 'smoker']).agg({'smoker': 'count'}).to_frame()
tips_grouped["percentage"] = tips_grouped["count"].transform(lambda x: x/x.sum()*100)

CodePudding user response：

You can do multiple aggregations on the same column using named aggregation:

output = df.groupby(['time', 'smoker'])["smoker"].agg(Count="count", Percentage=lambda x: x.count()/df.shape[0]*100)

>>>                Count  Percentage
time   smoker                   
Dinner No        106   43.442623
       Yes        70   28.688525
Lunch  No         45   18.442623
       Yes        23    9.426230

Input df:

import random
df = pd.DataFrame({"time": ["Lunch"]*68 ["Dinner"]*176,
                   "smoker": ["Yes"]*23 ["No"]*45 ["Yes"]*70 ["No"]*106,
                   "sex": random.choices(["Male","Female"],k=100)})