I'm working on the tips
dataset:
# Load the data
tips = sns.load_dataset("tips")
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
I want to groupby
the time
and smoker
columns and then count the number of smokers in each group (including their percentage). I'm trying to achieve that in 2 steps:
Step 1:
# Count by groups
tips.groupby(['time', 'smoker']).agg({'smoker': 'count'})
smoker
time smoker
Lunch Yes 23
No 45
Dinner Yes 70
No 106
Step 2:
# Count with percentage
tips.groupby(['time', 'smoker']).agg({'smoker': 'count'}).transform(lambda x: x/x.sum()*100)
smoker
time smoker
Lunch Yes 9.426230
No 18.442623
Dinner Yes 28.688525
No 43.442623
I'm getting the results in 2 separate tables. I want to get the count
as well as percentage
in the same table.
Desired output:
count percentage
time smoker
Lunch Yes 23 9.426230
No 45 18.442623
Dinner Yes 70 28.688525
No 106 43.442623
Is there any better way of doing it? Any suggestions would be appreciated. Thanks!
CodePudding user response:
You can use the transform
to compute the percentage in another instruction and store the result in another column.
tips_grouped = tips.groupby(['time', 'smoker']).agg({'smoker': 'count'}).to_frame()
tips_grouped["percentage"] = tips_grouped["count"].transform(lambda x: x/x.sum()*100)
CodePudding user response:
You can do multiple aggregations on the same column using named aggregation:
output = df.groupby(['time', 'smoker'])["smoker"].agg(Count="count", Percentage=lambda x: x.count()/df.shape[0]*100)
>>> Count Percentage
time smoker
Dinner No 106 43.442623
Yes 70 28.688525
Lunch No 45 18.442623
Yes 23 9.426230
Input df:
import random
df = pd.DataFrame({"time": ["Lunch"]*68 ["Dinner"]*176,
"smoker": ["Yes"]*23 ["No"]*45 ["Yes"]*70 ["No"]*106,
"sex": random.choices(["Male","Female"],k=100)})