I'm new to python and pandas. I have a data set with the columns age, sex, bmi, children, smoker, region, charges. I want to write a query to count all the smokers that answered "yes" based on their region. the regions can only be northwest, northeast, southwest, southeast.
I have tried several groupby and series commands but i'm not getting it. Can anyone help? Thanks in advance
I've tried:
data.groupby('Region').count()
data.groupby('Region').apply(lambda g: pd.Series(g['Smoker'].str.contains("y").count()))
data['Smoker'].value_counts().reindex(['Region'])
None of them worked.
CodePudding user response:
You can go for this :
df.groupby(['region', 'smoker']).count()
CodePudding user response:
One easy way is to first convert the Smoker column to a boolean:
data['Smoker'].str.contains('y')
And to chain this to a groupby.sum
:
data['Smoker'].str.contains('y').groupby(data['Region']).sum()
Ensure to use a Series
as grouper (not just the column name).
An alternative would be to use a custom function as you did, but with sum
:
data.groupby('Region')['Smoker'].apply(lambda g: g.str.contains('y').sum())
Example output:
Region
A 3
B 1
C 0
Name: Smoker, dtype: int64
Used input:
Region Smoker
0 A yes
1 A yes
2 A yes
3 B yes
4 B no
5 B no
6 C no
7 C no
8 C no