Pandas - Count column categorical value based on another column categorical value-CodePudding

I'm new to python and pandas. I have a data set with the columns age, sex, bmi, children, smoker, region, charges. I want to write a query to count all the smokers that answered "yes" based on their region. the regions can only be northwest, northeast, southwest, southeast.

I have tried several groupby and series commands but i'm not getting it. Can anyone help? Thanks in advance

I've tried:

data.groupby('Region').count()
data.groupby('Region').apply(lambda g: pd.Series(g['Smoker'].str.contains("y").count()))
data['Smoker'].value_counts().reindex(['Region'])

None of them worked.

CodePudding user response：

You can go for this :

df.groupby(['region', 'smoker']).count()

CodePudding user response：

One easy way is to first convert the Smoker column to a boolean:

data['Smoker'].str.contains('y')

And to chain this to a groupby.sum:

data['Smoker'].str.contains('y').groupby(data['Region']).sum()

Ensure to use a Series as grouper (not just the column name).

An alternative would be to use a custom function as you did, but with sum:

data.groupby('Region')['Smoker'].apply(lambda g: g.str.contains('y').sum())

Example output:

Region
A    3
B    1
C    0
Name: Smoker, dtype: int64

Used input:

  Region Smoker
0      A    yes
1      A    yes
2      A    yes
3      B    yes
4      B     no
5      B     no
6      C     no
7      C     no
8      C     no