Python - Grouping by multiple columns (Categorical dtype vs numeric dtype)- why are the results so d-CodePudding

I have a question about grouping pandas DataFrames by multiple columns. I am looking at some data for a TV show and trying to ensure that no season has two contestants with the same name.

Series	Name
1	David
1	Edward
1	Jasmine
2	Lea
2	Jonathan
2	Louise

I want a unique count for groupings of Series Name, which works well when the Series contains a numeric data type. I can do:

df.groupby(['Series','Name'])['Name'].count()

and get

Series	Name	Count
1	David	1
1	Edward	1
1	Jasmine	1
2	Lea	1
2	Jonathan	1
2	Louise	1

However, if series is set to a categorical data type then

df.groupby(['Series','Name'])['Name'].count()

returns the following table

Series	Name	Count
1	David	1
2	David	0
1	Edward	1
2	Edward	0
1	Jasmine	1
2	Jasmine	0
1	Jonathan	0
2	Jonathan	1
1	Lea	0
2	Lea	1
1	Louise	0
2	Louise	1

Panda groups every possible combination of series and names and then sorts alphanumerically. I don't understand why. Any help would be most appreciated.

CodePudding user response：

Actually, this is the default behaviour of Pandas when grouping categorical value, it adds missing categories (checkout this thread). To group only the observed categories on the dataframe you can use:

df.groupby(['Series','Name'],observed=True)['Name'].count()