I have a question about grouping pandas DataFrames by multiple columns. I am looking at some data for a TV show and trying to ensure that no season has two contestants with the same name.
Series | Name |
---|---|
1 | David |
1 | Edward |
1 | Jasmine |
2 | Lea |
2 | Jonathan |
2 | Louise |
I want a unique count for groupings of Series Name, which works well when the Series contains a numeric data type. I can do:
df.groupby(['Series','Name'])['Name'].count()
and get
Series | Name | Count |
---|---|---|
1 | David | 1 |
1 | Edward | 1 |
1 | Jasmine | 1 |
2 | Lea | 1 |
2 | Jonathan | 1 |
2 | Louise | 1 |
However, if series is set to a categorical data type then
df.groupby(['Series','Name'])['Name'].count()
returns the following table
Series | Name | Count |
---|---|---|
1 | David | 1 |
2 | David | 0 |
1 | Edward | 1 |
2 | Edward | 0 |
1 | Jasmine | 1 |
2 | Jasmine | 0 |
1 | Jonathan | 0 |
2 | Jonathan | 1 |
1 | Lea | 0 |
2 | Lea | 1 |
1 | Louise | 0 |
2 | Louise | 1 |
Panda groups every possible combination of series and names and then sorts alphanumerically. I don't understand why. Any help would be most appreciated.
CodePudding user response:
Actually, this is the default behaviour of Pandas when grouping categorical value, it adds missing categories (checkout this thread). To group only the observed categories on the dataframe you can use:
df.groupby(['Series','Name'],observed=True)['Name'].count()