Is this group by size behavior correct?-CodePudding

I have this sample dataset:

mydf = pd.DataFrame({'city':['Porto','Loa','Porto','Porto','Loa'],\
                     'town':['A','C','A','B','C']})
mydf['city'] = pd.Categorical(mydf['city'])
mydf['town'] = pd.Categorical(mydf['town'])
mydf
    city    town
0   Porto   A
1   Loa     C
2   Porto   A
3   Porto   B
4   Loa     C

And I want to count the occurrences grouped by city and town. So I tried this:

mydf.groupby(['city','town']).size().to_frame()
              0
city    town    
Loa     A     0
        B     0
        C     2
Porto   A     2
        B     1
        C     0

But this is wrong, since city C is located only in Loa, not in Porto, and cities A and B are located only in Porto. My expected result is this:

              0
city    town    
Loa     C     2
Porto   A     2
        B     1

Sure I can avoid the pd.Categorical conversion in 'city' and 'town', but I don't understand that behavior. Is there a parameter I should use to avoid this and get the right and simplified expected result?

CodePudding user response：

Yes, the groupby size behavior is expected.

By default, if any of the grouping columns are categorical then it will show all the values for categorical columns regardless whether they appear in a particular group or not.

To turn this default behaviors off, you can set the optional parameter observed=True in groupby which will show only observed values(actual appearing values) of categorical columns:

mydf.groupby(['city','town'], observed=True).size().to_frame()

            0
city  town   
Porto A     2
      B     1
Loa   C     2

CodePudding user response：

Use:

from collections import Counter
mydf.groupby(['city'], observed=True)['town'].apply(lambda x: Counter(x))

Output:

city    
Porto  A    2.0
       B    1.0
Loa    C    2.0
Name: town, dtype: float64