I have this sample dataset:
mydf = pd.DataFrame({'city':['Porto','Loa','Porto','Porto','Loa'],\
'town':['A','C','A','B','C']})
mydf['city'] = pd.Categorical(mydf['city'])
mydf['town'] = pd.Categorical(mydf['town'])
mydf
city town
0 Porto A
1 Loa C
2 Porto A
3 Porto B
4 Loa C
And I want to count the occurrences grouped by city and town. So I tried this:
mydf.groupby(['city','town']).size().to_frame()
0
city town
Loa A 0
B 0
C 2
Porto A 2
B 1
C 0
But this is wrong, since city C is located only in Loa, not in Porto, and cities A and B are located only in Porto. My expected result is this:
0
city town
Loa C 2
Porto A 2
B 1
Sure I can avoid the pd.Categorical
conversion in 'city' and 'town', but I don't understand that behavior. Is there a parameter I should use to avoid this and get the right and simplified expected result?
CodePudding user response:
Yes, the groupby
size
behavior is expected.
By default, if any of the grouping columns are categorical then it will show all the values for categorical columns regardless whether they appear in a particular group or not.
To turn this default behaviors off, you can set the optional parameter observed=True
in groupby
which will show only observed values(actual appearing values) of categorical columns:
mydf.groupby(['city','town'], observed=True).size().to_frame()
0
city town
Porto A 2
B 1
Loa C 2
CodePudding user response:
Use:
from collections import Counter
mydf.groupby(['city'], observed=True)['town'].apply(lambda x: Counter(x))
Output:
city
Porto A 2.0
B 1.0
Loa C 2.0
Name: town, dtype: float64