I'm trying to inspect the behaviour of the pandas.DataFrame.groupby
and pandas.DataFrame.pivot_table
methods and I've come up to this difference which I can't explain by myself.
It seems that the specification of dropna=True
(default for both) has different consequences in the two cases, which might be somehow enforced by the different descriptions which are given within the docs.
With .groupby()
I'd instead get (while expecting the same result obtained above):
df.groupby(by=['country_live', 'employment_status'], dropna=True)['age'] \
.mean() \
.unstack()
Can someone explain the reason(s) why the two do not work the same (thus implicitly explaining the behaviour of dropna
in .groupby()
)?
CodePudding user response:
Replying to your comment here since code formatting is a pain in comments.
I don't know exactly how you tried dropna=False
for groupby
, but running the following code will show the group with nan
value for country_live
:
df = pd.DataFrame({
'age': [31, np.nan, 28, 22, 54, np.nan, 49, 60, 25, np.nan],
'country_live': ['Italy', pd.NA, 'Italy', 'Spain', 'France', 'Italy', 'Spain', 'Spain', 'France', 'Spain'],
'employment_status': ['Fully employed by a company / organization', 'Partially employed by a company / organization',
'Working student', 'Working student', 'Fully employed by a company / organization', 'Partially employed by a company / organization',
'Fully employed by a company / organization', 'Fully employed by a company / organization', 'Working student',
'Partially employed by a company / organization']
},
)
df = df.assign(age=lambda t: t['age'].astype('Int64'), \
country_live=lambda t: t['country_live'].astype('category'), \
employment_status=lambda t: t['employment_status'].astype('category'))
for gp, sub_df in df.groupby(by=['country_live', 'employment_status'], dropna=False):
print(gp, sub_df, "\n", sep="\n")
Output (see last lines):
('France', 'Fully employed by a company / organization')
age country_live employment_status
4 54 France Fully employed by a company / organization
('France', 'Working student')
age country_live employment_status
8 25 France Working student
('Italy', 'Fully employed by a company / organization')
age country_live employment_status
0 31 Italy Fully employed by a company / organization
('Italy', 'Partially employed by a company / organization')
age country_live employment_status
5 <NA> Italy Partially employed by a company / organization
('Italy', 'Working student')
age country_live employment_status
2 28 Italy Working student
('Spain', 'Fully employed by a company / organization')
age country_live employment_status
6 49 Spain Fully employed by a company / organization
7 60 Spain Fully employed by a company / organization
('Spain', 'Partially employed by a company / organization')
age country_live employment_status
9 <NA> Spain Partially employed by a company / organization
('Spain', 'Working student')
age country_live employment_status
3 22 Spain Working student
(nan, 'Partially employed by a company / organization')
age country_live employment_status
1 <NA> NaN Partially employed by a company / organization
Respectively, the nan
group will be ignored if you set dropna=True
CodePudding user response:
The main difference is that for .groupby()
the dropna=True
refers to the groups you are creating, NOT to the values. In fact if you add a row to your df
:
row = {'age':50,'country_live':np.nan,'employment_status':'Partially employed by a company / organization'}
df = df.append(row, ignore_index=True)
the pivot table does not change the output changing the bool of dropna
(you don't have the nan
group in the index.
The situation changes in the groupby:
With dropna=True
you have the same result you obtained, with dropna=False
, the nan
group is added