Home > Blockchain >  dropna=True behaviour in pandas.DataFrame.groupby and pandas.DataFrame.pivot_table
dropna=True behaviour in pandas.DataFrame.groupby and pandas.DataFrame.pivot_table

Time:10-24

I'm trying to inspect the behaviour of the pandas.DataFrame.groupby and pandas.DataFrame.pivot_table methods and I've come up to this difference which I can't explain by myself.

It seems that the specification of dropna=True (default for both) has different consequences in the two cases, which might be somehow enforced by the different descriptions which are given within the docs.

For enter image description here

With .groupby() I'd instead get (while expecting the same result obtained above):

df.groupby(by=['country_live', 'employment_status'], dropna=True)['age'] \
    .mean() \
    .unstack()

enter image description here

Can someone explain the reason(s) why the two do not work the same (thus implicitly explaining the behaviour of dropna in .groupby())?

CodePudding user response:

Replying to your comment here since code formatting is a pain in comments.

I don't know exactly how you tried dropna=False for groupby, but running the following code will show the group with nan value for country_live:

df = pd.DataFrame({
    'age': [31, np.nan, 28, 22, 54, np.nan, 49, 60, 25, np.nan],
    'country_live': ['Italy', pd.NA, 'Italy', 'Spain', 'France', 'Italy', 'Spain', 'Spain', 'France', 'Spain'],
    'employment_status': ['Fully employed by a company / organization', 'Partially employed by a company / organization',
    'Working student', 'Working student', 'Fully employed by a company / organization', 'Partially employed by a company / organization',
    'Fully employed by a company / organization', 'Fully employed by a company / organization', 'Working student',
    'Partially employed by a company / organization']
    },
)

df = df.assign(age=lambda t: t['age'].astype('Int64'), \
    country_live=lambda t: t['country_live'].astype('category'), \
    employment_status=lambda t: t['employment_status'].astype('category'))

for gp, sub_df in df.groupby(by=['country_live', 'employment_status'], dropna=False):
    print(gp, sub_df, "\n", sep="\n")

Output (see last lines):

('France', 'Fully employed by a company / organization')
   age country_live                           employment_status
4   54       France  Fully employed by a company / organization


('France', 'Working student')
   age country_live employment_status
8   25       France   Working student


('Italy', 'Fully employed by a company / organization')
   age country_live                           employment_status
0   31        Italy  Fully employed by a company / organization


('Italy', 'Partially employed by a company / organization')
    age country_live                               employment_status
5  <NA>        Italy  Partially employed by a company / organization


('Italy', 'Working student')
   age country_live employment_status
2   28        Italy   Working student


('Spain', 'Fully employed by a company / organization')
   age country_live                           employment_status
6   49        Spain  Fully employed by a company / organization
7   60        Spain  Fully employed by a company / organization


('Spain', 'Partially employed by a company / organization')
    age country_live                               employment_status
9  <NA>        Spain  Partially employed by a company / organization


('Spain', 'Working student')
   age country_live employment_status
3   22        Spain   Working student


(nan, 'Partially employed by a company / organization')
    age country_live                               employment_status
1  <NA>          NaN  Partially employed by a company / organization

Respectively, the nan group will be ignored if you set dropna=True

CodePudding user response:

The main difference is that for .groupby() the dropna=True refers to the groups you are creating, NOT to the values. In fact if you add a row to your df:

row = {'age':50,'country_live':np.nan,'employment_status':'Partially employed by a company / organization'}

df = df.append(row, ignore_index=True)

the pivot table does not change the output changing the bool of dropna (you don't have the nan group in the index.

The situation changes in the groupby: With dropna=True you have the same result you obtained, with dropna=False, the nan group is added

  • Related