Pandas groupby() and agg() method confusion on columns-CodePudding

Can I check what is the difference between

df[['column1', 'column2']].groupby('column1').agg(['mean', 'count'])

and

df[['column1', 'column2']].groupby('column1').agg({'column2': 'mean', 'column2': 'count'})

In the first example, mean and count is performed on column2 which is not in groupby.

In the second example, same logic but I had explicitly mentioned column2 in agg.

Why do I not see the same result for both?

CodePudding user response：

TLDR

The problem with the second statement has to due with overwriting the column.

There are at least three ways to do this statement.

First let's build a test dataset:

import pandas as pd
from seaborn import load_dataset

df_tips = load_dataset('tips')

df_tips.head()

Statement 1: Same as your first wy

df_tips[['sex','size']].groupby(['sex']).agg(['mean','count'])

Output:

            size      
            mean count
sex                   
Male    2.630573   157
Female  2.459770    87

A dataframe with a multiindex column header size and level=1 both aggregations.

Statement 2: Using a list of aggregrations for 'size' in a dictionary

df_tips[['sex','size']].groupby(['sex']).agg({'size':['mean','count']})

Output (same as above)

            size      
            mean count
sex                   
Male    2.630573   157
Female  2.459770    87

Statement 3: Using named aggregrations

df_tips[['sex','size']].groupby(['sex']).agg(mean_size=('size','mean'),count_size=('size','count'))

Output:

        mean_size  count_size
sex                          
Male     2.630573         157
Female   2.459770          87

This give a dataframe with a 'flatten' column header that you name yourself, however that name must not contain a space or special characters.

The incorrect way is your second method

df_tips[['sex','size']].groupby(['sex']).agg({'size':'mean','size':'count'})

Outputs:

        size
sex         
Male     157
Female    87

What is happening here is that you are getting two columns one for each aggregations but the column header is the same 'size', therefore the first iteration is getting overwritten with the second 'count' in this case.