Can I check what is the difference between
df[['column1', 'column2']].groupby('column1').agg(['mean', 'count'])
and
df[['column1', 'column2']].groupby('column1').agg({'column2': 'mean', 'column2': 'count'})
In the first example, mean
and count
is performed on column2
which is not in groupby
.
In the second example, same logic but I had explicitly mentioned column2
in agg
.
Why do I not see the same result for both?
CodePudding user response:
TLDR
The problem with the second statement has to due with overwriting the column.
There are at least three ways to do this statement.
First let's build a test dataset:
import pandas as pd
from seaborn import load_dataset
df_tips = load_dataset('tips')
df_tips.head()
Statement 1: Same as your first wy
df_tips[['sex','size']].groupby(['sex']).agg(['mean','count'])
Output:
size
mean count
sex
Male 2.630573 157
Female 2.459770 87
A dataframe with a multiindex column header size and level=1 both aggregations.
Statement 2: Using a list of aggregrations for 'size' in a dictionary
df_tips[['sex','size']].groupby(['sex']).agg({'size':['mean','count']})
Output (same as above)
size
mean count
sex
Male 2.630573 157
Female 2.459770 87
Statement 3: Using named aggregrations
df_tips[['sex','size']].groupby(['sex']).agg(mean_size=('size','mean'),count_size=('size','count'))
Output:
mean_size count_size
sex
Male 2.630573 157
Female 2.459770 87
This give a dataframe with a 'flatten' column header that you name yourself, however that name must not contain a space or special characters.
The incorrect way is your second method
df_tips[['sex','size']].groupby(['sex']).agg({'size':'mean','size':'count'})
Outputs:
size
sex
Male 157
Female 87
What is happening here is that you are getting two columns one for each aggregations but the column header is the same 'size', therefore the first iteration is getting overwritten with the second 'count' in this case.