I would like to count the missing variables in a dataframe. I have many variables that I wish to count, and many variables that I wish to group by. I know how to do this for one variable in the group, but not several:
import pandas as pd
df=pd.DataFrame({'type':['A', 'A', 'B'], 'var1':[1, None, 2],'var2':[None, None, 3]})
vars=['var1', 'var2']
df.drop('type', 1)[vars].isna().groupby(df.type, sort=False).sum().reset_index()
Attempts:
df=pd.DataFrame({'type':['A', 'A', 'B'], 'type2':['Z', 'Y', 'Y'], 'var1':[1, None, 2],
'var2':[None, None, 3]})
df.drop(['type', 'type2'], 1).isna().groupby(df[['type', 'type2']], sort=False).sum().reset_index()
grouping_vars = ['type', 'type2']
df.drop(grouping_vars, 1).isna().groupby(df[grouping_vars], sort=False).sum().reset_index()
Both attempts above give me
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional.
I'm guessing I'm not understanding something fundamental about groupby
here, so any help is much appreciated!
CodePudding user response:
You can not use dataframe as the grouper since it is two dimensional object. Groupby requires one dimensional arrays/series as grouper. So in order to fix the error you can create a list of one dimensional series(grp_cols
) and then use it as a grouper to count the missing values
keys = ['type', 'type2']
grp_cols = [df[k] for k in keys]
df.drop(keys, axis=1).isna().groupby(grp_cols, sort=False).sum().reset_index()
Alternatively, there is a simpler approach where you have to first set the grouping columns as index then use isna
followed by groupby
sum
to count the missing values
keys = ['type', 'type2']
df.set_index(keys).isna().groupby(keys, sort=False).sum().reset_index()
type type2 var1 var2
0 A Z 0 1
1 A Y 1 1
2 B Y 0 0