Count missing values of several variables by group with many variables-CodePudding

I would like to count the missing variables in a dataframe. I have many variables that I wish to count, and many variables that I wish to group by. I know how to do this for one variable in the group, but not several:

import pandas as pd

df=pd.DataFrame({'type':['A', 'A', 'B'], 'var1':[1, None, 2],'var2':[None, None, 3]})

vars=['var1', 'var2']

df.drop('type', 1)[vars].isna().groupby(df.type, sort=False).sum().reset_index()

Attempts:

df=pd.DataFrame({'type':['A', 'A', 'B'], 'type2':['Z', 'Y', 'Y'], 'var1':[1, None, 2], 
                'var2':[None, None, 3]})
df.drop(['type', 'type2'], 1).isna().groupby(df[['type', 'type2']], sort=False).sum().reset_index()
grouping_vars = ['type', 'type2']
df.drop(grouping_vars, 1).isna().groupby(df[grouping_vars], sort=False).sum().reset_index()

Both attempts above give me

ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional.

I'm guessing I'm not understanding something fundamental about groupby here, so any help is much appreciated!

CodePudding user response：

You can not use dataframe as the grouper since it is two dimensional object. Groupby requires one dimensional arrays/series as grouper. So in order to fix the error you can create a list of one dimensional series(grp_cols) and then use it as a grouper to count the missing values

keys = ['type', 'type2']
grp_cols = [df[k] for k in keys]
df.drop(keys, axis=1).isna().groupby(grp_cols, sort=False).sum().reset_index()

Alternatively, there is a simpler approach where you have to first set the grouping columns as index then use isna followed by groupby sum to count the missing values

keys = ['type', 'type2']
df.set_index(keys).isna().groupby(keys, sort=False).sum().reset_index()

  type type2  var1  var2
0    A     Z     0     1
1    A     Y     1     1
2    B     Y     0     0