Home > OS >  Count missing values of several variables by group with many variables
Count missing values of several variables by group with many variables

Time:02-20

I would like to count the missing variables in a dataframe. I have many variables that I wish to count, and many variables that I wish to group by. I know how to do this for one variable in the group, but not several:

import pandas as pd

df=pd.DataFrame({'type':['A', 'A', 'B'], 'var1':[1, None, 2],'var2':[None, None, 3]})

vars=['var1', 'var2']

df.drop('type', 1)[vars].isna().groupby(df.type, sort=False).sum().reset_index()

Attempts:

df=pd.DataFrame({'type':['A', 'A', 'B'], 'type2':['Z', 'Y', 'Y'], 'var1':[1, None, 2], 
                'var2':[None, None, 3]})
df.drop(['type', 'type2'], 1).isna().groupby(df[['type', 'type2']], sort=False).sum().reset_index()
grouping_vars = ['type', 'type2']
df.drop(grouping_vars, 1).isna().groupby(df[grouping_vars], sort=False).sum().reset_index()

Both attempts above give me

ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional.

I'm guessing I'm not understanding something fundamental about groupby here, so any help is much appreciated!

CodePudding user response:

You can not use dataframe as the grouper since it is two dimensional object. Groupby requires one dimensional arrays/series as grouper. So in order to fix the error you can create a list of one dimensional series(grp_cols) and then use it as a grouper to count the missing values

keys = ['type', 'type2']
grp_cols = [df[k] for k in keys]
df.drop(keys, axis=1).isna().groupby(grp_cols, sort=False).sum().reset_index()

Alternatively, there is a simpler approach where you have to first set the grouping columns as index then use isna followed by groupby sum to count the missing values

keys = ['type', 'type2']
df.set_index(keys).isna().groupby(keys, sort=False).sum().reset_index()

  type type2  var1  var2
0    A     Z     0     1
1    A     Y     1     1
2    B     Y     0     0
  • Related