Determine the proportion of missing values for each Cow-CodePudding

I have a large dataframe with over 100 columns. One of the columns is "Cow". For each value of "Cow" I would like to determine the number of missing values in each of the other columns.

Using code from Get proportion of missing values per Country

I am able to tabulate the number of missing values for one column at a time. By repeating the code for each column and then merging the dataframes I am able to build a dataframe that has the proportion of missing values for each cow for each column. The problem is that I have over 100 columns.

The following creates a short data example

import pandas as pd
import numpy as np

mast_model_data = [[1152, '1', '10', '23'], [1154, '1', '4', '43'],
       [1155, 'NA', '3', '76'], [1152, '1', '10', 'NA'],
       [1155, '2', '10', '65'], [1152, '1', '4', 'NA']]
    
df = pd.DataFrame(mast_model_data, columns =['Cow', 'Lact', 'Procedure', 'Height'])

df.loc[:,'Lact'] = df['Lact'].replace('NA', np.nan)
df.loc[:,'Procedure'] = df['Procedure'].replace('NA', np.nan)
df.loc[:,'Height'] = df['Height'].replace('NA', np.nan)

df

The data is presented below

    Cow  Lact Procedure Height
0   1152    1   10  23
1   1154    1   4   43
2   1155    NaN 3   76
3   1152    1   10  NaN
4   1155    2   10  65
5   1152    1   4   NaN

The code that I am using to tabulate missing data is as follows

df1 = (df.groupby('Cow')['Lact']
         .apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
         .reset_index(name='Lact'))

df2 = (df.groupby('Cow')['Procedure']
         .apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
         .reset_index(name='Procedure'))

df3 = (df.groupby('Cow')['Height']
         .apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
         .reset_index(name='Height'))

missing = df1.merge(df2, on=['Cow'], how="left")

missing = missing.merge(df3, on=['Cow'], how="left")

missing

The output of the code above is

    Cow   Lact  Procedure   Height
0   1152    0.0 0.0 0.666667
1   1154    0.0 0.0 0.000000
2   1155    0.5 0.0 0.000000

The actual dataframe has more cows and columns so to complete the table will require a lot of repitition

I anticipate there is a more refined way that does not require the repetition required for the method that I am using.

Appreciate advice on how I can streamline the code.

CodePudding user response：

Try as follows:

missing = df.set_index('Cow').isna().groupby(level=0).mean()\
    .reset_index(drop=False)

print(missing)

    Cow  Lact  Procedure    Height
0  1152   0.0        0.0  0.666667
1  1154   0.0        0.0  0.000000
2  1155   0.5        0.0  0.000000

Explanation

Set column Cow as the index, and apply df.isna to get a mask of bool values with True for NaN values.
Now, chain df.groupby on the index (i.e. level=0), retrieve the mean, and reset the index again.