I have a large dataframe with over 100 columns. One of the columns is "Cow". For each value of "Cow" I would like to determine the number of missing values in each of the other columns.
Using code from Get proportion of missing values per Country
I am able to tabulate the number of missing values for one column at a time. By repeating the code for each column and then merging the dataframes I am able to build a dataframe that has the proportion of missing values for each cow for each column. The problem is that I have over 100 columns.
The following creates a short data example
import pandas as pd
import numpy as np
mast_model_data = [[1152, '1', '10', '23'], [1154, '1', '4', '43'],
[1155, 'NA', '3', '76'], [1152, '1', '10', 'NA'],
[1155, '2', '10', '65'], [1152, '1', '4', 'NA']]
df = pd.DataFrame(mast_model_data, columns =['Cow', 'Lact', 'Procedure', 'Height'])
df.loc[:,'Lact'] = df['Lact'].replace('NA', np.nan)
df.loc[:,'Procedure'] = df['Procedure'].replace('NA', np.nan)
df.loc[:,'Height'] = df['Height'].replace('NA', np.nan)
df
The data is presented below
Cow Lact Procedure Height
0 1152 1 10 23
1 1154 1 4 43
2 1155 NaN 3 76
3 1152 1 10 NaN
4 1155 2 10 65
5 1152 1 4 NaN
The code that I am using to tabulate missing data is as follows
df1 = (df.groupby('Cow')['Lact']
.apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
.reset_index(name='Lact'))
df2 = (df.groupby('Cow')['Procedure']
.apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
.reset_index(name='Procedure'))
df3 = (df.groupby('Cow')['Height']
.apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
.reset_index(name='Height'))
missing = df1.merge(df2, on=['Cow'], how="left")
missing = missing.merge(df3, on=['Cow'], how="left")
missing
The output of the code above is
Cow Lact Procedure Height
0 1152 0.0 0.0 0.666667
1 1154 0.0 0.0 0.000000
2 1155 0.5 0.0 0.000000
The actual dataframe has more cows and columns so to complete the table will require a lot of repitition
I anticipate there is a more refined way that does not require the repetition required for the method that I am using.
Appreciate advice on how I can streamline the code.
CodePudding user response:
Try as follows:
missing = df.set_index('Cow').isna().groupby(level=0).mean()\
.reset_index(drop=False)
print(missing)
Cow Lact Procedure Height
0 1152 0.0 0.0 0.666667
1 1154 0.0 0.0 0.000000
2 1155 0.5 0.0 0.000000
Explanation
- Set column
Cow
as the index, and applydf.isna
to get a mask of bool values withTrue
for NaN values. - Now, chain
df.groupby
on the index (i.e.level=0
), retrieve themean
, and reset the index again.