Missing dataframe column percentage

I have a dataset with 21 columns, 2 of which have 25% missing values, and I can't decide whether to drop them. Does it make sense to drop columns that have more than 20% of their data missing? Or how can I determine the percentage of missing values at which a column should be dropped?

I dropped the columns that have 20% or more missing values. I would like to know the best way to choose this percentage; for example, should I use 20%, 40%, or higher?

CodePudding user response：

One approach:

import pandas as pd
import numpy as np

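# Build a 5x21 frame of random integers 0-4, then turn every 0 into NaN
df = pd.DataFrame(np.random.randint(0, 5, 21 * 5).reshape(-1, 21)).replace({0: np.nan})
print('Original df\n', df)
# Keep only the columns whose fraction of missing values is at most 0.25
df = df.loc[:, df.isna().mean().le(0.25)]
print('\nResult df without columns > 25% missing values\n', df)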

Original df
     0   1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   19   20
0  2.0 NaN  3.0  NaN  NaN  3.0  2.0  NaN  1.0  4.0  NaN  4.0  3.0  1.0  3.0  4.0  1.0  1.0  1.0  4.0  NaN
1  3.0 NaN  3.0  NaN  4.0  NaN  3.0  NaN  NaN  1.0  NaN  2.0  1.0  4.0  NaN  2.0  3.0  NaN  4.0  NaN  2.0
2  4.0 NaN  2.0  NaN  NaN  1.0  2.0  4.0  1.0  4.0  4.0  1.0  3.0  2.0  2.0  4.0  NaN  NaN  2.0  3.0  4.0
3  2.0 NaN  4.0  4.0  3.0  3.0  2.0  NaN  2.0  3.0  4.0  NaN  1.0  4.0  NaN  2.0  2.0  3.0  3.0  4.0  1.0
4  4.0 NaN  4.0  3.0  4.0  1.0  4.0  NaN  NaN  NaN  2.0  2.0  NaN  2.0  2.0  2.0  NaN  2.0  NaN  2.0  NaN

Result df without columns > 25% missing values
     0    2    5    6    9    11   12   13   15   18   19
0  2.0  3.0  3.0  2.0  4.0  4.0  3.0  1.0  4.0  1.0  4.0
1  3.0  3.0  NaN  3.0  1.0  2.0  1.0  4.0  2.0  4.0  NaN
2  4.0  2.0  1.0  2.0  4.0  1.0  3.0  2.0  4.0  2.0  3.0
3  2.0  4.0  3.0  2.0  3.0  NaN  1.0  4.0  2.0  3.0  4.0
4  4.0  4.0  1.0  4.0  NaN  2.0  NaN  2.0  2.0  NaN  2.0
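
The same filter can probably also be written with dropna and its thresh argument, which counts the non-missing values a column must have in order to be kept. The snippet below is only a sketch: the small fixed frame, the name max_missing, and the 0.25 cut-off are assumptions made up for illustration, and for cut-offs that are not exactly representable as floats the ceil step may need extra care.

```python
import math

import numpy as np
import pandas as pd

# Small fixed frame (made up for illustration) so the result is reproducible
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],        # 0% missing
    'b': [1.0, np.nan, 3.0, 4.0],     # 25% missing
    'c': [np.nan, np.nan, 3.0, 4.0],  # 50% missing
})

max_missing = 0.25  # assumed cut-off: largest allowed fraction of NaN per column

# dropna(thresh=k) keeps a column only if it has at least k non-NaN values,
# so "at most 25% missing" becomes "at least 75% present", rounded up
min_non_missing = math.ceil((1 - max_missing) * len(df))
kept = df.dropna(axis=1, thresh=min_non_missing)

print(kept.columns.tolist())
```

On this frame the call should keep the same columns as the boolean mask built with isna().mean().le(0.25), because a column with exactly 25% missing values still meets the threshold.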

CodePudding user response：

how can I determine the percentage of missing values

You might do it the following way:

import pandas as pd
df = pd.DataFrame({'X':[1,2,3],'Y':[4,5,None],'Z':[7,None,None]})
missing = df.isnull().mean() * 100
print(missing)

output

X     0.000000
Y    33.333333
Z    66.666667
dtype: float64

Explanation: .isnull() returns True or False for each cell. Because these are treated as 1 and 0 in arithmetic, taking the mean of each column gives a value from 0.0 (nothing missing) to 1.0 (everything missing), which you multiply by 100 to get a percentage.
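
To compare candidate cut-offs such as 20% and 40% before dropping anything, the percentages above can be put in a small report. This is a minimal sketch, not a rule for picking the cut-off: the helper name missing_report and its max_missing_pct parameter are hypothetical and exist only for this example.

```python
import pandas as pd


def missing_report(df, max_missing_pct=20.0):
    """Per-column missing percentage, plus a flag for columns above the cut-off.

    Hypothetical helper for illustration; not part of pandas.
    """
    pct = df.isnull().mean() * 100
    report = pd.DataFrame({
        'missing_pct': pct,
        'above_cutoff': pct > max_missing_pct,
    })
    # Worst columns first, so the candidates for dropping are at the top
    return report.sort_values('missing_pct', ascending=False)


df = pd.DataFrame({'X': [1, 2, 3], 'Y': [4, 5, None], 'Z': [7, None, None]})

for cutoff in (20.0, 40.0):
    report = missing_report(df, max_missing_pct=cutoff)
    to_drop = report[report['above_cutoff']].index.tolist()
    print(cutoff, to_drop)
```

With the three-column frame from the answer, a 20% cut-off flags both Y and Z, while a 40% cut-off flags only Z, so the report makes it easy to see what each choice would cost.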