I have a dataset with 21 columns there are 2 columns that has 25% missing values, I'm reluctant to drop them or not? Is it make sence to drop columns that has more than 20% of its data as missing, or how can I determine the percentage of missing values that decide to drop the column
I dropped the columns that have 20% or more missing values, I am expecting to know the best way to determine this percentage amount for example: should I use 20% or 40% or higher?
CodePudding user response:
One approach
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 5, 21 * 5).reshape(-1, 21)).replace({0: np.nan})
print('Original df\n',df)
df = df.loc[:, df.isna().sum().div(df.shape[0]).le(0.25)]
print('\nResult df without columns > 25% missing values\n',df)
Original df
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
0 2.0 NaN 3.0 NaN NaN 3.0 2.0 NaN 1.0 4.0 NaN 4.0 3.0 1.0 3.0 4.0 1.0 1.0 1.0 4.0 NaN
1 3.0 NaN 3.0 NaN 4.0 NaN 3.0 NaN NaN 1.0 NaN 2.0 1.0 4.0 NaN 2.0 3.0 NaN 4.0 NaN 2.0
2 4.0 NaN 2.0 NaN NaN 1.0 2.0 4.0 1.0 4.0 4.0 1.0 3.0 2.0 2.0 4.0 NaN NaN 2.0 3.0 4.0
3 2.0 NaN 4.0 4.0 3.0 3.0 2.0 NaN 2.0 3.0 4.0 NaN 1.0 4.0 NaN 2.0 2.0 3.0 3.0 4.0 1.0
4 4.0 NaN 4.0 3.0 4.0 1.0 4.0 NaN NaN NaN 2.0 2.0 NaN 2.0 2.0 2.0 NaN 2.0 NaN 2.0 NaN
Result df without columns > 25% missing values
0 2 5 6 9 11 12 13 15 18 19
0 2.0 3.0 3.0 2.0 4.0 4.0 3.0 1.0 4.0 1.0 4.0
1 3.0 3.0 NaN 3.0 1.0 2.0 1.0 4.0 2.0 4.0 NaN
2 4.0 2.0 1.0 2.0 4.0 1.0 3.0 2.0 4.0 2.0 3.0
3 2.0 4.0 3.0 2.0 3.0 NaN 1.0 4.0 2.0 3.0 4.0
4 4.0 4.0 1.0 4.0 NaN 2.0 NaN 2.0 2.0 NaN 2.0
CodePudding user response:
how can I determine the percentage of missing values
You might do it following way
import pandas as pd
df = pd.DataFrame({'X':[1,2,3],'Y':[4,5,None],'Z':[7,None,None]})
missing = df.isnull().mean() * 100
print(missing)
output
X 0.000000
Y 33.333333
Z 66.666667
dtype: float64
Explanation: .isnull()
gives True
or False
, as they are treated as 1
and 0
when doing arithemtic, getting mean
will give value 0.0
(nothing missing) to 1.0
(all missing) which you need to multiply by 100
to get percentage.