I am trying to access multiple Pandas DataFrame columns of the Pima Indians Diabetes Dataset (those ones listed in cols_missing) and replace existing zeros with np.NaN.
# Replace missing values (meaningless 0s) in df with NaN
cols_missing_vals = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI','Age']]
df = cols_missing_vals.replace(to_replace=0, value=np.NaN, inplace=False)
# Fill NaN values with mean imputation
df = df.fillna(value=df.mean(), inplace=False)
When I try to operate with inplace=False in order to avoid chained assignments and reassign the variable, the operations are performed but the three columns where no replace operation takes place are dropped.
Glucose BloodPressure SkinThickness Insulin BMI Age
0 148.0 72.0 35.00000 155.548223 33.6 50
1 85.0 66.0 29.00000 155.548223 26.6 31
2 183.0 64.0 29.15342 155.548223 23.3 32
3 89.0 66.0 23.00000 94.000000 28.1 21
4 137.0 40.0 35.00000 168.000000 43.1 33
... ... ... ... ... ... ...
763 101.0 76.0 48.00000 180.000000 32.9 63
764 122.0 70.0 27.00000 155.548223 36.8 27
765 121.0 72.0 23.00000 112.000000 26.2 30
766 126.0 60.0 29.15342 155.548223 30.1 47
767 93.0 70.0 31.00000 155.548223 30.4 23
However, once I use inplace=True and skip reassigning the variable, no replacement operation takes place at all.
cols_missing_vals = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI','Age']]
cols_missing_vals.replace(to_replace=0, value=np.NaN, inplace=True)
# Fill NaN values with mean imputation
df.fillna(value=df.mean(), inplace=True)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
... ... ... ... ... ... ... ... ... ...
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 0 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 0 0 30.1 0.349 47 1
767 1 93 70 31 0 30.4 0.315 23 0
How can I ensure to keep the entire initial df, but with successful mean imputation for those columns where it is useful? E.g., while skin thickness can't be 0 and which indicates a missing value, pregnancies is totally okay with being 0, however I want to keep both of them in the df. Thanks!
CodePudding user response:
Try this, as per this answer:
cols_missing_vals = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI','Age']
df[cols_missing_vals] = df[cols_missing_vals].replace(to_replace=0, value=np.NaN)
df[cols_missing_vals] = df[cols_missing_vals].fillna(value=df.mean())
CodePudding user response:
The scikit learn package in python has functionality for mean imputation.
I first created a toy pandas dataframe to illustrate:
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BM", "DiabetesPedigreeFunction", "Age", "Outcome"]
data = np.random.randint(low=0, high=10, size = (10, len(columns)))
data = pd.DataFrame(data = data, columns = columns)
data
> Pregnancies Glucose BloodPressure SkinThickness Insulin BM DiabetesPedigreeFunction Age Outcome
> 0 7 7 7 6 8 0 9 0 0
> 1 7 9 0 0 0 2 5 8 8
> 2 3 3 6 3 0 9 9 0 7
> 3 5 1 4 6 8 7 2 8 6
> 4 7 4 4 7 7 8 7 1 4
> 5 8 8 2 7 4 5 0 5 9
> 6 7 0 3 4 9 6 0 9 3
> 7 6 4 0 8 5 0 1 2 5
> 8 6 1 8 3 6 0 2 0 2
> 9 3 0 1 8 0 3 3 0 6
Here, I apply that functionality to impute the 0
values (which are your missing_values
).
impute_mean = SimpleImputer(missing_values = 0, strategy = "mean")
data = pd.DataFrame(data = impute_mean.fit_transform(data), columns = columns)
data
> Pregnancies Glucose BloodPressure SkinThickness Insulin BM DiabetesPedigreeFunction Age Outcome
> 0 7.0 7.000 7.000 6.000000 8.000000 5.714286 9.00 5.5 5.555556
> 1 7.0 9.000 4.375 5.777778 6.714286 2.000000 5.00 8.0 8.000000
> 2 3.0 3.000 6.000 3.000000 6.714286 9.000000 9.00 5.5 7.000000
> 3 5.0 1.000 4.000 6.000000 8.000000 7.000000 2.00 8.0 6.000000
> 4 7.0 4.000 4.000 7.000000 7.000000 8.000000 7.00 1.0 4.000000
> 5 8.0 8.000 2.000 7.000000 4.000000 5.000000 4.75 5.0 9.000000
> 6 7.0 4.625 3.000 4.000000 9.000000 6.000000 4.75 9.0 3.000000
> 7 6.0 4.000 4.375 8.000000 5.000000 5.714286 1.00 2.0 5.000000
> 8 6.0 1.000 8.000 3.000000 6.000000 5.714286 2.00 5.5 2.000000
> 9 3.0 4.625 1.000 8.000000 6.714286 3.000000 3.00 5.5 6.000000