Pandas replace for multiple cols not working-CodePudding

I am trying to access multiple Pandas DataFrame columns of the Pima Indians Diabetes Dataset (those ones listed in cols_missing) and replace existing zeros with np.NaN.

# Replace missing values (meaningless 0s) in df with NaN
cols_missing_vals = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI','Age']]
df = cols_missing_vals.replace(to_replace=0, value=np.NaN, inplace=False)
# Fill NaN values with mean imputation
df = df.fillna(value=df.mean(), inplace=False)

When I try to operate with inplace=False in order to avoid chained assignments and reassign the variable, the operations are performed but the three columns where no replace operation takes place are dropped.

    Glucose BloodPressure   SkinThickness   Insulin BMI Age
0   148.0   72.0    35.00000    155.548223  33.6    50
1   85.0    66.0    29.00000    155.548223  26.6    31
2   183.0   64.0    29.15342    155.548223  23.3    32
3   89.0    66.0    23.00000    94.000000   28.1    21
4   137.0   40.0    35.00000    168.000000  43.1    33
... ... ... ... ... ... ...
763 101.0   76.0    48.00000    180.000000  32.9    63
764 122.0   70.0    27.00000    155.548223  36.8    27
765 121.0   72.0    23.00000    112.000000  26.2    30
766 126.0   60.0    29.15342    155.548223  30.1    47
767 93.0    70.0    31.00000    155.548223  30.4    23

However, once I use inplace=True and skip reassigning the variable, no replacement operation takes place at all.

cols_missing_vals = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI','Age']]
cols_missing_vals.replace(to_replace=0, value=np.NaN, inplace=True)
# Fill NaN values with mean imputation
df.fillna(value=df.mean(), inplace=True)

    Pregnancies Glucose BloodPressure   SkinThickness   Insulin BMI DiabetesPedigreeFunction    Age Outcome
0   6   148 72  35  0   33.6    0.627   50  1
1   1   85  66  29  0   26.6    0.351   31  0
2   8   183 64  0   0   23.3    0.672   32  1
3   1   89  66  23  94  28.1    0.167   21  0
4   0   137 40  35  168 43.1    2.288   33  1
... ... ... ... ... ... ... ... ... ...
763 10  101 76  48  180 32.9    0.171   63  0
764 2   122 70  27  0   36.8    0.340   27  0
765 5   121 72  23  112 26.2    0.245   30  0
766 1   126 60  0   0   30.1    0.349   47  1
767 1   93  70  31  0   30.4    0.315   23  0

How can I ensure to keep the entire initial df, but with successful mean imputation for those columns where it is useful? E.g., while skin thickness can't be 0 and which indicates a missing value, pregnancies is totally okay with being 0, however I want to keep both of them in the df. Thanks!

CodePudding user response：

Try this, as per this answer:

cols_missing_vals = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI','Age']
df[cols_missing_vals] = df[cols_missing_vals].replace(to_replace=0, value=np.NaN)
df[cols_missing_vals] = df[cols_missing_vals].fillna(value=df.mean())

CodePudding user response：

The scikit learn package in python has functionality for mean imputation.

I first created a toy pandas dataframe to illustrate:

from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

columns =     ["Pregnancies", "Glucose", "BloodPressure",   "SkinThickness",   "Insulin", "BM", "DiabetesPedigreeFunction",    "Age", "Outcome"]
data = np.random.randint(low=0, high=10, size = (10, len(columns)))
data = pd.DataFrame(data = data, columns = columns)
data

>       Pregnancies Glucose BloodPressure   SkinThickness   Insulin BM  DiabetesPedigreeFunction    Age Outcome
>     0 7   7   7   6   8   0   9   0   0
>     1 7   9   0   0   0   2   5   8   8
>     2 3   3   6   3   0   9   9   0   7
>     3 5   1   4   6   8   7   2   8   6
>     4 7   4   4   7   7   8   7   1   4
>     5 8   8   2   7   4   5   0   5   9
>     6 7   0   3   4   9   6   0   9   3
>     7 6   4   0   8   5   0   1   2   5
>     8 6   1   8   3   6   0   2   0   2
>     9 3   0   1   8   0   3   3   0   6

Here, I apply that functionality to impute the 0 values (which are your missing_values).

impute_mean = SimpleImputer(missing_values = 0, strategy = "mean")
data = pd.DataFrame(data = impute_mean.fit_transform(data), columns = columns)
data

>   Pregnancies Glucose BloodPressure   SkinThickness   Insulin BM  DiabetesPedigreeFunction    Age Outcome
> 0 7.0 7.000   7.000   6.000000    8.000000    5.714286    9.00    5.5 5.555556
> 1 7.0 9.000   4.375   5.777778    6.714286    2.000000    5.00    8.0 8.000000
> 2 3.0 3.000   6.000   3.000000    6.714286    9.000000    9.00    5.5 7.000000
> 3 5.0 1.000   4.000   6.000000    8.000000    7.000000    2.00    8.0 6.000000
> 4 7.0 4.000   4.000   7.000000    7.000000    8.000000    7.00    1.0 4.000000
> 5 8.0 8.000   2.000   7.000000    4.000000    5.000000    4.75    5.0 9.000000
> 6 7.0 4.625   3.000   4.000000    9.000000    6.000000    4.75    9.0 3.000000
> 7 6.0 4.000   4.375   8.000000    5.000000    5.714286    1.00    2.0 5.000000
> 8 6.0 1.000   8.000   3.000000    6.000000    5.714286    2.00    5.5 2.000000
> 9 3.0 4.625   1.000   8.000000    6.714286    3.000000    3.00    5.5 6.000000