Imputing Missing Values in Python

I want to impute a couple of columns in my data frame using scikit-learn's SimpleImputer. I tried the code below, but with no luck. How should I modify my code? a, b, and e are the columns in my data frame that I want to impute.

My data frame:

    a   b   c   d      e
    NA  39  cat gray   20
    5   NA  dog brown  NA
    7   53  cat tan    33
    NA  NA  cat black  41
    4   24  dog tan    NA

My code:

from sklearn.impute import SimpleImputer

miss_mean_imputer = SimpleImputer(missing_values='NaN', strategy='mean', axis=0)

miss_mean_imputer = miss_mean_imputer.fit(df["a", "b", "e"])

imputed_df = miss_mean_imputer.transform(df.values)

print(imputed_df)

CodePudding user response:

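There are several problems in your code. First, replace missing_values='NaN' with missing_values=np.nan when instantiating the imputer: the string 'NaN' is not the same as the floating-point np.nan that pandas uses for missing numeric values. Second, drop axis=0: that argument belonged to the old Imputer class, and SimpleImputer does not accept it (it always imputes column by column). Third, select several columns with a list, df[['a', 'b', 'e']], not df["a", "b", "e"], which pandas treats as a single tuple key. Finally, make sure the imputer transforms the same columns it was fitted on; calling transform(df.values) passes all five columns, including the string columns c and d. See the code below.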

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'a': [np.nan, 5.0, 7.0, np.nan, 4.0],
    'b': [39.0, np.nan, 53.0, np.nan, 24.0],
    'c': ['cat', 'dog', 'cat', 'cat', 'dog'],
    'd': ['gray', 'brown', 'tan', 'black', 'tan'],
    'e': [20.0, np.nan, 33.0, 41.0, np.nan]
})

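# Fit the imputer on the numeric columns only; np.nan marks the missing entries
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df[['a', 'b', 'e']])

# Transform the same columns the imputer was fitted on and write them back
imputed_df = df.copy()
imputed_df[['a', 'b', 'e']] = imputer.transform(df[['a', 'b', 'e']])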

print(imputed_df)
#           a          b    c      d          e
# 0  5.333333  39.000000  cat   gray  20.000000
# 1  5.000000  38.666667  dog  brown  31.333333
# 2  7.000000  53.000000  cat    tan  33.000000
# 3  5.333333  38.666667  cat  black  41.000000
# 4  4.000000  24.000000  dog    tan  31.333333
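
As a cross-check, here is a minimal sketch of two shorter variants, assuming the same example data (column d is left out for brevity). It uses fit_transform to fit and transform in one call, and compares the result with a pandas-only mean fill via fillna. The names cols, sk_imputed, and pd_imputed are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'a': [np.nan, 5.0, 7.0, np.nan, 4.0],
    'b': [39.0, np.nan, 53.0, np.nan, 24.0],
    'c': ['cat', 'dog', 'cat', 'cat', 'dog'],
    'e': [20.0, np.nan, 33.0, 41.0, np.nan],
})

cols = ['a', 'b', 'e']  # numeric columns to impute (illustrative name)

# fit_transform fits the imputer and transforms the same columns in one call;
# missing_values defaults to np.nan, so it does not need to be passed here
sk_imputed = df.copy()
sk_imputed[cols] = SimpleImputer(strategy='mean').fit_transform(df[cols])

# pandas-only variant: fill each column's NaN with that column's own mean
pd_imputed = df.copy()
pd_imputed[cols] = df[cols].fillna(df[cols].mean())

# Both approaches should agree on the imputed values
print(np.allclose(sk_imputed[cols], pd_imputed[cols]))
```

The pandas version is convenient for a one-off fill. The fitted SimpleImputer is the better choice when the same column means have to be reused later, for example on new rows with the same columns.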