I want to impute a couple of columns in my data frame using scikit-learn's SimpleImputer. I tried doing this, but with no luck. How should I modify my code? a, b, and e are the columns in my data frame that I want to impute.
My data frame:
a   b   c    d      e
NA  39  cat  gray   20
5   NA  dog  brown  NA
7   53  cat  tan    33
NA  NA  cat  black  41
4   24  dog  tan    NA
My code:
from sklearn.impute import SimpleImputer
miss_mean_imputer = SimpleImputer(missing_values='NaN', strategy='mean', axis=0)
miss_mean_imputer = miss_mean_imputer.fit(df["a", "b", "e"])
imputed_df = miss_mean_imputer.transform(df.values)
print(imputed_df)
Answer:
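You should replace missing_values='NaN' with missing_values=np.nan when instantiating the imputer. SimpleImputer also has no axis argument, so that should be dropped, and several columns are selected with a list, df[['a', 'b', 'e']], not df["a", "b", "e"]. Finally, make sure that the imputer is used to transform the same columns it was fitted on; see the code below.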
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
df = pd.DataFrame({
    'a': [np.nan, 5.0, 7.0, np.nan, 4.0],
    'b': [39.0, np.nan, 53.0, np.nan, 24.0],
    'c': ['cat', 'dog', 'cat', 'cat', 'dog'],
    'd': ['gray', 'brown', 'tan', 'black', 'tan'],
    'e': [20.0, np.nan, 33.0, 41.0, np.nan]
})
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df[['a', 'b', 'e']])
imputed_df = df.copy()
imputed_df[['a', 'b', 'e']] = imputer.transform(df[['a', 'b', 'e']])
print(imputed_df)
#           a          b    c      d          e
# 0  5.333333  39.000000  cat   gray  20.000000
# 1  5.000000  38.666667  dog  brown  31.333333
# 2  7.000000  53.000000  cat    tan  33.000000
# 3  5.333333  38.666667  cat  black  41.000000
# 4  4.000000  24.000000  dog    tan  31.333333
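If the imputer only ever needs to be applied to the data it was fitted on, the fit and transform steps can probably be merged into one call. The following is a minimal sketch of that variant, not the only way to do it; the names cols and means are introduced here just for illustration. It also reads the learned column means back from the imputer, which is a quick way to check that the filled values are what you expect.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'a': [np.nan, 5.0, 7.0, np.nan, 4.0],
    'b': [39.0, np.nan, 53.0, np.nan, 24.0],
    'c': ['cat', 'dog', 'cat', 'cat', 'dog'],
    'd': ['gray', 'brown', 'tan', 'black', 'tan'],
    'e': [20.0, np.nan, 33.0, 41.0, np.nan],
})

# Illustrative name: the numeric columns to impute
cols = ['a', 'b', 'e']

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputed_df = df.copy()
# fit_transform learns the column means and fills the gaps in a single call
imputed_df[cols] = imputer.fit_transform(df[cols])

# statistics_ holds the per-column fill values learned during fitting
means = dict(zip(cols, imputer.statistics_))
print(means)
print(imputed_df)
```

Keeping fit and transform separate, as in the code above this sketch, is still the better choice when the same means have to be reused on other data later, for example a test set.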