How to get feature names when using onehot encoder on only certain columns sklearn


I have read many posts on this that reference get_feature_names() from sklearn, which now appears to be deprecated and replaced by get_feature_names_out; I can't get either to work. It also appears that there is no way to use get_feature_names (or get_feature_names_out) with the ColumnTransformer class. So I fit and transform my numeric columns with SimpleImputer and then StandardScaler, then SimpleImputer('most_frequent') and OneHotEncoder for the categorical variables. I run them all individually, since I can't put them in a pipeline, and then try get_feature_names, which results in:

ValueError: input_features should have length equal to number of features (5), got 11

I have also tried getting feature names for just the categorical features as well as just the numeric, and each one gives the following errors respectively:

ValueError: input_features should have length equal to number of features (5), got 121942

and

ValueError: input_features should have length equal to number of features (5), got 121942

I am completely lost, and also open to an easier way to get the feature names, so that I can make sure the prod data I run this model on after training/testing has exactly the same features the model was trained to expect (which is the root issue here).

If I'm "barking up the wrong tree" by trying to get the feature names for the reasoning outlined in the root issue I'm also more than willing to be corrected. Here is my code:

#ONE HOT
import sklearn
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# !pip install -U scikit-learn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

numeric_columns = X.select_dtypes(include=['int64','float64']).columns
cat_columns = X.select_dtypes(include=['object']).columns


si_num = SimpleImputer(strategy='median')
si_cat = SimpleImputer(strategy='most_frequent')

ss = StandardScaler()
ohe = OneHotEncoder()

si_num.fit_transform(X[numeric_columns])
si_cat.fit_transform(X[cat_columns])
ss.fit_transform(X[numeric_columns])
ohe.fit_transform(X[cat_columns])

# Note: get_feature_names expects a list of input feature *names*, not a
# DataFrame -- len() of a DataFrame is its row count, which is why the error
# says "got 121942" -- and this encoder was fitted on cat_columns anyway
ohe.get_feature_names(X[numeric_columns])

Thanks!

CodePudding user response:

I think this should work as a single composite estimator that does all your transformations and provides get_feature_names_out:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

num_pipe = Pipeline([
    ("imp", si_num),
    ("scale", ss),
])
cat_pipe = Pipeline([
    ("imp", si_cat),
    ("ohe", ohe),
])
preproc = ColumnTransformer([
    ("num", num_pipe, numeric_columns),
    ("cat", cat_pipe, cat_columns),
])

Ideally, you should save the fitted composite and use that to transform production data, rather than using the feature names to reconcile different categories.

You should also fit this composite only on the training set, transforming the test set separately.
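A sketch of that train/test discipline, again with invented toy data: the key point is fit_transform on the training set and plain transform on the test set. handle_unknown="ignore" is a deliberate choice here so that a category seen only in test/production data doesn't raise an error:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for X
X = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, 22.0, 55.0],
    "city": ["NY", "LA", np.nan, "NY", "LA", "NY"],
})

preproc = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age"]),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), ["city"]),
])

X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

# fit_transform on train: medians, scaling stats and categories are
# learned from the training set only
X_train_t = preproc.fit_transform(X_train)
# transform (not fit_transform!) on test reuses those learned statistics,
# guaranteeing the same output columns
X_test_t = preproc.transform(X_test)
```

The fitted `preproc` can then be persisted (e.g. with `joblib.dump`) and loaded to transform production data, so it produces exactly the columns the model was trained on.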
