OneHotEncoder doesn't remove categorical columns in pipeline

I have a lab that involves preprocessing data, and I am trying to use ColumnTransformer with the pipeline syntax. My code is below.

preprocess = ColumnTransformer(
                    [('imp_mean', SimpleImputer(strategy='mean'), numerics_cols),
                     ('imp_mode', SimpleImputer(strategy='most_frequent'), categorical_cols),
                     ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
                     #('stander', StandardScaler(), fewer_cols_train_X_df.columns)
                    ])

After I run this code and call the pipeline, the result is:

       ['female', 1.0, 0.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['female', 1.0, 0.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['female', 1.0, 0.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['male', 0.0, 1.0, 0.0],
       ['female', 1.0, 0.0, 0.0],
       ['female', 1.0, 0.0, 0.0],
       ['male', 0.0, 1.0, 0.0],

You can see that the categorical column is still in the result. I tried to drop it, but it is still there. I just want to remove the categorical column from this result so that I can run StandardScaler. I don't understand why it doesn't work. Thank you for reading.

CodePudding user response：

A ColumnTransformer does not chain its transformers. Each transformer is fitted on the original input columns it is given, independently of the others, and the outputs are concatenated side by side.

Therefore, in your example, the categorical columns appear twice in the result: once imputed but still as strings (from 'imp_mode'), and once one-hot encoded (from 'onehot', which receives the original, non-imputed columns).

To impute and then one-hot encode the same columns, put both steps in a Pipeline so that they run sequentially, and pass that Pipeline to the ColumnTransformer.
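A small sketch of what happens when two transformers in a ColumnTransformer are given the same column. The one-column DataFrame `df` and the name `parallel` are hypothetical and only serve the illustration; the transformer settings mirror the ones in the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the data from the question.
df = pd.DataFrame({"gender": ["male", "female", "male"]})

# Both transformers receive the ORIGINAL 'gender' column.
parallel = ColumnTransformer([
    ("imp_mode", SimpleImputer(strategy="most_frequent"), ["gender"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
])

out = parallel.fit_transform(df)

# One column from the imputer (still strings) plus two one-hot columns.
print(out.shape)
print(out)
```

Each row should contain the imputed string followed by the one-hot columns, which is the same pattern as the output shown in the question.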

The example below illustrates how to apply different preprocessing to numerical and categorical features.

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({'gender': ['male', 'male', 'female'],
                  'A': [1, 10, 20],
                  'B': [1, 150, 20]})

categorical_preprocessing = Pipeline(
[
    ('imp_mode', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

numerical_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

preprocessing = ColumnTransformer(
                    [
                        ('categorical', categorical_preprocessing,
                         make_column_selector(dtype_include=object)),
                        ('numerical', numerical_preprocessing,
                         make_column_selector(dtype_include=np.number)),
                    ])

preprocessing.fit_transform(X)

Output:

array([[ 0.        ,  1.        , -1.20270298, -0.84570663],
       [ 0.        ,  1.        , -0.04295368,  1.40447708],
       [ 1.        ,  0.        ,  1.24565666, -0.55877045]])