I am working on a lab that involves preprocessing data, and I am trying to use ColumnTransformer with the pipeline syntax. My code is below.
preprocess = ColumnTransformer(
    [('imp_mean', SimpleImputer(strategy='mean'), numerics_cols),
     ('imp_mode', SimpleImputer(strategy='most_frequent'), categorical_cols),
     ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
     #('stander', StandardScaler(), fewer_cols_train_X_df.columns)
    ])
After I run this code and call the pipeline, the result is:
['female', 1.0, 0.0, 0.0],
['male', 0.0, 1.0, 0.0],
['female', 1.0, 0.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['female', 1.0, 0.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['male', 0.0, 1.0, 0.0],
['female', 1.0, 0.0, 0.0],
['female', 1.0, 0.0, 0.0],
['male', 0.0, 1.0, 0.0],
As you can see, the categorical column is still in the result. I tried to drop it, but it is still there. I just want to remove the categorical column from this result so that I can run StandardScaler. I don't understand why it doesn't work. Thank you for reading.
CodePudding user response:
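With ColumnTransformer you cannot chain operations sequentially on the same columns. Each transformer in the list is applied independently to the original columns it is given, and the outputs are concatenated side by side.
Therefore, in your example, the categorical columns are imputed by one transformer and one-hot encoded by another, each starting from the raw data. The imputed string column ('female'/'male') ends up in the result next to the one-hot columns, and the encoder never sees the imputed values.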
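One way to check how two entries that target the same column behave is to run them on a tiny frame. The sketch below is only an illustration: the one-column DataFrame and the variable names are made up for the demonstration, and `sparse_threshold=0` is set only to force a dense array so the result is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a single categorical column stored as object dtype.
X_demo = pd.DataFrame({"gender": ["male", "female", "male"]}, dtype=object)

# Two separate entries that both point at the same column.
parallel = ColumnTransformer(
    [
        ("imp_mode", SimpleImputer(strategy="most_frequent"), ["gender"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ],
    sparse_threshold=0,  # assumption for the demo: always return a dense array
)

out_parallel = parallel.fit_transform(X_demo)

# The first column holds the imputer's output (still strings); the remaining
# columns hold the one-hot encoding of the same original column.
print(out_parallel.shape)
print(out_parallel[0])
```

The string column in the output comes from the imputer entry, which is the same pattern as the rows shown in the question.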
To perform both operations (imputing and one-hot encoding) on the same columns, put these preprocessing steps in a Pipeline so that they run sequentially.
The example below illustrates how to handle different processing for numerical and categorical features.
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
X = pd.DataFrame({'gender': ['male', 'male', 'female'],
                  'A': [1, 10, 20],
                  'B': [1, 150, 20]})
# Categorical columns: impute the most frequent value, then one-hot encode.
categorical_preprocessing = Pipeline([
    ('imp_mode', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
# Numerical columns: impute the mean, then standardize.
numerical_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
# Each pipeline runs only on the columns selected by dtype.
preprocessing = ColumnTransformer([
    ('categorical', categorical_preprocessing,
     make_column_selector(dtype_include=object)),
    ('numerical', numerical_preprocessing,
     make_column_selector(dtype_include=np.number)),
])
preprocessing.fit_transform(X)
Output:
array([[ 0. , 1. , -1.20270298, -0.84570663],
[ 0. , 1. , -0.04295368, 1.40447708],
[ 1. , 0. , 1.24565666, -0.55877045]])
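The frame above has no missing values, so the imputers have nothing to fill. As a rough sketch of the same pipeline-inside-ColumnTransformer pattern on data with gaps, the variant below uses a made-up frame with one missing category and one missing number. The column names are passed explicitly instead of through `make_column_selector`, and `sparse_threshold=0` is an assumption added to keep the result dense.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with one missing value in each column.
X_missing = pd.DataFrame({
    "gender": pd.Series(["male", np.nan, "female", "male"], dtype=object),
    "A": [1.0, 10.0, np.nan, 20.0],
})

categorical_steps = Pipeline([
    ("imp_mode", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
numerical_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

sequential = ColumnTransformer(
    [
        ("categorical", categorical_steps, ["gender"]),
        ("numerical", numerical_steps, ["A"]),
    ],
    sparse_threshold=0,  # assumption for the demo: always return a dense array
)

out_missing = sequential.fit_transform(X_missing)

# Two one-hot columns for gender plus one scaled column for A, no strings left.
print(out_missing.shape)
print(out_missing)
```

Here the missing gender is filled with the most frequent category before encoding, and the missing number is filled with the column mean before scaling, so the result is purely numeric.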