I am trying to use ColumnTransformer from sklearn to preprocess data (standardize the numeric columns and convert one column to ordinal). However, a column that I did not ask to be standardized appears to get standardized anyway.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
import numpy as np
import os
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
data = {
    'Gender': ['F', 'F', 'M', 'F', 'F', 'F'],
    'Age': [81, 85, 71, 75, 67, 83],
    'Educ': [8.0, 4.0, 14.0, 16.0, 12.0, 3.0],
    'CDR': [0.5, 0.0, 1.0, 0.5, 0.0, 0.0],
    'eTIV': [1459, 1460, 1332, 1314, 1331, 1335],
    'nWBV': [0.694, 0.754, 0.679, 0.760, 0.761, 0.720],
    'ASF': [1.203, 1.202, 1.317, 1.335, 1.318, 1.314]
}
df = pd.DataFrame(data=data)
num_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)
cat_transformer = Pipeline(
    steps=[("ordinal", OrdinalEncoder())]
)
num_attribs = list(df.drop(['Gender', 'CDR'], axis=1))
ord_attribs = ['Gender']
pipeline = ColumnTransformer([
    ("num", num_transformer, num_attribs),
    ("cat", cat_transformer, ord_attribs)
], remainder='passthrough')
df_prepared = pipeline.fit_transform(df)
The first column of the output, which I expected to be the ordinal-encoded 'Gender', is
array([ 0.61586304, 1.02208403, -0.39968943, 0.00653156, -0.80591041,
0.81897354])
Likewise, the 'CDR' column should remain 0.5, 0.0, 1.0, 0.5, 0.0, 0.0, but what I get back is
array([-1.00490768, 0.39479035, -1.35483219, 0.53476015, 0.55808845,
-0.39837187])
Desired output is
Gender: 0,0,1,0,0,0
Educ: standardized numbers
CDR: 0.5,0.0,1.0,0.5,0.0,0.0
eTIV: standardized numbers
nWBV: standardized numbers
ASF: standardized numbers
Thanks in advance for any advice.
CodePudding user response:
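The transformer is right; only the column order differs from what you expect. ColumnTransformer concatenates its outputs in the order the transformers are listed and appends the remainder columns at the right, so the first column is the scaled 'Age', 'Gender' is second to last, and 'CDR' is last. Check the last column: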
>>> df_prepared[:, -1]
array([0.5, 0. , 1. , 0.5, 0. , 0. ])
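As a sanity check that column 0 is the scaled 'Age' rather than 'Gender', the values can be reproduced by hand. This is a minimal sketch using NumPy only, on the six sample rows from the question, and it assumes StandardScaler's default behaviour of dividing by the population standard deviation (ddof=0):

```python
import numpy as np

# 'Age' values from the sample data in the question.
age = np.array([81, 85, 71, 75, 67, 83], dtype=float)

# Standardize by hand: subtract the mean, divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses by default.
age_scaled = (age - age.mean()) / age.std(ddof=0)

print(np.round(age_scaled, 6))
```

If these numbers match `df_prepared[:, 0]` for the same six rows, the first output column is 'Age', not 'Gender'.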
Reconstruct a DataFrame:
out = pd.DataFrame(df_prepared, columns=pipeline.get_feature_names_out())
print(out)
# Output
num__Age num__Educ num__eTIV num__nWBV num__ASF cat__Gender remainder__CDR
0 0.612372 -0.306719 1.397971 -1.040221 -1.395220 0.0 0.5
1 1.224745 -1.124637 1.414009 0.795463 -1.412994 0.0 0.0
2 -0.918559 0.920158 -0.638843 -1.499142 0.630959 1.0 1.0
3 -0.306186 1.329116 -0.927526 0.979031 0.950883 0.0 0.5
4 -1.530931 0.511199 -0.654881 1.009626 0.648733 0.0 0.0
5 0.918559 -1.329116 -0.590730 -0.244758 0.577639 0.0 0.0