ColumnTransformer is normalizing columns that I set to not normalize

Time:02-04

I am trying to use ColumnTransformer from sklearn to preprocess data (standardize the numeric columns and ordinal-encode one column). However, the columns that I specified not to standardize appear to be standardized anyway.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
import numpy as np
import os
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline

data = {'Gender': ['F','F','M','F','F','F'], 
'Age': [81,85,71,75,67,83], 
'Educ': [8.0,4.0,14.0, 16.0, 12.0, 3.0], 
'CDR': [0.5,0.0,1.0,0.5,0.0,0.0], 
'eTIV': [1459,1460,1332,1314,1331,1335],
'nWBV': [0.694,0.754,0.679,0.760,0.761,0.720],
'ASF': [1.203,1.202,1.317,1.335,1.318,1.314]
}

df = pd.DataFrame(data=data)

num_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)

cat_transformer = Pipeline(
    steps=[("ordinal", OrdinalEncoder())]
)

num_attribs = list(df.drop(['Gender', 'CDR'], axis = 1))
ord_attribs = ['Gender']

pipeline = ColumnTransformer([
    ("num", num_transformer, num_attribs),
    ("cat", cat_transformer, ord_attribs)
], remainder='passthrough')

df_prepared = pipeline.fit_transform(df)

The first column of the output (which I expected to be the ordinal-encoded Gender) is

array([ 0.61586304,  1.02208403, -0.39968943,  0.00653156, -0.80591041,
        0.81897354])

Likewise, the 'CDR' column should remain 0.5, 0.0, 1.0, 0.5, 0.0, 0.0, but I get

array([-1.00490768,  0.39479035, -1.35483219,  0.53476015,  0.55808845,
       -0.39837187])

Desired output is

Gender: 0,0,1,0,0,0
Educ: standardized numbers
CDR: 0.5,0.0,1.0,0.5,0.0,0.0
eTIV: standardized numbers
nWBV: standardized numbers
ASF: standardized numbers

Thanks in advance for any advice.

CodePudding user response:

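The transformer is working correctly. ColumnTransformer does not preserve the input column order: it concatenates the outputs in the order the transformers are listed and appends the remainder (passthrough) columns at the end. So the first output column is the standardized Age, not Gender, and CDR ends up untouched in the last column: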

>>> df_prepared[:, -1]
array([0.5, 0. , 1. , 0.5, 0. , 0. ])

To see which output column is which, rebuild a DataFrame using the transformer's feature names:

out = pd.DataFrame(df_prepared, columns=pipeline.get_feature_names_out())
print(out)

# Output
   num__Age  num__Educ  num__eTIV  num__nWBV  num__ASF  cat__Gender  remainder__CDR
0  0.612372  -0.306719   1.397971  -1.040221 -1.395220          0.0             0.5
1  1.224745  -1.124637   1.414009   0.795463 -1.412994          0.0             0.0
2 -0.918559   0.920158  -0.638843  -1.499142  0.630959          1.0             1.0
3 -0.306186   1.329116  -0.927526   0.979031  0.950883          0.0             0.5
4 -1.530931   0.511199  -0.654881   1.009626  0.648733          0.0             0.0
5  0.918559  -1.329116  -0.590730  -0.244758  0.577639          0.0             0.0