I have a problem. I want to use StandardScaler()
, but my dataset contains certain OneHotEncoding
values and other values that should be not be scaled. But if I'm running the StandardScaler()
all the values are scaled. So is there an option to run this method only on certain values inside a pipeline?
I found this question: One-Hot-Encode categorical variables and scale continuous ones simultaneouely with the below code
columns = ['rank']
columns_to_scale = ['gre', 'gpa']
scaler = StandardScaler()
ohe = OneHotEncoder(sparse=False)
# Concatenate (Column-Bind) Processed Columns Back Together
processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)
So is there an option to only run the StandardScaler()
inside a pipeline
on only certain values and the other values should be merged to the scaled values?
So the pipeline should only use StandardScaler on the values 'xy', 'xyz'
.
StandardScaler Class
from sklearn.base import BaseEstimator, TransformerMixin
class StandardScaler_with_certain_features(BaseEstimator, TransformerMixin):
def __init__(self, columns_to_scale):
scaler = StandardScaler()
def fit(self, X, y = None):
scaler.fit(X_train) # only std.fit on train set
X_train_nor = scaler.transform(X_train.values)
def transform(self, X, y = None):
return X
Pipeline
columns_to_scale = ['xy', 'xyz']
steps = [('standard_scaler', StandardScaler_with_certain_features(columns_to_scale)),
('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=100))),
('lasso', Lasso(alpha=0.03))]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)
parameteres = { }
grid = GridSearchCV(pipeline, param_grid=parameteres, cv=5)
grid.fit(X_train, y_train)
print("score = %3.2f" %(grid.score(X_test,y_test)))
print('Training set score: ' str(grid.score(X_train,y_train)))
print('Test set score: ' str(grid.score(X_test,y_test)))
# Prediction
y_pred = grid.predict(X_test)
print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred, squared=False))
CodePudding user response:
You can include a ColumnTransformer
in the Pipeline
in order to apply the StandardScaler
only to certain columns. You need to set remainder='passthrough
to make sure that the columns that are not scaled are concatenated with the ones that are scaled.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
df = pd.DataFrame({
'y': np.random.normal(0, 1, 100),
'x': np.random.normal(0, 1, 100),
'z': np.random.normal(0, 1, 100),
'xy': np.random.normal(2, 3, 100),
'xyz': np.random.normal(4, 5, 100),
})
X = df.drop(labels=['y'], axis=1)
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)
preprocessor = ColumnTransformer(
transformers=[('scaler', StandardScaler(), ['xy', 'xyz'])],
remainder='passthrough'
)
pipeline = Pipeline([
('preprocessor', preprocessor),
('lasso', Lasso(alpha=0.03))
])
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)