Custom Sklearn Transformer changes the shape of X in the GridSearchCV


I want to build a whole pipeline with custom transformers, but I've found that some of the transformers I've built work perfectly fine when cross-validation is not involved, i.e. pipe.fit(X_train, y_train) works, while GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1).fit(X_train, y_train) raises an error.

In the example I'm using OneHotEncoder, which works fine when used directly with make_column_transformer or ColumnTransformer, but not when it's wrapped in the custom transformer.

Code:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# CATEGORICAL_FEATURES, X_train and y_train are defined elsewhere in the script
BASE_TREE_MODEL = RandomForestRegressor()

class data_get_dummies(BaseEstimator, TransformerMixin):
    def __init__(self, columns: list = CATEGORICAL_FEATURES):
        self.columns = columns
        self.encoder = make_column_transformer(
            (OneHotEncoder(handle_unknown="ignore", sparse=False), self.columns),
            remainder='passthrough')

    def fit(self, X, y=None):
        self.encoder.fit(X)
        return self

    def transform(self, X, y=None) -> pd.DataFrame:
        X_ = X.copy()
        df_temp = pd.DataFrame(self.encoder.fit_transform(X_),
                               columns=self.encoder.get_feature_names_out())
        return df_temp

# The same encoder works fine when used directly as a ColumnTransformer:
data_get_dummies_ = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse=False), CATEGORICAL_FEATURES),
    remainder='passthrough')


pipe = Pipeline([
                ('start', data_get_dummies()),
                ('model', BASE_TREE_MODEL)
                ])

param_grid = dict()  # empty grid, GridSearchCV is only used for the cross-validation here
grid_search = GridSearchCV(pipe, param_grid, cv=5, verbose=1, n_jobs=-1)
cv_model = grid_search.fit(X_train, y_train)

print('Pipeline:')
print(cv_model.best_estimator_)
print('----------------------')
print('Score:')
print(cv_model.best_score_)

Error:

/Users/simado/opt/anaconda3/envs/tensorflow/lib/python3.10/site-packages/sklearn/base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names unseen at fit time:
- onehotencoder__Pastato energijos suvartojimo klase:_E
- onehotencoder__Pastato tipas:_Karkasinis
- onehotencoder__Sildymas:_Geoterminis, kita, centrinis kolektorinis
Feature names seen at fit time, yet now missing:
- onehotencoder__Artimiausi darzeliai_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Artimiausios mokyklos_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Artimiausios parduotuves_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Artimiausios stoteles_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Gatve_Virsilu g.
- ...

I've been trying to work this out for a few days now and can't figure it out :(

CodePudding user response:

The issue is that you are refitting your OneHotEncoder when you call transform:

df_temp=pd.DataFrame(self.encoder.fit_transform(X_), 
                     columns=self.encoder.get_feature_names_out())

Thus, when the testing/CV folds contain values of your categorical features that were not seen during training, your output will have a different number of columns than in training, and an error will be raised. You should not re-fit your encoder at transform time, just transform:

df_temp=pd.DataFrame(self.encoder.transform(X_),
                     columns=self.encoder.get_feature_names_out())
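
To see why the shape changes, here is a small standalone sketch (toy data and column names, not from the question) that contrasts transform with fit_transform when the new data contains an unseen category:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, only to illustrate the mechanism
train = pd.DataFrame({"city": ["Vilnius", "Kaunas", "Klaipeda"]})
test = pd.DataFrame({"city": ["Vilnius", "Palanga"]})  # "Palanga" was not seen in train

enc = OneHotEncoder(handle_unknown="ignore", sparse=False)
enc.fit(train)

print(enc.transform(test).shape)      # (2, 3): same columns as at fit time
print(enc.fit_transform(test).shape)  # (2, 2): re-fitting builds new, different columns

With the re-fit, every CV split gives the downstream model a different set of one-hot columns, which is exactly the mismatch the warning above is listing.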

CodePudding user response:

I think the transform method is supposed to only transform, not fit_transform.

Try this please:

BASE_TREE_MODEL = RandomForestRegressor()

class data_get_dummies(BaseEstimator, TransformerMixin):
    def __init__(self, columns: list = CATEGORICAL_FEATURES):
        self.columns = columns
        self.encoder = make_column_transformer(
            (OneHotEncoder(handle_unknown="ignore", sparse=False), self.columns),
            remainder='passthrough')

    def fit(self, X, y=None):
        self.encoder.fit(X)
        return self

    def transform(self, X, y=None) -> pd.DataFrame:
        X_ = X.copy()
        # transform only: the encoder was already fitted in fit()
        df_temp = pd.DataFrame(self.encoder.transform(X_),
                               columns=self.encoder.get_feature_names_out())
        return df_temp
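
As a sanity check you can also skip the custom wrapper entirely and put the ColumnTransformer straight into the pipeline, which the question already reports as working. A minimal sketch, assuming the imports and the CATEGORICAL_FEATURES, X_train, y_train from the question:

pipe_ct = Pipeline([
    ('encode', make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore", sparse=False), CATEGORICAL_FEATURES),
        remainder='passthrough')),
    ('model', RandomForestRegressor())
])

GridSearchCV(pipe_ct, {}, cv=5, n_jobs=-1).fit(X_train, y_train)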