I want to build a whole pipeline with custom transformers, but I've found that some of the transformers I've built only work when cross-validation is not involved, i.e.:
pipe.fit(X_train, y_train)
WORKS, while
GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1).fit(X_train, y_train)
results in an ERROR.
In the example, I'm using OneHotEncoder, which works fine when used directly with make_column_transformer or ColumnTransformer, but fails when it's wrapped in the custom transformer.
Code:
BASE_TREE_MODEL = RandomForestRegressor()
class data_get_dummies(BaseEstimator, TransformerMixin):
    def __init__(self, columns: list = CATEGORICAL_FEATURES):
        self.columns = columns
        self.encoder = make_column_transformer(
            (OneHotEncoder(handle_unknown="ignore", sparse=False), self.columns),
            remainder='passthrough')

    def fit(self, X, y=None):
        self.encoder.fit(X)
        return self

    def transform(self, X, y=None) -> pd.DataFrame:
        X_ = X.copy()
        df_temp = pd.DataFrame(self.encoder.fit_transform(X_),
                               columns=self.encoder.get_feature_names_out())
        return df_temp
data_get_dummies_ = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse=False), CATEGORICAL_FEATURES),
    remainder='passthrough')

pipe = Pipeline([
    ('start', data_get_dummies()),
    ('model', BASE_TREE_MODEL)
])
param_grid = dict()
grid_search = GridSearchCV(pipe, param_grid, cv=5, verbose=1, n_jobs=-1)
cv_model = grid_search.fit(X_train, y_train)
print('Pipeline:')
print(cv_model.best_estimator_)
print('----------------------')
print('Score:')
print(cv_model.best_score_)
Error:
/Users/simado/opt/anaconda3/envs/tensorflow/lib/python3.10/site-packages/sklearn/base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names unseen at fit time:
- onehotencoder__Pastato energijos suvartojimo klase:_E
- onehotencoder__Pastato tipas:_Karkasinis
- onehotencoder__Sildymas:_Geoterminis, kita, centrinis kolektorinis
Feature names seen at fit time, yet now missing:
- onehotencoder__Artimiausi darzeliai_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Artimiausios mokyklos_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Artimiausios parduotuves_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Artimiausios stoteles_3_Viesoji istaiga "Sarmatika"
- onehotencoder__Gatve_Virsilu g.
- ...
I've been trying to figure this out for a few days now and I'm stuck :(
CodePudding user response:
The issue is that you are refitting your OneHotEncoder every time you call transform:
df_temp=pd.DataFrame(self.encoder.fit_transform(X_),
columns=self.encoder.get_feature_names_out())
Thus, when you encounter unseen values for your categorical features in testing/CV, your output will have different dimensions than in training, and an error will be raised. You should not retrain your encoder in testing, just transform:
df_temp=pd.DataFrame(self.encoder.transform(X_),
columns=self.encoder.get_feature_names_out())
CodePudding user response:
The transform method is supposed to only transform, not fit_transform. Try this please:
BASE_TREE_MODEL = RandomForestRegressor()

class data_get_dummies(BaseEstimator, TransformerMixin):
    def __init__(self, columns: list = CATEGORICAL_FEATURES):
        self.columns = columns
        self.encoder = make_column_transformer(
            (OneHotEncoder(handle_unknown="ignore", sparse=False), self.columns),
            remainder='passthrough')

    def fit(self, X, y=None):
        self.encoder.fit(X)
        return self

    def transform(self, X, y=None) -> pd.DataFrame:
        X_ = X.copy()
        df_temp = pd.DataFrame(self.encoder.transform(X_),
                               columns=self.encoder.get_feature_names_out())
        return df_temp
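As a quick end-to-end check, here is a self-contained sketch (toy data, made-up column names, a renamed class) showing that the corrected fit/transform split survives GridSearchCV: the encoder is fitted once per training fold in fit() and only applied in transform(), so every fold produces the same columns.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

class DataGetDummies(BaseEstimator, TransformerMixin):
    def __init__(self, columns=("city",)):
        self.columns = columns

    def fit(self, X, y=None):
        # Fit the encoder once, on the training fold only.
        self.encoder_ = make_column_transformer(
            (OneHotEncoder(handle_unknown="ignore"), list(self.columns)),
            remainder="passthrough")
        self.encoder_.fit(X)
        return self

    def transform(self, X, y=None):
        # Transform only -- unseen categories become all-zero columns.
        return self.encoder_.transform(X)

rng = np.random.default_rng(0)
X = pd.DataFrame({"city": rng.choice(["a", "b", "c", "d"], size=40),
                  "size": rng.random(40)})
y = rng.random(40)

pipe = Pipeline([("start", DataGetDummies()),
                 ("model", RandomForestRegressor(n_estimators=10, random_state=0))])
grid = GridSearchCV(pipe, {}, cv=5).fit(X, y)
print(grid.best_score_)  # runs through all 5 folds without a feature-name error
```

One note on the original code: sklearn clones the pipeline for each CV fold, so it is also safer to create fitted state (the encoder) in fit() with a trailing-underscore name, as above, rather than in `__init__`.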