I have created a pipeline using sklearn so that multiple models can go through it. Since there is a vectorization step before the model is fitted, I wonder whether this vectorization is performed every single time before model fitting? If so, maybe I should take this preprocessing step out of the pipeline.
log_reg = LogisticRegression()
rand_for = RandomForestClassifier()
lin_svc = LinearSVC()
svc = SVC()
# The pipeline contains both vectorization model and classifier
pipe = Pipeline(
    [
        ('vect', tfidf),
        ('classifier', log_reg)
    ]
)
# params dictionary example
params_log_reg = {
    'classifier__penalty': ['l2'],
    'classifier__C': [0.01, 0.1, 1.0, 10.0, 100.0],
    'classifier__class_weight': ['balanced', class_weights],
    'classifier__solver': ['lbfgs', 'newton-cg'],
    # 'classifier__verbose': [2],
    'classifier': [log_reg]
}
params = [params_log_reg, params_rand_for, params_lin_svc, params_svc] # param dictionaries for each model
# Grid search to combine it all
grid = GridSearchCV(
    pipe,
    params,
    cv=skf,
    scoring='f1_weighted')
grid.fit(features_train, labels_train[:,0])
CodePudding user response:
When you are running a GridSearchCV, the pipeline steps are recomputed for every combination of hyperparameters. So yes, this vectorization process will be done every time the pipeline is fitted.
Have a look at the sklearn documentation on Pipelines and composite estimators.
To quote:
Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each transformer after calling fit. This feature is used to avoid computing the fit transformers within a pipeline if the parameters and input data are identical. A typical example is the case of a grid search in which the transformers can be fitted only once and reused for each configuration.
So you can use the memory parameter to cache the transformers:
from tempfile import mkdtemp

cachedir = mkdtemp()  # temporary directory for the cached transformer fits
pipe = Pipeline(estimators, memory=cachedir)
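As a minimal end-to-end sketch of how this fits together (using a toy corpus and a reduced parameter grid in place of your features_train / labels_train and full params list):

```python
from tempfile import mkdtemp
from shutil import rmtree

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus and labels, standing in for features_train / labels_train
texts = ["good movie", "bad movie", "great film", "awful film",
         "nice plot", "terrible plot", "loved it", "hated it"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

cachedir = mkdtemp()  # temporary directory holding the cached transformer fits
pipe = Pipeline(
    [
        ('vect', TfidfVectorizer()),
        ('classifier', LogisticRegression())
    ],
    memory=cachedir
)

# Only classifier hyperparameters vary, so the fitted vectorizer
# can be reused from the cache across all combinations
params = {
    'classifier__C': [0.1, 1.0, 10.0],
}

grid = GridSearchCV(pipe, params, cv=2, scoring='f1_weighted')
grid.fit(texts, labels)
print(grid.best_params_)

rmtree(cachedir)  # clean up the cache when the search is done
```

Note that caching only pays off when the transformer's own parameters and input data are identical across fits; if you also grid-search over vectorizer parameters (e.g. `vect__ngram_range`), each distinct setting still has to be fitted once.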