Is preprocessing repeated in a Pipeline each time a new ML model is loaded?


I have created a pipeline using sklearn so that multiple models will go through it. Since vectorization happens before the model is fitted, I wonder whether this vectorization is performed every time a model is fitted. If so, maybe I should move this preprocessing out of the pipeline.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer()

log_reg = LogisticRegression()
rand_for = RandomForestClassifier()
lin_svc = LinearSVC()
svc = SVC()

# The pipeline contains both the vectorizer and the classifier
pipe = Pipeline(
    [
        ('vect', tfidf),
        ('classifier', log_reg)
    ]
)

# params dictionary example
params_log_reg = {
    'classifier__penalty': ['l2'],
    'classifier__C': [0.01, 0.1, 1.0, 10.0, 100.0],
    'classifier__class_weight': ['balanced', class_weights],
    'classifier__solver': ['lbfgs', 'newton-cg'],
    # 'classifier__verbose': [2],
    'classifier': [log_reg]
}

params = [params_log_reg, params_rand_for, params_lin_svc, params_svc] # param dictionaries for each model

# Grid search to combine it all
grid = GridSearchCV(
    pipe,
    params,
    cv=skf,
    scoring='f1_weighted')

grid.fit(features_train, labels_train[:,0])

CodePudding user response:

When you run a GridSearchCV, the pipeline steps are refit for every combination of hyperparameters and for every cross-validation fold. So yes, the vectorization step is repeated every time the pipeline is fitted.
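If you want to convince yourself of this, here is a minimal sketch (not from the original question) that wraps TfidfVectorizer in a small transformer which counts how often `fit` is called; the counter grows with each hyperparameter candidate and each CV fold:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

class CountingTfidf(BaseEstimator, TransformerMixin):
    """TfidfVectorizer wrapper that counts fit calls (illustration only)."""
    fit_calls = 0  # class-level counter, shared across the clones GridSearchCV makes

    def fit(self, X, y=None):
        CountingTfidf.fit_calls += 1
        self.vect_ = TfidfVectorizer().fit(X)
        return self

    def transform(self, X):
        return self.vect_.transform(X)

# Toy data, just to make the sketch runnable
texts = ["good movie", "bad movie", "great film", "terrible film"] * 5
labels = [1, 0, 1, 0] * 5

pipe = Pipeline([('vect', CountingTfidf()), ('classifier', LogisticRegression())])
grid = GridSearchCV(pipe, {'classifier__C': [0.1, 1.0, 10.0]}, cv=2)
grid.fit(texts, labels)

# 3 candidates x 2 folds + 1 final refit = 7 fits of the vectorizer
print(CountingTfidf.fit_calls)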

Have a look at the scikit-learn documentation on Pipelines and composite estimators.

To quote:

Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each transformer after calling fit. This feature is used to avoid computing the fit transformers within a pipeline if the parameters and input data are identical. A typical example is the case of a grid search in which the transformers can be fitted only once and reused for each configuration.

So you can set the memory parameter to cache the fitted transformers.

from tempfile import mkdtemp

cachedir = mkdtemp()
pipe = Pipeline([('vect', tfidf), ('classifier', log_reg)], memory=cachedir)
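For completeness, here is a minimal sketch of how the cached pipeline could be wired into a grid search like the one in the question, with the temporary cache directory removed afterwards (the data and parameter grid are placeholders, not the original ones):

from shutil import rmtree
from tempfile import mkdtemp

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

cachedir = mkdtemp()
pipe = Pipeline(
    [('vect', TfidfVectorizer()), ('classifier', LogisticRegression())],
    memory=cachedir,
)

# With identical input data and transformer parameters, the fitted TF-IDF step
# is read from the cache instead of being recomputed for every candidate.
params = {'classifier__C': [0.1, 1.0, 10.0]}
grid = GridSearchCV(pipe, params, cv=2, scoring='f1_weighted')

# Placeholder data, just to make the sketch runnable
texts = ["good movie", "bad movie", "great film", "terrible film"] * 5
labels = [1, 0, 1, 0] * 5
grid.fit(texts, labels)

# Clear the cache directory once the search is done.
rmtree(cachedir)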