Using Pipeline in a cross_validate() function for testing different ML algorithms


I have a dataset that contains 17 features (x) and binary classification results (y). I already prepared the dataset and performed train_test_split() on it. I'm using the following script to run different ML algorithms on the dataset to compare between them:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

def run_exps(X_train: pd.DataFrame, y_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.DataFrame) -> pd.DataFrame:

  """Lightweight script to test many models and find winners.

  :param X_train: training split
  :param y_train: training target vector
  :param X_test: test split
  :param y_test: test target vector
  :return: DataFrame of predictions
  """
  models = [
            ('LogReg', LogisticRegression()),
            ('RF', RandomForestClassifier()),
            ('KNN - Euclidean', KNeighborsClassifier(metric='euclidean')),
            ('SVM', SVC()),
            ('XGB', XGBClassifier(use_label_encoder=False, eval_metric='error'))
            ]

  names = []
  scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc']

  # For loop that takes each model and performs training, cross-validation, prediction and evaluation
  for name, model in models:

    # Make a pipeline that normalizes and oversamples the dataset within each fold
    pipe = Pipeline([
            ('normalization', MinMaxScaler()),
            ('oversampling', SMOTE())
    ])

    kfold = StratifiedKFold(n_splits=5)

    # How can I call the pipeline inside the cross_validate() function?
    cv_results = cross_validate(model, X_train, y_train, cv=kfold, scoring=scoring, verbose=3)

    clf = model.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print('''
    {}
    {}
    {}
    '''.format(name, classification_report(y_test, y_pred), confusion_matrix(y_test, y_pred)))

    names.append(name)

I have noticed that the data I'm using needs to be normalized and oversampled before running the models.

However, since I'm using the cross_validate() function inside the script, normalization and oversampling need to be performed separately within each fold (otherwise the folds leak information into one another).

To do so, I created a pipeline (which normalizes and oversamples the dataset) inside the for loop that trains, cross-validates and evaluates each model. However, I'm not sure how to call the pipeline, since the estimator parameter of cross_validate() already takes the model variable to make predictions.

What should I do in this case?

CodePudding user response:

You could integrate your model into the pipeline as its final step and then call cross_validate() on the pipeline itself. Note that because SMOTE is a resampler (it implements fit_resample rather than transform), the Pipeline must come from imblearn.pipeline, not sklearn.pipeline:

pipe = Pipeline([
        ('normalization', MinMaxScaler()),
        ('oversampling', SMOTE()),
        ('model', model)  # the estimator is simply the last pipeline step
])

cv_results = cross_validate(pipe, X_train, y_train, cv=kfold, scoring=scoring, verbose=3)
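
A minimal, self-contained sketch of this pattern on synthetic data. It uses only scikit-learn steps (the scaler alone is enough to show the mechanics); to include SMOTE as in the question, swap in imblearn.pipeline.Pipeline and add the ('oversampling', SMOTE()) step:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline  # use imblearn.pipeline.Pipeline if SMOTE is a step
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 17-feature binary dataset in the question
X, y = make_classification(n_samples=200, n_features=17, random_state=0)

model = LogisticRegression(max_iter=1000)

# The preprocessing and the estimator live in one pipeline, so each CV fold
# re-fits the scaler on its own training portion -- no leakage across folds.
pipe = Pipeline([
    ('normalization', MinMaxScaler()),
    ('model', model),
])

kfold = StratifiedKFold(n_splits=5)
cv_results = cross_validate(pipe, X, y, cv=kfold,
                            scoring=['accuracy', 'f1_weighted'])

# cv_results is a dict with one entry per metric, each holding 5 fold scores
print(cv_results['test_accuracy'].mean())
```

Because the whole pipeline is the estimator handed to cross_validate(), nothing else in the loop changes: each model in the models list just becomes the last step of a fresh pipeline.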