Is it necessary to use cross-validation after data is split using StratifiedShuffleSplit?

I used StratifiedShuffleSplit to split my data, and now I am wondering whether I still need to use cross-validation when building the classification model (Logistic Regression, KNN, Random Forest, etc.). I am confused because, reading the scikit-learn documentation, I get the impression that StratifiedShuffleSplit is a mix of splitting the data and cross-validating it at the same time.

CodePudding user response:

StratifiedShuffleSplit just provides you with train/test indices for each split; how they are used is up to you.
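As a minimal sketch of what that means (the toy X and y here are illustrative, not from the original answer): each iteration of split() yields one pair of index arrays.

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

# Two stratified random splits; split() yields (train_indices, test_indices)
sss = StratifiedShuffleSplit(n_splits=2, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
    print("train indices:", train_index, "test indices:", test_index)

From there, you have two options: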

  1. You can fit the model on each train set, predict on the corresponding test set, and compute the score manually, i.e. implement cross-validation yourself.
  2. Or you can pass a StratifiedShuffleSplit() instance as the cv argument of cross_val_score, and cross_val_score will do the same thing for you.

Example:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=1, random_state=1)

# Calculate scores automatically
accuracy_per_split = cross_val_score(model, X, y, scoring="accuracy", cv=sss, n_jobs=1)
print(f"Accuracies per splits: {accuracy_per_split}")

# Calculate scores manually
accuracy_per_split = []
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    accuracy_per_split.append(acc)

print(f"Accuracies per splits: {accuracy_per_split}")
