CV=integer vs predefined splits in GridSearchCV

Time:04-01

What's the difference between setting cv=some integer and cv=PredefinedSplit(test_fold=your_test_fold)?

Is there any advantage of one over the other? Does cv=some integer set the splits randomly?

CodePudding user response:

Specifying an integer produces k-fold cross-validation without shuffling, so the splits are not random. For regressors this is sklearn.model_selection.KFold; when the estimator is a classifier and the target is binary or multiclass, GridSearchCV uses StratifiedKFold instead, as described in its documentation. Whether to shuffle before splitting depends on the data: if the rows are sorted (for example by class or by target value), shuffling is needed so that each fold is representative, while if the samples are correlated due to spatial or temporal sampling effects, shuffling can give an optimistic view of performance.
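As a rough illustration of how an integer cv is resolved, the sketch below uses sklearn.model_selection.check_cv, the public helper that turns a cv argument into a splitter. The toy arrays X and y_class are made up for the example, and the shuffled splitter at the end is simply one option you could pass explicitly, not something GridSearchCV does on its own.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Toy data (hypothetical): 10 samples, sorted by class label.
X = np.arange(20).reshape(10, 2)
y_class = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# An integer with no classification target resolves to plain KFold.
cv_reg = check_cv(5)

# With a classifier and a binary/multiclass target it resolves to StratifiedKFold.
cv_clf = check_cv(5, y_class, classifier=True)

print(type(cv_reg).__name__, "shuffle =", cv_reg.shuffle)
print(type(cv_clf).__name__, "shuffle =", cv_clf.shuffle)

# Without shuffling, the folds are contiguous blocks in row order.
first_test = next(cv_reg.split(X))[1]
print("first unshuffled test fold:", first_test)

# To shuffle, pass a splitter explicitly instead of an integer.
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
print("first shuffled test fold:", next(shuffled.split(X))[1])
```

Passing the splitter object (rather than an integer) to GridSearchCV's cv parameter is how you would opt into shuffling, with random_state fixed if you want reproducible folds.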

I would avoid using PredefinedSplit unless you have a very good reason to predefine your splits. Other CV generators can probably meet your needs, for example StratifiedKFold if you want to preserve the class distribution in each fold.
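For comparison, here is a minimal sketch of both options passed to GridSearchCV. The data, the test_fold assignment, the LogisticRegression estimator, and the parameter grid are all assumptions chosen for illustration. In test_fold, -1 means the sample is never placed in a test set, and any other value is the index of the test fold that sample belongs to.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit, StratifiedKFold

# Toy data (hypothetical): 40 samples, 3 features, alternating class labels.
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
y = np.tile([0, 1], 20)

# Hypothetical fold assignment: the first 20 rows are always used for training,
# rows 20-29 form test fold 0 and rows 30-39 form test fold 1.
test_fold = np.array([-1] * 20 + [0] * 10 + [1] * 10)
predefined = PredefinedSplit(test_fold=test_fold)
n_predefined = predefined.get_n_splits()
first_test = list(predefined.split())[0][1]

# A generator-based alternative that keeps class proportions in every fold.
stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

param_grid = {"C": [0.1, 1.0]}
search_pre = GridSearchCV(LogisticRegression(), param_grid, cv=predefined).fit(X, y)
search_strat = GridSearchCV(LogisticRegression(), param_grid, cv=stratified).fit(X, y)

print("predefined splits:", search_pre.n_splits_)
print("stratified splits:", search_strat.n_splits_)
```

The predefined version only makes sense when the fold membership carries meaning of its own, such as a fixed validation set; otherwise the generator-based splitter is less error-prone.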
