In the context of model selection for a classification problem, while running cross-validation, is it OK to specify n_jobs=-1
in both the model specification and the cross-validation function, in order to take full advantage of the power of the machine?
For example, comparing sklearn RandomForestClassifier and xgboost XGBClassifier:
RF_model = RandomForestClassifier(..., n_jobs=-1)
XGB_model = XGBClassifier(..., n_jobs=-1)
RF_cv = cross_validate(RF_model, ..., n_jobs=-1)
XGB_cv = cross_validate(XGB_model, ..., n_jobs=-1)
Is it OK to specify the parameter in both? Or should I specify it only once? And if only once, in which of them, the model or the cross-validation call?
I used models from two different libraries (sklearn and xgboost) in the example because there may be a difference in how each handles it; the cross_validate
function itself is from sklearn.
CodePudding user response:
By setting n_jobs=-1, you are asking for the maximum number of CPU cores to be used in parallel to train and evaluate the models. This can speed up the training and evaluation process, especially if you have a large dataset.
However, the two settings are not the same knob, and the value passed to cross_validate does not override the estimator's value. The estimator's n_jobs controls parallelism inside a single fit (for RandomForestClassifier, growing trees in parallel; for XGBClassifier, the number of threads used when building each model), while cross_validate's n_jobs parallelizes across the CV folds, fitting several models at once.
Because the two levels multiply, setting n_jobs=-1 in both places can oversubscribe the CPU: each of the parallel fold workers tries to claim every core. joblib mitigates the worst of this by limiting nested parallelism, but it is still wasteful and harder to reason about.
So the practical recommendation is to specify n_jobs=-1 only once. Setting it on cross_validate and leaving the estimator at its default is usually the simplest choice, since parallelizing over independent folds scales cleanly; the reverse (parallel estimator, serial CV) can make sense when a single fit is very expensive and the number of folds is small.
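As a minimal sketch of the recommended setup (using a small synthetic dataset for illustration), n_jobs is passed only to cross_validate and the estimator is left at its default, serial setting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Small synthetic dataset just for demonstration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Leave the estimator's n_jobs at its default (serial fit)...
rf = RandomForestClassifier(n_estimators=50, random_state=0)

# ...and let cross_validate parallelize across the 5 CV folds instead.
scores = cross_validate(rf, X, y, cv=5, n_jobs=-1)
print(scores["test_score"])
```

Each fold's fit then runs in its own worker process, so all cores are kept busy without the nested-parallelism overlap you would get from setting n_jobs=-1 in both places.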