Searching hyperparameters for models with group based validation methods in sklearn

Time:12-11

I would like to perform hyperparameter optimisation for a model I have trained in Scikit-Learn. I want to first use a random search to get an idea of a good area to search in, and then follow it up with a grid search. The validation method I need to use is Leave One Group Out (LOGO).

So something to this effect:

distributions = {
    "n_estimators": randint(low=1, high=500),
    "criterion": ["squared_error", "absolute_error", "poisson"],
    "max_depth": randint(low=1, high=100)
}

random_search = RandomizedSearchCV(
    forest_reg, 
    distributions, 
    cv=LeaveOneGroupOut(), 
    groups=group, 
    scoring="neg_mean_squared_error", 
    return_train_score=True, 
    random_state=42,
    n_jobs=-1,
    n_iter=20
)

random_search.fit(X, y)

Neither RandomizedSearchCV nor GridSearchCV offers support for LOGO validation with a definition of groups. When I use a function such as cross_val_score(), I can pass in a chosen cross-validation method like so:

scores = cross_val_score(
    forest_reg, 
    X, 
    y, 
    scoring="neg_mean_squared_error", 
    cv=LeaveOneGroupOut(), 
    groups=group, 
    n_jobs=-1
)

Is there a reason that the same is not supported by either of the hyperparameter search methods? Am I using the API in the wrong way? Is there a way to achieve what I want using sklearn, without kludging something together myself?

CodePudding user response:

Groups should be passed into the fit() method when using LeaveOneGroupOut.

The RandomizedSearchCV.fit() documentation specifies that the groups parameter is only used in conjunction with a "Group" cv instance such as GroupKFold or LeaveOneGroupOut. The same applies to GridSearchCV.fit().

See example below:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, LeaveOneGroupOut
import numpy as np

params = {
    "n_estimators": [1, 5, 10],
    "max_depth": [2, 5, 10]
}

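# Toy regression data with a random group label (0-4) for each sample
X, y = make_regression()
groups = np.random.randint(5, size=y.shape)
cv = RandomizedSearchCV(RandomForestRegressor(),
                        params,
                        cv=LeaveOneGroupOut()
)

# groups is passed to fit(), not to the constructor; it is forwarded to the LOGO splitter
cv.fit(X, y, groups=groups)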
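
The random-then-grid workflow from the question could follow the same pattern. Below is a minimal sketch, not a definitive recipe: the data and group labels are synthetic stand-ins, and the way the grid is built around the best random-search point (plus or minus a small step) is an arbitrary choice made for illustration.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut, RandomizedSearchCV

# Synthetic stand-ins for the real X, y and group labels (assumption).
X, y = make_regression(n_samples=60, n_features=5, random_state=0)
groups = np.arange(60) % 4  # four groups, so LOGO produces four splits

logo = LeaveOneGroupOut()

# Stage 1: coarse random search; groups is passed to fit().
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": randint(5, 40), "max_depth": randint(1, 10)},
    n_iter=5,
    cv=logo,
    scoring="neg_mean_squared_error",
    random_state=42,
)
random_search.fit(X, y, groups=groups)
best_n = int(random_search.best_params_["n_estimators"])
best_depth = int(random_search.best_params_["max_depth"])

# Stage 2: small grid around the best point found above (illustrative step sizes).
grid = {
    "n_estimators": sorted({max(1, best_n - 5), best_n, best_n + 5}),
    "max_depth": sorted({max(1, best_depth - 1), best_depth, best_depth + 1}),
}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    grid,
    cv=logo,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y, groups=groups)

print(grid_search.best_params_, grid_search.n_splits_)
```

The fitted n_splits_ attribute gives a quick check that the groups were actually used: with LeaveOneGroupOut it should equal the number of distinct groups.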