Searching hyperparameters for models with group based validation methods in sklearn-CodePudding

I would like to perform hyperparamter optimisation for a model I have trained in Scikit-Learn. I want to first use a random search to get an idea of a good area to search in and then follow it up with a grid search. The method of validation I need to use is Leave One Group Out (LOGO).

So something to this effect:

distributions = {
    "n_estimators": randint(low=1, high=500),
    "criterion": ["squared_error", "absolute_error", "poisson"],
    "max_depth": randint(low=1, high=100)
}

random_search = RandomizedSearchCV(
    forest_reg, 
    distributions, 
    cv=LeaveOneGroupOut(), 
    groups=group, 
    scoring="neg_mean_squared_error", 
    return_train_score=True, 
    random_state=42,
    n_jobs=-1,
    n_iter=20
)

random_search.fit(X, y)

Neither RandomizedSearchCV or GridSearchCV offer support for LOGO validation with definition of groups. When I use a method such as cross_val_score() I can send in a chosen cross validation method like so

scores = cross_val_score(
    forest_reg, 
    X, 
    y, 
    scoring="neg_mean_squared_error", 
    cv=LeaveOneGroupOut(), 
    groups=group, 
    n_jobs=-1
)

Is there a reason that the same is not supported with either of the hyperparameter search methods? Am I using the API in the wrong way? Is there a way to achieve what I want using sklearn, without cludging something together myself?

CodePudding user response：

Groups should be passed into the fit() method when using LeaveOneGroupOut.

RandomizedSearchCV.fit() documentation specify that the parameter groups should be used only in conjunction with a “Group” cv instance such as GroupKFold or LeaveOneGroupOut.

See example below:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, LeaveOneGroupOut
import numpy as np

params = {
    "n_estimators": [1, 5, 10],
    "max_depth": [2, 5, 10]
}

X, y = make_regression()
groups = np.random.randint(5, size=y.shape)
cv = RandomizedSearchCV(RandomForestRegressor(),
                        params,
                        cv=LeaveOneGroupOut()
)

cv.fit(X, y, groups=groups)