I think there are repeated combinations, no?

I am reading Hands-On Machine Learning by Aurélien Géron. In Chapter 2, on page 142, he gives the following code for the hyperparameter combinations to search:

param_grid = [
  {'preprocessing__geo__n_clusters': [5, 8, 10],
   'random_forest__max_features': [4, 6, 8]},
  {'preprocessing__geo__n_clusters': [10, 15],
   'random_forest__max_features': [6, 8, 10]},
]

I think some combinations are repeated, or am I missing something?

CodePudding user response：

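Yes, this grid contains duplicates. The two dicts overlap where n_clusters is 10 and max_features is 6 or 8, so those two combinations are generated twice.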

You can check by enumerating them:

from sklearn.model_selection import ParameterGrid

param_grid = [
    {"preprocessing__geo__n_clusters": [5, 8, 10],
     "random_forest__max_features": [4, 6, 8]},
    {"preprocessing__geo__n_clusters": [10, 15],
     "random_forest__max_features": [6, 8, 10]},
]

for params in ParameterGrid(param_grid=param_grid):
    print(params)

This prints (abridged):

{'preprocessing__geo__n_clusters': 5, 'random_forest__max_features': 4}
...
{'preprocessing__geo__n_clusters': 10, 'random_forest__max_features': 6}
{'preprocessing__geo__n_clusters': 10, 'random_forest__max_features': 8}
{'preprocessing__geo__n_clusters': 10, 'random_forest__max_features': 6}
{'preprocessing__geo__n_clusters': 10, 'random_forest__max_features': 8}
...
{'preprocessing__geo__n_clusters': 15, 'random_forest__max_features': 10}
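
If you would rather not spot the repeats by eye, here is a minimal sketch that counts them. It uses only ParameterGrid and the standard library; the names counts, duplicates and deduplicated_grid are mine, not the book's. As far as I know GridSearchCV does not drop repeated candidates itself, so the last line shows one possible way (an assumption on my part, not the author's method) to build an equivalent grid without repeats.

```python
from collections import Counter

from sklearn.model_selection import ParameterGrid

param_grid = [
    {"preprocessing__geo__n_clusters": [5, 8, 10],
     "random_forest__max_features": [4, 6, 8]},
    {"preprocessing__geo__n_clusters": [10, 15],
     "random_forest__max_features": [6, 8, 10]},
]

# Turn each candidate dict into a hashable, order-independent key and count it.
counts = Counter(
    tuple(sorted(params.items())) for params in ParameterGrid(param_grid)
)

# Candidates that the grid generates more than once.
duplicates = [dict(key) for key, n in counts.items() if n > 1]
total = sum(counts.values())
unique = len(counts)

print(f"{total} candidates, {unique} unique")
for params in duplicates:
    print("repeated:", params)

# One single-value dict per unique candidate: same search space, no repeats.
deduplicated_grid = [{name: [value] for name, value in key} for key in counts]
```

The grid yields 9 + 6 = 15 candidates, of which 13 are distinct. Passing deduplicated_grid to GridSearchCV instead of param_grid should search the same 13 settings without fitting the two repeated ones twice.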

CodePudding user response：

(Alternate answer for people reading the 2nd Edition.)

The 2nd Edition of Aurélien Géron's "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" uses a different param_grid, which does not have this issue.

The 2nd Edition describes grid search with cross-validation on p. 76 of Chapter 2, using:

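from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    # 3 x 4 = 12 combinations, with bootstrap left at its default (True)
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # 2 x 3 = 6 combinations, with bootstrap set to False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()

# 5-fold cross-validation over every candidate in param_grid
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

# housing_prepared and housing_labels come from earlier in the book's chapter
grid_search.fit(housing_prepared, housing_labels)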

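Since bootstrap=True is the default, every combination from the first dict runs with bootstrap=True, while every combination from the second dict sets bootstrap=False. The two dicts therefore cannot produce the same combination, so this param_grid does not have the issue: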

{'max_features': 2, 'n_estimators': 3}
{'max_features': 2, 'n_estimators': 10}
...
{'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
{'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
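
To double-check that claim, here is a small sketch that enumerates the 2nd Edition grid and compares the number of candidates with the number of distinct ones. The variable names candidates, keys and with_bootstrap_key are only for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

candidates = list(ParameterGrid(param_grid))

# A set of hashable keys collapses any repeated candidate into one entry.
keys = {tuple(sorted(params.items())) for params in candidates}

# Candidates from the second dict are the only ones carrying a 'bootstrap' key.
with_bootstrap_key = [params for params in candidates if "bootstrap" in params]

print(len(candidates), len(keys))  # prints: 18 18
```

Both numbers are 18 (12 from the first dict plus 6 from the second), so no candidate is generated twice. Note that max_features 2 and 4 with n_estimators 3 and 10 appear in both dicts, and it is only the bootstrap setting that keeps those candidates distinct.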