I'm running a grid search on AdaBoost with a DecisionTreeClassifier as its base learner, to find the best parameters for both AdaBoost and the decision tree. The search on a dataset of shape (130000, 22) has been running for 18 hours, so I'm wondering whether that's just a typical wait for training or whether there's an issue with the setup. Does this code look OK? Maybe the min_samples_leaf search range is too wide?
ada_params = {"base_estimator__criterion" : ["gini", "entropy"],
"base_estimator__splitter" : ["best", "random"],
"base_estimator__min_samples_leaf": [*np.arange(100,1500,100)],
"base_estimator__max_depth": [5,10,13,15],
"base_estimator__max_features": [5,10,15],
"n_estimators": [500, 700, 1000, 1500],
"learning_rate": [0.001, 0.01, 0.1, 0.3]
}
dt_base_learner = DecisionTreeClassifier(random_state=42, max_features="auto", class_weight="balanced")
ada_clf = AdaBoostClassifier(base_estimator=dt_base_learner)
# kf, scaled_X_train and y_train are defined earlier (kf is a CV splitter)
ada_estimator = GridSearchCV(ada_clf, param_grid=ada_params, scoring="f1", cv=kf)
ada_estimator.fit(scaled_X_train, y_train)
CodePudding user response:
If I am not mistaken, your GridSearchCV tests 2 * 2 * 14 * 4 * 3 * 4 * 4 = 10,752 different model configurations, each cross-validated over an unknown number of splits (cv=kf), and each AdaBoost fit in turn trains 500 to 1500 base trees. You should definitely try to reduce the number of combinations in the GridSearchCV, or go for RandomizedSearchCV, or BayesSearchCV from skopt.
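To make that arithmetic concrete, here is a quick count of the fits this grid triggers; the 5-fold figure is an assumption, since kf isn't shown in the question:

import numpy as np

# Sizes of the seven lists in ada_params:
# criterion, splitter, min_samples_leaf, max_depth, max_features, n_estimators, learning_rate
grid_sizes = [2, 2, 14, 4, 3, 4, 4]
n_candidates = np.prod(grid_sizes)  # 10,752 parameter combinations
n_splits = 5                        # assumption: kf is a 5-fold splitter
print(n_candidates * n_splits)      # 53,760 AdaBoost fits, each training 500-1500 trees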
CodePudding user response:
GridSearchCV will not finish until every fit is done. Check the RandomizedSearchCV documentation: it caps the number of sampled parameter combinations (n_iter), which you can increase a few at a time, and set n_jobs=-1 to parallelize across all available cores:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
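A minimal sketch of that suggestion, reusing ada_params, ada_clf and kf from the question; n_iter=50 is an illustrative starting point, not a tuned value:

from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    ada_clf,
    param_distributions=ada_params,  # same grid as the question, sampled instead of enumerated
    n_iter=50,        # illustrative: 50 sampled combinations instead of all 10,752
    scoring="f1",
    cv=kf,            # the same CV splitter as in the question
    n_jobs=-1,        # use all available cores
    random_state=42,
    verbose=1,        # print progress so long runs stay visible
)
random_search.fit(scaled_X_train, y_train)
print(random_search.best_params_, random_search.best_score_)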