I'm running a grid search on AdaBoost with a DecisionTreeClassifier as its base learner, to find the best parameters for both AdaBoost and the decision tree. The search on a dataset of shape (130000, 22) has been running for 18 hours, so I'm wondering whether that's just a typical wait for training or whether there's an issue with the setup. Does this code look OK? Maybe the min_samples_leaf search range is too wide?
ada_params = {"base_estimator__criterion" : ["gini", "entropy"],
"base_estimator__splitter" : ["best", "random"],
"base_estimator__min_samples_leaf": [*np.arange(100,1500,100)],
"base_estimator__max_depth": [5,10,13,15],
"base_estimator__max_features": [5,10,15],
"n_estimators": [500, 700, 1000, 1500],
"learning_rate": [0.001, 0.01, 0.1, 0.3]
}
dt_base_learner = DecisionTreeClassifier(random_state=42, max_features="auto", class_weight="balanced")
ada_clf = AdaBoostClassifier(base_estimator=dt_base_learner)
# kf, scaled_X_train and y_train are defined earlier (kf is a CV splitter)
ada_estimator = GridSearchCV(ada_clf, param_grid=ada_params, scoring="f1", cv=kf)
ada_estimator.fit(scaled_X_train, y_train)
CodePudding user response:
If I am not mistaken, your GridSearchCV tests 2 * 2 * 14 * 4 * 3 * 4 * 4 = 10,752 different model configurations, each cross-validated over an unknown number of splits (cv=kf), and each AdaBoost fit in turn trains 500 to 1500 base trees. You should definitely try to reduce the number of combinations in the GridSearchCV, or go for RandomizedSearchCV, or BayesSearchCV from skopt.
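To make that arithmetic concrete, here is a quick count of the fits this grid triggers; the 5-fold figure is an assumption, since kf isn't shown in the question:

import numpy as np

# Sizes of the seven lists in ada_params:
# criterion, splitter, min_samples_leaf, max_depth, max_features, n_estimators, learning_rate
grid_sizes = [2, 2, 14, 4, 3, 4, 4]
n_candidates = np.prod(grid_sizes)  # 10,752 parameter combinations
n_splits = 5                        # assumption: kf is a 5-fold splitter
print(n_candidates * n_splits)      # 53,760 AdaBoost fits, each training 500-1500 trees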
CodePudding user response:
GridSearchCV will not finish until every fit is done. Check the RandomizedSearchCV documentation: it caps the number of sampled parameter combinations (n_iter), which you can increase a few at a time, and set n_jobs=-1 to parallelize across all available cores:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
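A minimal sketch of that suggestion, reusing ada_params, ada_clf and kf from the question; n_iter=50 is an illustrative starting point, not a tuned value:

from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    ada_clf,
    param_distributions=ada_params,  # same grid as the question, sampled instead of enumerated
    n_iter=50,        # illustrative: 50 sampled combinations instead of all 10,752
    scoring="f1",
    cv=kf,            # the same CV splitter as in the question
    n_jobs=-1,        # use all available cores
    random_state=42,
    verbose=1,        # print progress so long runs stay visible
)
random_search.fit(scaled_X_train, y_train)
print(random_search.best_params_, random_search.best_score_)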