Python : GridSearchCV taking too long to finish running

Time:05-06

I'm attempting to do a grid search to optimize my model, but it's taking far too long to execute. My total dataset is only about 15,000 observations with about 30-40 variables. I was able to run a random forest through the grid search, which took about an hour and a half, but now that I've switched to SVC it has already run for over 9 hours and still isn't complete. Below is a sample of my code for the cross-validation:

from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.svm import SVC

SVM_Classifier= SVC(random_state=7)



param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1,0.1,0.01,0.001],
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'degree' : [0, 1, 2, 3, 4, 5, 6]}

grid_obj = GridSearchCV(SVM_Classifier,
                        return_train_score=True,
                        param_grid=param_grid,
                        scoring='roc_auc',
                        cv=3,
                        n_jobs=-1)

grid_fit = grid_obj.fit(X_train, y_train)
SVMC_opt = grid_fit.best_estimator_

print('='*20)
print("best estimator: " + str(grid_obj.best_estimator_))
print("best params: " + str(grid_obj.best_params_))
print('best score:', grid_obj.best_score_)
print('='*20)

I have already reduced the cross validation from 10 to 3, and I'm using n_jobs=-1 so I'm engaging all of my cores. Is there anything else I'm missing that I can do here to speed up the process?

CodePudding user response:

Unfortunately, SVC's fit time scales at least quadratically with the number of samples, so it is indeed extremely slow. Even the documentation warns that it may be impractical beyond tens of thousands of samples and suggests LinearSVC for large datasets, and you are in that ballpark.
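
As a rough sketch of what switching to LinearSVC could look like: the snippet below uses a synthetic dataset as a stand-in for your X_train and y_train (an assumption, since your data isn't shown), and the StandardScaler step and max_iter value are my own additions rather than anything from your code. LinearSVC only covers the linear kernel, so the grid shrinks to C alone.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the asker's data (assumption); raise n_samples
# to 15000 to mimic the real problem size.
X_train, y_train = make_classification(n_samples=2000, n_features=30,
                                       random_state=7)

# Scaling usually helps linear SVMs converge; max_iter is raised from the
# default to reduce convergence warnings at large C.
pipeline = make_pipeline(StandardScaler(),
                         LinearSVC(random_state=7, max_iter=10000))

# Only C remains to tune; gamma, kernel and degree do not apply here.
param_grid = {'linearsvc__C': [0.1, 1, 10, 100]}

grid_obj = GridSearchCV(pipeline,
                        param_grid=param_grid,
                        scoring='roc_auc',  # uses decision_function
                        cv=3,
                        n_jobs=-1)
grid_obj.fit(X_train, y_train)

print('best params:', grid_obj.best_params_)
print('best score:', grid_obj.best_score_)
```

Whether a linear model is accurate enough for your data is something only the scores can tell you, so treat this as a quick baseline rather than a replacement.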

You could also try increasing the kernel cache_size (the default is 200 MB). I would suggest timing a single SVC fit with different cache sizes to see whether you gain anything.
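
A minimal way to run that timing experiment might look like the following. The data is synthetic (an assumption), the helper name time_fit is made up for illustration, and the C and gamma values are just one point from your grid. With only 2,000 samples the whole kernel matrix fits in the default cache, so expect little difference here; set n_samples to 15000 to see whether the cache matters at your real size.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the asker's data (assumption).
X, y = make_classification(n_samples=2000, n_features=30, random_state=7)


def time_fit(cache_size_mb):
    """Fit one SVC with the given kernel cache size (in MB); return seconds."""
    clf = SVC(kernel='rbf', C=1, gamma=0.01,
              cache_size=cache_size_mb, random_state=7)
    start = time.perf_counter()
    clf.fit(X, y)
    return time.perf_counter() - start


# 200 MB is the scikit-learn default; the larger values are arbitrary picks.
timings = {mb: time_fit(mb) for mb in (200, 500, 1000, 2000)}

for mb, seconds in timings.items():
    print(f'cache_size={mb} MB: {seconds:.2f} s')
```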

EDIT: by the way, you are needlessly computing a lot of SVC fits with different degree values for kernels that ignore that parameter (every kernel except poly). I suggest splitting the grid into one run for poly and one for the other kernels; you will save a lot of time.
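
To put numbers on that, here is a sketch that counts the candidates in your original grid against a split version. GridSearchCV accepts a list of dicts as param_grid, so the split grid can be passed to it directly. The grouping goes one step further than the suggestion above by also dropping gamma for the linear kernel, which ignores it; the values themselves are copied from your grid.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.1, 1, 10, 100]
gamma_values = [1, 0.1, 0.01, 0.001]
degree_values = [0, 1, 2, 3, 4, 5, 6]

# The grid from the question: every kernel is crossed with every degree.
full_grid = {'C': C_values,
             'gamma': gamma_values,
             'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
             'degree': degree_values}

# One sub-grid per group of kernels, listing only the parameters they use.
split_grid = [
    {'kernel': ['linear'], 'C': C_values},
    {'kernel': ['rbf', 'sigmoid'], 'C': C_values, 'gamma': gamma_values},
    {'kernel': ['poly'], 'C': C_values, 'gamma': gamma_values,
     'degree': degree_values},
]

cv = 3
n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))

print(f'full grid:  {n_full} candidates, {n_full * cv} fits')
print(f'split grid: {n_split} candidates, {n_split * cv} fits')
```

With cv=3 this works out to 1344 fits for the original grid against 444 for the split one, so roughly two thirds of the fits were redundant.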

CodePudding user response:

While exploring LinearSVC might be a good choice (and you should clean up the parameter combinations as noted in the other answer), you could also use the GPU-accelerated SVC estimator in RAPIDS cuML on a GPU-enabled cloud instance of your choice (or locally if you have an NVIDIA GPU). This estimator can be dropped directly into your GridSearchCV call if you use the default n_jobs=1. (Disclaimer: I work on this project.)

For example, I ran the following on my local machine [0]:

import sklearn.datasets
import cuml
from sklearn.svm import SVC

X, y = sklearn.datasets.make_classification(n_samples=15000, n_features=30)
%timeit _ = SVC().fit(X, y).predict(X)
%timeit _ = cuml.svm.SVC().fit(X, y).predict(X)
8.68 s ± 64.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
366 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[0] System

  • CPU: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz, CPU(s): 12
  • GPU: Quadro RTX 8000