I'm attempting to run a grid search to optimize my model, but it's taking far too long to execute. My total dataset is only about 15,000 observations with about 30-40 variables. I was able to run a random forest through the grid search successfully in about an hour and a half, but now that I've switched to SVC it has already run for over 9 hours and it's still not complete. Below is a sample of my code for the cross validation:
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.svm import SVC
SVM_Classifier = SVC(random_state=7)

param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'degree': [0, 1, 2, 3, 4, 5, 6]}

grid_obj = GridSearchCV(SVM_Classifier,
                        return_train_score=True,
                        param_grid=param_grid,
                        scoring='roc_auc',
                        cv=3,
                        n_jobs=-1)
grid_fit = grid_obj.fit(X_train, y_train)
SVMC_opt = grid_fit.best_estimator_
print('='*20)
print("best estimator: " + str(grid_obj.best_estimator_))
print("best params: " + str(grid_obj.best_params_))
print('best score:', grid_obj.best_score_)
print('='*20)
I have already reduced the cross-validation folds from 10 to 3, and I'm using n_jobs=-1 to engage all of my cores. Is there anything else I'm missing that I can do here to speed up the process?
CodePudding user response:
Unfortunately, SVC's fit algorithm is O(n^2) at best, so it is indeed extremely slow. Even the documentation suggests using LinearSVC above ~10k samples, and you are right in that ballpark.
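For reference, a minimal LinearSVC sketch on synthetic data of roughly your shape (make_classification here is just a stand-in for your real dataset; dual=False is generally preferred when n_samples > n_features):

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Stand-in data roughly matching the question's size: ~15k rows, ~30 columns.
X, y = make_classification(n_samples=15000, n_features=30, random_state=7)

# LinearSVC uses liblinear, which scales far better with n_samples than the
# kernelized SVC (the trade-off: it can only fit a linear decision boundary).
clf = LinearSVC(C=1.0, dual=False, max_iter=10000).fit(X, y)
print(clf.score(X, y))
```

If a linear boundary is good enough for your data, this fit completes in seconds rather than hours.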
Maybe try to increase the kernel cache_size. I would suggest timing a single SVC fit with different cache sizes to see whether you can gain something.
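A rough timing sketch along those lines (synthetic data as a stand-in for yours; cache_size is in MB and scikit-learn's default is 200):

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Smaller stand-in dataset so each fit finishes quickly.
X, y = make_classification(n_samples=5000, n_features=30, random_state=7)

# Time one fit per cache size; pick the smallest size past which
# the fit time stops improving.
for cache_mb in (200, 500, 1000):
    clf = SVC(cache_size=cache_mb, random_state=7)
    start = time.perf_counter()
    clf.fit(X, y)
    print(f"cache_size={cache_mb} MB: {time.perf_counter() - start:.2f} s")
```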
EDIT: by the way, you are needlessly computing a lot of SVC fits with different degree parameter values, where that parameter will be ignored (all the kernels but poly). I suggest splitting the runs for poly and the other kernels; you will save a lot of time.
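One way to do the split: GridSearchCV also accepts a list of dicts, and searches each sub-grid separately, so degree only varies for the poly kernel. Reusing the question's settings, this cuts the combinations from 448 down to 144:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Two sub-grids: 'degree' only appears in the poly sub-grid, so the
# non-poly kernels are no longer refit once per degree value.
param_grid = [
    {'C': [0.1, 1, 10, 100],
     'gamma': [1, 0.1, 0.01, 0.001],
     'kernel': ['linear', 'rbf', 'sigmoid']},           # 4*4*3 = 48 combos
    {'C': [0.1, 1, 10, 100],
     'gamma': [1, 0.1, 0.01, 0.001],
     'kernel': ['poly'],
     'degree': [1, 2, 3, 4, 5, 6]},                     # 4*4*6 = 96 combos
]

grid_obj = GridSearchCV(SVC(random_state=7),
                        return_train_score=True,
                        param_grid=param_grid,
                        scoring='roc_auc',
                        cv=3,
                        n_jobs=-1)
```

(I also dropped degree=0 here, since a degree-0 polynomial kernel is degenerate.)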
CodePudding user response:
While exploring LinearSVC might be a good choice (and you should clean up the parameter combinations as noted in the other answer), you could also use a GPU-accelerated SVC estimator in RAPIDS cuML on a GPU-enabled cloud instance of your choice (or locally if you have an NVIDIA GPU). This estimator can be dropped directly into your GridSearchCV call if you use the default n_jobs=1. (Disclaimer: I work on this project.)
For example, I ran the following on my local machine [0]:
import sklearn.datasets
import cuml
from sklearn.svm import SVC
X, y = sklearn.datasets.make_classification(n_samples=15000, n_features=30)
%timeit _ = SVC().fit(X, y).predict(X)
%timeit _ = cuml.svm.SVC().fit(X, y).predict(X)
8.68 s ± 64.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
366 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[0] System
- CPU: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz, CPU(s): 12
- GPU: Quadro RTX 8000