Home > database >  Different features give the same Accuracy in SVM
Different features give the same Accuracy in SVM

Time:10-26

For a classification task I try to compare two models (with SVM) where the difference between the models is just one feature (namely, cult_distance & inst_dist). Even after HP-tuning (using GridSearchCV) both models show the exact same accuracy. When I try a third model with both cult_distance & inst_dist in the model, still the exact same accuracy. My code is below and in the screenshot you see the classification report. As you can see in the report, the minority class is not predicted. I think my model is overfitting but I don't know how to solve this. Note: also when I run the model without any other features than cult_distance & inst_dist, the accuracy is still the same.

features1 = ['inst_dist', 'cult_distance', 'deal_numb', 'acq_sic', 'acq_stat', 'acq_form',
       'tar_sic', 'tar_stat', 'tar_form', 'vendor_name', 'pay_method',
       'advisor', 'stake', 'deal_value', 'days', 'sic_sim']
X = data.loc[:, features1]
y = data.loc[:, ['deal_status']]

### training the model

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, train_size = .75)

# defining parameter range 

rfc = svm.SVC(random_state = 0)

param_grid = {'C': [0.001], 
              'gamma': ['auto'],
              'kernel': ['rbf'],
             'degree':[1]} 

grid = GridSearchCV(rfc, param_grid = param_grid, refit = True, verbose = 2,n_jobs=-1) 
   
# fitting the model for grid search 
grid.fit(X_train, y_train) 
 
# print best parameter after tuning  
grid_predictions = grid.predict(X_test) 
   
# print classification report 
print(classification_report(y_test, grid_predictions)) 

The output of the classification report

EDIT: After a useful comment, I used class_weight to insert in the dataset. However the accuracy of the data is much lower than the baseline of 0.76?

enter image description here

CodePudding user response:

The problem seems to be connected with unbalanced dataset. One class present much more in your dataset, so model learns to predict only that class. There are numerous ways to combat that, the simplest one is to adjust class_weight parameter in your model. In your situation it should be around class_weight = {0 : 3.2, 1 : 1} as second class is 3 times more popular. You can calculate that using compute_class_weight from sklearn.

Hope that was helpful!

  • Related