Multiple problems with Logistic Regression (1. all CV values have the same score, 2. classification report accuracy doesn't match best_score_)


I have implemented logistic regression on bank loan data. I used GridSearchCV for hyperparameter tuning and ran the search with multiple k-fold values, cv_folds = [3, 5, 6]. This is my code:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#from google.colab import files
import io

import warnings
warnings.filterwarnings('ignore')
#uploaded = files.upload()

df = pd.read_csv('CleanedLoanData13Cols.csv')

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

X = df.drop('loan_status', axis=1, inplace=False)
y = df['loan_status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 4)
parameters = {'penalty': ['l1', 'l2','elasticnet'],
                  'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                  'solver' : ['liblinear', 'newton-cg', 'lbfgs', 'saga', 'sag'],
                  'multi_class' : ['auto'],
                  'max_iter'    : [5,15,25]
                 }

cv_folds = [3, 5, 6]
s_scaler = StandardScaler()
#m_scaler = MinMaxScaler()
#r_scaler = RobustScaler()
s_scaled_X_train = s_scaler.fit_transform(X_train)
s_scaled_X_test = s_scaler.transform(X_test)

for x in cv_folds:
    logmodel = GridSearchCV(LogisticRegression(random_state = 42), parameters, cv = x, scoring = 'accuracy', refit = True)
    logmodel.fit(X_train, y_train)
    
    print('The best score with CV =', x, 'is', logmodel.score(X_test, y_test), 'with parameters =\n\n', logmodel.best_params_, '\n\n')

The output (first issue: this didn't seem right to me, so correct me if I'm wrong):

The best score with CV = 3 is 0.929636746271388 with parameters =

 {'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'} 

The best score with CV = 5 is 0.929636746271388 with parameters =

 {'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'} 


The best score with CV = 6 is 0.929636746271388 with parameters =

 {'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'} 

Continuation:

results = logmodel.cv_results_

print(results.get('params'))

print(results.get('mean_test_score'))

output:

[0.9084348         nan        nan 0.8323203         nan 0.83239873
 0.83671225 0.8323203  0.8323203  0.8323203         nan        nan
        nan        nan        nan 0.91647373        nan        nan
 0.8323203         nan 0.902435   0.89474906 0.8520445  0.8323203 and so on

Continuation:

print(results.get('mean_train_score'))

output: None

print(logmodel.best_params_)

{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}

print(logmodel.best_score_)

output: 0.9226303384209481 (I think there is something wrong here too, because this and the accuracy in the classification report don't match)

final_model = logmodel.best_estimator_

s_predictions = final_model.predict(s_scaled_X_test)

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, s_predictions))
print(confusion_matrix(y_test, s_predictions))

output: accuracy here is 0.62, whereas at the top it is 0.92

              precision    recall  f1-score   support

           0       0.88      0.64      0.74      9197
           1       0.22      0.53      0.31      1732

    accuracy                           0.62     10929
   macro avg       0.55      0.59      0.53     10929
weighted avg       0.77      0.62      0.67     10929

[[5902 3295]
 [ 812  920]]

I don't know where I went wrong. I have been banging my head against this for the last few hours and still can't see my mistake. I would really be thankful if anyone gave their input on this.

CodePudding user response:

The problem here is that you are fitting your model on the unscaled data X_train, y_train:

logmodel.fit(X_train, y_train)

Then you predict on the scaled data s_scaled_X_test, which explains the drop in performance:

s_predictions = final_model.predict(s_scaled_X_test)

To fix that, you should train your model using the scaled data as follows:

logmodel.fit(s_scaled_X_train, y_train)
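
For completeness, here is a minimal sketch of one way to keep the scaling consistent everywhere, by wrapping the scaler and the classifier in a scikit-learn Pipeline so the scaler is re-fit on the training folds during cross-validation and re-applied automatically at predict time. The reduced parameter grid below is an illustrative assumption, not your original one:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('CleanedLoanData13Cols.csv')
X = df.drop('loan_status', axis=1)
y = df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(random_state=42, max_iter=1000))])

# Illustrative grid; parameters of a pipeline step are prefixed with the step name.
parameters = {'clf__penalty': ['l1', 'l2'],
              'clf__C': [0.01, 0.1, 1, 10],
              'clf__solver': ['liblinear']}  # liblinear supports both l1 and l2

logmodel = GridSearchCV(pipe, parameters, cv=5, scoring='accuracy', refit=True)
logmodel.fit(X_train, y_train)           # scaling now happens inside each fold

predictions = logmodel.predict(X_test)   # the pipeline scales X_test for you
print(classification_report(y_test, predictions))

With the pipeline in place, best_score_ and the test-set classification report are computed on consistently preprocessed data, so the large gap between 0.92 and 0.62 should disappear. As an aside, the nan entries in cv_results_ come from invalid penalty/solver combinations (for example, penalty='elasticnet' is only supported by the saga solver), and mean_train_score is None simply because GridSearchCV defaults to return_train_score=False.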