I have implemented logistic regression on bank loan data. I used GridSearchCV for hyperparameter tuning and ran the search with several fold counts, cv = [3, 5, 6]. This is my code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#from google.colab import files
import io
import warnings
warnings.filterwarnings('ignore')
#uploaded = files.upload()
df = pd.read_csv('CleanedLoanData13Cols.csv')
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
X = df.drop('loan_status', axis=1, inplace=False)
y = df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 4)
parameters = {'penalty': ['l1', 'l2', 'elasticnet'],
              'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
              'solver': ['liblinear', 'newton-cg', 'lbfgs', 'saga', 'sag'],
              'multi_class': ['auto'],
              'max_iter': [5, 15, 25]
              }
import warnings
warnings.filterwarnings("ignore")
cv_folds = [3, 5, 6]
s_scaler = StandardScaler()
#m_scaler = MinMaxScaler()
#r_scaler = RobustScaler()
s_scaled_X_train = s_scaler.fit_transform(X_train)
s_scaled_X_test = s_scaler.transform(X_test)
for x in cv_folds:
    logmodel = GridSearchCV(LogisticRegression(random_state = 42), parameters, cv = x, scoring = 'accuracy', refit = True)
    logmodel.fit(X_train, y_train)
    print('The best score with CV =', x, 'is', logmodel.score(X_test, y_test), 'with parameters =\n\n', logmodel.best_params_, '\n\n')
The output (first issue: this doesn't look right to me; correct me if I'm wrong):
The best score with CV = 3 is 0.929636746271388 with parameters =
{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}
The best score with CV = 5 is 0.929636746271388 with parameters =
{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}
The best score with CV = 6 is 0.929636746271388 with parameters =
{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}
continuation
results = logmodel.cv_results_
print(results.get('params'))
print(results.get('mean_test_score'))
output:
[0.9084348 nan nan 0.8323203 nan 0.83239873
0.83671225 0.8323203 0.8323203 0.8323203 nan nan
nan nan nan 0.91647373 nan nan
0.8323203 nan 0.902435 0.89474906 0.8520445 0.8323203 and so on
continuation:
print(results.get('mean_train_score'))
output: None
print(logmodel.best_params_)
{'C': 0.001, 'max_iter': 25, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'liblinear'}
print(logmodel.best_score_)
output: 0.9226303384209481 (I think something is wrong here too, because this does not match the accuracy in the classification report)
final_model = logmodel.best_estimator_
s_predictions = final_model.predict(s_scaled_X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, s_predictions))
print(confusion_matrix(y_test, s_predictions))
output (the accuracy here is 0.62, whereas the score at the top is 0.92):
              precision    recall  f1-score   support

           0       0.88      0.64      0.74      9197
           1       0.22      0.53      0.31      1732

    accuracy                           0.62     10929
   macro avg       0.55      0.59      0.53     10929
weighted avg       0.77      0.62      0.67     10929
[[5902 3295]
[ 812 920]]
I don't know where I went wrong. I have been struggling with this for the last few hours and can't work out what the mistake is. I would be really thankful for any input.
CodePudding user response:
The problem is that you are fitting your model on the unscaled data, X_train and y_train:
logmodel.fit(X_train, y_train)
Then you predict on the scaled data, s_scaled_X_test, which explains the drop in performance:
s_predictions = final_model.predict(s_scaled_X_test)
To fix that, train your model on the scaled data (and score it on s_scaled_X_test rather than X_test) as follows:
logmodel.fit(s_scaled_X_train, y_train)
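One way to make this kind of mismatch harder to reintroduce is to put the scaler and the classifier in a single Pipeline and tune that, so fit, predict and score always apply the same transformation. The following is only a sketch under assumptions: the data is a synthetic stand-in generated with make_classification (not your CSV), and the step names "scaler" and "clf" and the grid values are illustrative choices. The grid is limited to penalty and solver combinations that liblinear supports, since unsupported combinations fail to fit and show up as nan in mean_test_score. Setting return_train_score=True is what makes mean_train_score appear in cv_results_; it is off by default, which is why you got None.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (assumption: not the real CSV).
X, y = make_classification(n_samples=600, n_features=12,
                           weights=[0.85, 0.15], random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4)

# The scaler is part of the estimator, so it is re-fitted on each CV training
# fold and applied automatically whenever the pipeline predicts or scores.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(random_state=42, max_iter=1000)),
])

# Illustrative grid: only penalties that the liblinear solver supports.
param_grid = {
    "clf__solver": ["liblinear"],
    "clf__penalty": ["l1", "l2"],
    "clf__C": [0.01, 0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy",
                      refit=True, return_train_score=True)
search.fit(X_train, y_train)  # pass the raw features; the pipeline scales them

# Both of these go through the same fitted scaler, so they agree.
test_accuracy = search.score(X_test, y_test)
predictions = search.predict(X_test)
manual_accuracy = (predictions == y_test).mean()

print(search.best_params_)
print("Mean CV accuracy:", search.best_score_)
print("Test accuracy:", test_accuracy)
```

Note that best_score_ is the mean cross-validated accuracy on the training folds, while score(X_test, y_test) is the accuracy on the held-out test set, so a small difference between the two is expected even when everything is wired correctly.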