Doubts in accuracy assessment of binary classification


I am doing binary classification using a Keras Sequential() model, and I have some doubts about its accuracy assessment.

I am calculating the AUC-ROC for it. Should I use the predicted probabilities or the predicted classes for this?

Explanation:

After training the model, I call model.predict() to get predictions for the training and validation data (code below).

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_pred_train = model.predict(x_train_df).ravel()
y_pred_val = model.predict(x_val_df).ravel()

fpr_train, tpr_train, thresholds_roc_train = roc_curve(y_train_df, y_pred_train, pos_label=None)
fpr_val, tpr_val, thresholds_roc_val = roc_curve(y_val_df, y_pred_val, pos_label=None)

roc_auc_train = auc(fpr_train, tpr_train)
roc_auc_val = auc(fpr_val, tpr_val)

plt.figure()
lw = 2
plt.plot(fpr_train, tpr_train, color='darkgreen',lw=lw, label='ROC curve Training (area = %0.2f)' % roc_auc_train)
plt.plot(fpr_val, tpr_val, color='darkorange',lw=lw, label='ROC curve Validation (area = %0.2f)' % roc_auc_val)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--',label='Base line')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

This produces the ROC plot; the training and validation AUCs are 0.76 and 0.76.
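As a side note, I believe the same areas can also be computed directly from the probabilities with sklearn's roc_auc_score (a minimal sketch reusing the variables above; the *_check names are just for illustration):

from sklearn.metrics import roc_auc_score

# should match auc(fpr, tpr) above, since roc_auc_score also expects scores/probabilities
roc_auc_train_check = roc_auc_score(y_train_df, y_pred_train)
roc_auc_val_check = roc_auc_score(y_val_df, y_pred_val)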

model.predict() gives the probabilities, not the actual predicted classes, so I changed the first two lines of the code above to produce class labels instead:

y_pred_train = (model.predict(x_train_df).ravel() > 0.5).astype("int32")
y_pred_val = (model.predict(x_test_df).ravel() > 0.5).astype("int32")

So this now calculates the AUC-ROC from the class values (I guess), but the values I get this way are very different and much lower: the training and validation AUCs are 0.66 and 0.46.

Which of these two approaches is correct, and why are the resulting values so different?

CodePudding user response:

The ROC curve is normally created by plotting sensitivity (TPR) against the false positive rate (FPR, i.e. 1 - specificity) while varying the classification threshold from 0.0 to 1.0. See for instance https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc. Some pseudo code to get you started:

import numpy as np
import matplotlib.pyplot as plt

pred_proba = model.predict(x_train_df).ravel()

# assuming you have a truth array of 0/1 classifications
truth = np.asarray(y_train_df).astype(bool)

tpr_list, fpr_list = [], []
for thresh in np.arange(0.0, 1.01, 0.1):
    pred = pred_proba > thresh          # boolean predictions at the current threshold

    # confusion-matrix counts: true/false positives and negatives
    tp = np.count_nonzero(truth & pred)
    fp = np.count_nonzero(~truth & pred)
    fn = np.count_nonzero(truth & ~pred)
    tn = np.count_nonzero(~truth & ~pred)

    # sensitivity (TPR) and false positive rate (FPR) for the current threshold
    tpr_list.append(tp / (tp + fn))
    fpr_list.append(fp / (fp + tn))

# now you can plot the (FPR, TPR) point for each threshold value
plt.plot(fpr_list, tpr_list, marker='o')
plt.show()
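In practice, sklearn's roc_curve/roc_auc_score perform this threshold sweep for you when given the predicted probabilities. Passing hard 0/1 class labels instead leaves only a single non-trivial threshold, so the curve collapses to one operating point and the area usually drops, which is likely what you are seeing. A minimal self-contained sketch (the labels and scores below are synthetic, purely for illustration):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                           # synthetic 0/1 labels
y_prob = np.clip(0.3 * y_true + 0.7 * rng.random(1000), 0, 1)    # synthetic scores correlated with the labels

auc_from_probs = roc_auc_score(y_true, y_prob)                             # full threshold sweep
auc_from_classes = roc_auc_score(y_true, (y_prob > 0.5).astype("int32"))   # single operating point

print(auc_from_probs, auc_from_classes)  # the second is typically noticeably lower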