Using the same unseen data for validation and testing gives a large difference in performance metrics

Overview: While training a transfer-learning model (ResNet) with the test data passed as the validation data, the VALIDATION performance metrics for the last epoch (15) are: binary accuracy of 0.85, F1 of 0.84, precision of 0.84, and recall of 0.85. However, after training and fine-tuning, the model's predictions on the same test set yield a poor confusion matrix:

[[223, 277]
[233, 267]]
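
To put numbers on that matrix, the short sketch below derives accuracy, precision, and recall from it. It assumes scikit-learn's layout (rows are true labels, columns are predicted labels), which is what a confusion_matrix(y_true, y_pred) call produces.

```python
import numpy as np

# Confusion matrix reported in the question; assumed layout is
# [[TN, FP], [FN, TP]] (rows = true class, columns = predicted class).
confusion = np.array([[223, 277],
                      [233, 267]])

tn, fp = confusion[0]
fn, tp = confusion[1]

accuracy = (tp + tn) / confusion.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

An accuracy of 0.49 on a balanced set of 1000 images is what random guessing would give, far from the 0.85 reported during validation.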

Training Data: Perfectly balanced with 2000 positive and 2000 negative samples of retinal fundus images.

Validation/Testing Data: Perfectly balanced with 500 positive and 500 negative samples of retinal fundus images.

Data Splitting:

train: 2 column dataframe with 4000 retinal fundus images paths and 4000 label types
test: 2 column dataframe with 1000 retinal fundus images paths and 1000 label types

Generator Code:

from tensorflow.keras.applications import ResNet50V2
from tensorflow.keras.applications.resnet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

target = 512

trainDataGen = ImageDataGenerator(preprocessing_function=preprocess_input, rotation_range=30, horizontal_flip=True, vertical_flip=False, shear_range=0.2, zoom_range=0.2, brightness_range=(0.8, 1.2))
trainGen = trainDataGen.flow_from_dataframe(dataframe=train, batch_size=16, shuffle=True, x_col="fundus", y_col="types", class_mode="binary", validate_filenames=True, target_size=(target, target), directory=None, color_mode='rgb')

testDataGen = ImageDataGenerator(preprocessing_function=preprocess_input)
testGen = testDataGen.flow_from_dataframe(dataframe=test, x_col="fundus", y_col="types", class_mode="binary", validate_filenames=True, target_size=(target, target), directory=None, color_mode='rgb')

Example Model:

from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Model

base_model = ResNet50V2(weights='imagenet', include_top=False, input_shape=(target,target,3))

for layer in base_model.layers[:-4]:
    layer.trainable = False
for layer in base_model.layers[-4:]:
    layer.trainable = True

x = Flatten()(base_model.output)
x = Dropout(0.75)(x)
x = Dense(512, activation='relu')(x)
predictions = Dense(1, activation='sigmoid')(x)

model = Model(inputs=base_model.input, outputs=predictions)

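# f1_m, precision_m, and recall_m are custom metric functions whose definitions are not shown here.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['binary_accuracy', f1_m, precision_m, recall_m])
# model.fit accepts generators in TensorFlow 2.1+; fit_generator is deprecated.
history = model.fit(trainGen, class_weight={0: 1, 1: 1}, epochs=15, validation_freq=1, validation_data=testGen)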

Training Log Sample (prior to fine tuning):

Epoch 1/5
250/250 [==============================] - 643s 3s/step - loss: 1.6509 - binary_accuracy: 0.7312 - f1_m: 0.7095 - precision_m: 0.7429 - recall_m: 0.7319 - val_loss: 0.3843 - val_binary_accuracy: 0.8030 - val_f1_m: 0.7933 - val_precision_m: 0.8162 - val_recall_m: 0.7771
Epoch 2/5
250/250 [==============================] - 607s 2s/step - loss: 0.4188 - binary_accuracy: 0.8090 - f1_m: 0.7974 - precision_m: 0.8190 - recall_m: 0.8027 - val_loss: 0.3680 - val_binary_accuracy: 0.8160 - val_f1_m: 0.8118 - val_precision_m: 0.7705 - val_recall_m: 0.8718
Epoch 3/5
250/250 [==============================] - 606s 2s/step - loss: 0.3929 - binary_accuracy: 0.8278 - f1_m: 0.8113 - precision_m: 0.8542 - recall_m: 0.7999 - val_loss: 0.5049 - val_binary_accuracy: 0.7720 - val_f1_m: 0.8026 - val_precision_m: 0.7053 - val_recall_m: 0.9474
Epoch 4/5
250/250 [==============================] - 602s 2s/step - loss: 0.3491 - binary_accuracy: 0.8465 - f1_m: 0.8342 - precision_m: 0.8836 - recall_m: 0.8129 - val_loss: 0.3410 - val_binary_accuracy: 0.8350 - val_f1_m: 0.8425 - val_precision_m: 0.8038 - val_recall_m: 0.8948
Epoch 5/5
250/250 [==============================] - 617s 2s/step - loss: 0.3321 - binary_accuracy: 0.8480 - f1_m: 0.8335 - precision_m: 0.8705 - recall_m: 0.8187 - val_loss: 0.3538 - val_binary_accuracy: 0.8530 - val_f1_m: 0.8440 - val_precision_m: 0.9173 - val_recall_m: 0.7881

Model Evaluation:

from sklearn.metrics import confusion_matrix
import numpy as np
y_true = np.asarray(testGen.classes)
prediction = model.predict(testGen, verbose=1)
confusion = confusion_matrix(y_true, np.rint(prediction))

Summary: Since the validation and test data were the same, I expected similar results. However, the large performance difference and poor confusion matrix are confusing :). Assuming the code is error-free, should this be expected when using the same data for validation and testing (despite both being unseen)?

User response:

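flow_from_dataframe defaults to shuffle=True, which should not be used for validation or testing data generators. During training, Keras computes the validation metrics batch by batch, with each batch's labels alongside its images, so shuffling does not affect them. In your evaluation code, however, testGen.classes returns the labels in dataframe order while model.predict(testGen) returns predictions in the generator's shuffled order, so the two arrays no longer line up and the confusion matrix looks like random guessing. Specify shuffle=False for testGen (not trainGen, which should stay shuffled), and the confusion matrix will reflect the model's actual performance.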
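
The effect can be reproduced without Keras. The sketch below is only a simulation under stated assumptions: a hypothetical classifier that is right about 85% of the time, 500 negative and 500 positive labels as in the question, and a random permutation standing in for the order in which a shuffled generator would yield samples. None of the names come from the original code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labels in dataframe order, analogous to what testGen.classes returns.
y_true = np.array([0] * 500 + [1] * 500)

# Hypothetical classifier: flips about 15% of the labels.
flip = rng.random(y_true.size) < 0.15
y_pred_in_order = np.where(flip, 1 - y_true, y_true)

# A shuffled generator yields samples in random order, so the
# predictions come back permuted relative to y_true.
order = rng.permutation(y_true.size)
y_pred_shuffled = y_pred_in_order[order]


def confusion(y_true, y_pred):
    """2x2 confusion matrix: rows = true class, columns = predicted class."""
    m = np.zeros((2, 2), dtype=int)
    np.add.at(m, (y_true, y_pred), 1)
    return m


def accuracy(m):
    return np.trace(m) / m.sum()


aligned = confusion(y_true, y_pred_in_order)
misaligned = confusion(y_true, y_pred_shuffled)

print("aligned:\n", aligned, "\naccuracy:", accuracy(aligned))
print("misaligned:\n", misaligned, "\naccuracy:", accuracy(misaligned))
```

The predictions are identical in both cases; only their order differs. Comparing them against labels in the wrong order is enough to pull the matrix toward the roughly even split seen in the question.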