After years of reading, it is finally time for my first question:
Using TensorFlow and Keras in a Jupyter notebook, I trained a VGG16 model on 20k sound spectrograms (my own dataset), with a bit of data augmentation via a data generator, for a 4-class multiclass classification. Here is my code:
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16

model = VGG16(include_top=True,
              weights=None,
              input_tensor=None,
              pooling=None,
              classes=len(labels),
              classifier_activation="softmax")
# Use the tensorflow.keras import consistently (mixing keras and
# tensorflow.keras can cause subtle incompatibilities):
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import optimizers

# Rescale by 1/255, add data augmentation:
train_datagen = ImageDataGenerator(
    rescale=1./255,
    width_shift_range=0.2,
    brightness_range=[0.8, 1.2],
    fill_mode='nearest')
# Note that the validation data should not be augmented!
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    # This is the target directory
    train_dir,
    # All images will be resized to 224x224
    target_size=(224, 224),
    batch_size=20,
    # One-hot labels for multiclass
    class_mode='categorical')

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(224, 224),
    batch_size=20,
    class_mode='categorical')
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.RMSprop(learning_rate=2e-5),
              metrics=[tf.keras.metrics.CategoricalAccuracy(),
                       tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])

# Train the model:
history = model.fit(
    train_generator,
    steps_per_epoch=100,
    epochs=100,
    validation_data=validation_generator,
    validation_steps=50,
    verbose=2)
To evaluate the training process, I plotted accuracy, loss, precision, recall, and f1-score. All of them look good and indicate that training went well.
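For reference, this is roughly how I plot those curves from the history object (a sketch; the metric keys match the names Keras prints during training, and the f1-score is derived from precision and recall):

import numpy as np
import matplotlib.pyplot as plt

hist = history.history
epochs_range = range(1, len(hist['loss']) + 1)

# One plot per metric, training vs. validation:
for key in ['loss', 'categorical_accuracy', 'precision', 'recall']:
    plt.figure()
    plt.plot(epochs_range, hist[key], label='train')
    plt.plot(epochs_range, hist['val_' + key], label='validation')
    plt.title(key)
    plt.xlabel('epoch')
    plt.legend()
    plt.show()

# f1-score computed from the validation precision and recall:
p = np.array(hist['val_precision'])
r = np.array(hist['val_recall'])
f1 = 2 * p * r / (p + r + 1e-7)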
When I use model.evaluate on my test set, I get an accuracy of 91%:
test_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(224, 224),
    batch_size=20,
    class_mode='categorical')

test_loss, test_acc, test_precision, test_recall = model.evaluate(test_generator, steps=50)
print('test_acc: ' + str(test_acc))
Found 4724 images belonging to 4 classes.
50/50 [==============================] - 2s 49ms/step - loss: 0.2739 - categorical_accuracy: 0.9120 - precision: 0.9244 - recall: 0.9050
test_acc: 0.9120000004768372
But when I try to plot a confusion matrix in the following way, it looks horrible, and when I calculate the accuracy manually from the data I created the confusion matrix from, I get an accuracy of 25%. With 4 classes, that would mean my model learned absolutely nothing…
import numpy as np
import sklearn.metrics

# Print confusion matrix for the test set:
test_pred_raw = model.predict(test_generator)
print('raw predictions:')
print(test_pred_raw)

test_pred = np.argmax(test_pred_raw, axis=1)
print('prediction:')
print(test_pred)

test_labels = test_generator.classes
print('labels')
print(test_labels)

# Calculate accuracy manually:
my_test_acc = sum(test_pred == test_labels) / len(test_labels)
print('My_acc:')
print(my_test_acc)

# Calculate the confusion matrix using sklearn.metrics
# (plot_confusion_matrix is a helper defined elsewhere in my notebook):
cm = sklearn.metrics.confusion_matrix(test_labels, test_pred)
figure = plot_confusion_matrix(cm, class_names=labels)
raw predictions:
[[2.9204198e-12 2.8631955e-09 1.0000000e+00 7.3386294e-16]
 [1.1940503e-11 8.0026985e-11 1.0000000e+00 7.3565399e-16]
 [0.0000000e+00 1.0000000e+00 0.0000000e+00 0.0000000e+00]
 ...
 [2.2919695e-03 3.8061540e-07 9.9770677e-01 8.1024604e-07]
 [5.7501338e-35 1.0000000e+00 0.0000000e+00 0.0000000e+00]
 [0.0000000e+00 1.0000000e+00 4.0776377e-37 2.6318860e-38]]
prediction:
[2 2 1 ... 2 1 1]
labels
[0 0 0 ... 3 3 3]
My_acc:
0.2491532599491956
My question now is: which of the metrics can I trust, and what is wrong with the other one?
CodePudding user response:
The problem might be this line:
my_test_acc = sum(test_pred == test_labels) / len(test_labels)
Maybe you should add a rounding step to be sure the predicted values are really 1.0 and not 0.99.
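Something like this (a sketch of the rounding idea, reusing test_pred_raw from your code):

import numpy as np

# Round the raw probabilities before taking the argmax, so values
# like 0.99 are treated as exactly 1.0:
test_pred = np.argmax(np.round(test_pred_raw), axis=1)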
CodePudding user response:
Okay, I think I got it!
Setting shuffle=False in test_datagen.flow_from_directory() seems to solve the problem. Now the confusion matrix looks way better, and my_acc = 89% looks fine too.
It seems that with shuffling on, the generator yields the images in a different, shuffled order each time it is iterated, e.g. by model.predict(test_generator), while test_generator.classes always lists the labels in directory order. So the labels and predictions did not match because they were in different orders.
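For completeness, here is the fixed evaluation as I understand it (a sketch; same directories and plotting helper as above):

# Rebuild the test generator with shuffle=False so the order of the
# yielded images matches the directory order of test_generator.classes:
test_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(224, 224),
    batch_size=20,
    class_mode='categorical',
    shuffle=False)

test_pred_raw = model.predict(test_generator)
test_pred = np.argmax(test_pred_raw, axis=1)
test_labels = test_generator.classes  # now aligned with the predictions

my_test_acc = np.mean(test_pred == test_labels)
cm = sklearn.metrics.confusion_matrix(test_labels, test_pred)
figure = plot_confusion_matrix(cm, class_names=labels)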
Can someone confirm I got this right?