After years of reading, it is finally time for my first question:
Using TensorFlow and Keras in a Jupyter notebook, I trained a VGG16 model on 20k sound spectrograms (my own dataset), with a bit of data augmentation via a data generator, for a 4-class multiclass classification. Here is my code:
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16

model = VGG16(include_top=True,
              weights=None,
              input_tensor=None,
              pooling=None,
              classes=len(labels),
              classifier_activation="softmax")
# Use the tensorflow.keras import consistently (mixing keras and
# tensorflow.keras can cause subtle incompatibilities):
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import optimizers

# Rescale by 1/255, add data augmentation:
train_datagen = ImageDataGenerator(
    rescale=1./255,
    width_shift_range=0.2,
    brightness_range=[0.8, 1.2],
    fill_mode='nearest')
# Note that the validation data should not be augmented!
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    # This is the target directory
    train_dir,
    # All images will be resized to 224x224
    target_size=(224, 224),
    batch_size=20,
    # One-hot labels for multiclass
    class_mode='categorical')

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(224, 224),
    batch_size=20,
    class_mode='categorical')
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.RMSprop(learning_rate=2e-5),
              metrics=[tf.keras.metrics.CategoricalAccuracy(),
                       tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])

# Train the model:
history = model.fit(
    train_generator,
    steps_per_epoch=100,
    epochs=100,
    validation_data=validation_generator,
    validation_steps=50,
    verbose=2)
To evaluate the training process, I plotted accuracy, loss, precision, recall, and f1-score. All of them look good and indicate that training went well.
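For reference, this is roughly how I plot those curves from the history object (a sketch; the metric keys match the names Keras prints during training, and the f1-score is derived from precision and recall):

import numpy as np
import matplotlib.pyplot as plt

hist = history.history
epochs_range = range(1, len(hist['loss']) + 1)

# One plot per metric, training vs. validation:
for key in ['loss', 'categorical_accuracy', 'precision', 'recall']:
    plt.figure()
    plt.plot(epochs_range, hist[key], label='train')
    plt.plot(epochs_range, hist['val_' + key], label='validation')
    plt.title(key)
    plt.xlabel('epoch')
    plt.legend()
    plt.show()

# f1-score computed from the validation precision and recall:
p = np.array(hist['val_precision'])
r = np.array(hist['val_recall'])
f1 = 2 * p * r / (p + r + 1e-7)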
When I use model.evaluate on my test set, I get an accuracy of 91%:
test_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(224, 224),
    batch_size=20,
    class_mode='categorical')

test_loss, test_acc, test_precision, test_recall = model.evaluate(test_generator, steps=50)
print('test_acc: ' + str(test_acc))
Found 4724 images belonging to 4 classes.
50/50 [==============================] - 2s 49ms/step - loss: 0.2739 - categorical_accuracy: 0.9120 - precision: 0.9244 - recall: 0.9050
test_acc: 0.9120000004768372
But when I try to plot a confusion matrix in the following way, it looks horrible, and when I calculate the accuracy manually from the data I created the confusion matrix from, I get an accuracy of 25%. With 4 classes, that would mean my model learned absolutely nothing…
import numpy as np
import sklearn.metrics

# Print confusion matrix for the test set:
test_pred_raw = model.predict(test_generator)
print('raw predictions:')
print(test_pred_raw)

test_pred = np.argmax(test_pred_raw, axis=1)
print('prediction:')
print(test_pred)

test_labels = test_generator.classes
print('labels')
print(test_labels)

# Calculate accuracy manually:
my_test_acc = sum(test_pred == test_labels) / len(test_labels)
print('My_acc:')
print(my_test_acc)

# Calculate the confusion matrix using sklearn.metrics
# (plot_confusion_matrix is a helper defined elsewhere in my notebook):
cm = sklearn.metrics.confusion_matrix(test_labels, test_pred)
figure = plot_confusion_matrix(cm, class_names=labels)
raw predictions:
[[2.9204198e-12 2.8631955e-09 1.0000000e+00 7.3386294e-16]
 [1.1940503e-11 8.0026985e-11 1.0000000e+00 7.3565399e-16]
 [0.0000000e+00 1.0000000e+00 0.0000000e+00 0.0000000e+00]
 ...
 [2.2919695e-03 3.8061540e-07 9.9770677e-01 8.1024604e-07]
 [5.7501338e-35 1.0000000e+00 0.0000000e+00 0.0000000e+00]
 [0.0000000e+00 1.0000000e+00 4.0776377e-37 2.6318860e-38]]
prediction:
[2 2 1 ... 2 1 1]
labels
[0 0 0 ... 3 3 3]
My_acc:
0.2491532599491956
My question now is: which of the metrics can I trust, and what is wrong with the other one?
CodePudding user response:
The problem might be this line:
my_test_acc = sum(test_pred == test_labels) / len(test_labels)
Maybe you should add a rounding step to be sure the predicted values are really 1.0 and not 0.99.
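Something like this (a sketch of the rounding idea, reusing test_pred_raw from your code):

import numpy as np

# Round the raw probabilities before taking the argmax, so values
# like 0.99 are treated as exactly 1.0:
test_pred = np.argmax(np.round(test_pred_raw), axis=1)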
CodePudding user response:
Okay, I think I got it!
Setting shuffle=False in test_datagen.flow_from_directory() seems to solve the problem. Now the confusion matrix looks way better, and my_acc = 89% looks fine too.
It seems that with shuffling on, the generator yields the images in a different, shuffled order each time it is iterated, e.g. by model.predict(test_generator), while test_generator.classes always lists the labels in directory order. So the labels and predictions did not match because they were in different orders.
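For completeness, here is the fixed evaluation as I understand it (a sketch; same directories and plotting helper as above):

# Rebuild the test generator with shuffle=False so the order of the
# yielded images matches the directory order of test_generator.classes:
test_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(224, 224),
    batch_size=20,
    class_mode='categorical',
    shuffle=False)

test_pred_raw = model.predict(test_generator)
test_pred = np.argmax(test_pred_raw, axis=1)
test_labels = test_generator.classes  # now aligned with the predictions

my_test_acc = np.mean(test_pred == test_labels)
cm = sklearn.metrics.confusion_matrix(test_labels, test_pred)
figure = plot_confusion_matrix(cm, class_names=labels)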
Can someone confirm I got this right?