TensorFlow multi-label accuracy metrics

Time:11-08

I am currently using TensorFlow for a multi-label classification problem (9 labels in total), and this is the model compile line:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The y_true labels of my model consist of two 1's and seven 0's (e.g., [0,1,0,0,0,1,0,0,0]).

I tried several models in TensorFlow, but no matter how complicated the model was, the accuracy has stayed very poor, around 0.3.

I was wondering whether the Keras accuracy metric also works in the case of multi-label classification. For example, if y_pred has probability values of [0.1, 0.9, 0.3, 0.4, 0.5, 0.4, 0.3, 0.2, 0.1], does Keras pick out the top 2 probabilities from y_pred, convert them into binary labels of [0, 1, 0, 0, 1, 0, 0, 0, 0], and then compare them against the y_true labels?

If not, do I have to implement my own metrics function?

Thanks in advance!

CodePudding user response:

In general, accuracy shows what portion of predicted labels match original labels.

As stated in the official documentation:

This metric creates two local variables, total and count, that are used to compute the frequency with which y_pred matches y_true. This frequency is ultimately returned as binary accuracy: an idempotent operation that simply divides total by count.

In other words, this metric shows the portion of (thresholded) probability predictions that equal the true labels.

The tf.keras.metrics.BinaryAccuracy rounds the predicted probabilities by a given threshold (0.5 by default). So, if the model output is [0.9, 0.3, 0.6], it will be rounded to [1, 0, 1] before comparing to the true labels.
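This thresholding behavior can be verified directly (the vectors here are the illustrative ones from above, not from the question):

```python
import numpy as np
import tensorflow as tf

y_true = np.array([[1.0, 0.0, 1.0]])
y_pred = np.array([[0.9, 0.3, 0.6]])  # thresholded at 0.5 -> [1, 0, 1]

metric = tf.keras.metrics.BinaryAccuracy(threshold=0.5)
metric.update_state(y_true, y_pred)
print(metric.result().numpy())  # all three positions match the labels -> 1.0
```

Note that the comparison is element-wise over every label position, not per sample.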

However, the accuracy metric counts not only the positions where predicted 1's match true 1's, but also where 0's match 0's. In most multi-label problems this is misleading, because there is an imbalance between 1's and 0's in the data. In your case, you have 3.5 times more 0's than 1's. If your model outputs only zeros, it will indeed be a bad model, but since 7/9 of the true labels are zeros, it will still score 7/9, or almost 78%, accuracy right off the bat.

I suggest using other metrics for multi-label classification:

  1. Precision, which shows what portion of predicted 1's are actually 1's
  2. Recall, which shows what portion of actual 1's are "found" and predicted as 1's
  3. F1 and F-beta, which combine precision and recall into a single score

Individually, those metrics will not give you much information on the performance of the model, so it's better to use them combined. You can read more about them in sklearn documentation and in this article.

As for implementations, there are official implementations of Precision and Recall in TensorFlow, and TensorFlow Addons has F1Score and FBetaScore.
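A minimal sketch of wiring these metrics into compile(); the architecture below is purely illustrative, and the F1Score line assumes you have TensorFlow Addons installed:

```python
import tensorflow as tf

# Illustrative model: 20 input features, 9 independent sigmoid outputs
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(9, activation='sigmoid'),
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=[
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall'),
        # With TensorFlow Addons installed, you could also add:
        # tfa.metrics.F1Score(num_classes=9, threshold=0.5),
    ],
)
```

Both Precision and Recall apply the same 0.5 threshold by default, so they evaluate the same hard predictions that BinaryAccuracy would, just restricted to the positive class.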
