I'm trying to validate an ML model that performs Named Entity Recognition on some data. My question is about the way the F1-score is calculated. I'm using classification_report from sklearn:
print(preds[:2],labels[:2])
#output: [[5, 5, 5, 5, 5, 5, 5, 5, 1],[6, 2]] [[1, 5, 5, 5, 5, 5, 5, 5, 5, 5],[6, 2]]
Here we have the predicted ids followed by the true label ids. As we can see, the first token differs: in the prediction it has value "5", whereas in the true labels it has value "1". The way I see it, that token was wrongly classified. Next, in order to use the sklearn metrics, we have to convert our arrays with fit_transform from MultiLabelBinarizer, since we have more than two labels.
transformed_labels = MultiLabelBinarizer().fit_transform(labels)
transformed_preds = MultiLabelBinarizer().fit_transform(preds)
print(transformed_preds[:2], transformed_labels[:2])
#output: [[1 0 0 1 0],[0 1 0 0 1]] [[1 0 0 1 0],[0 1 0 0 1]]
This is the part I don't understand. This method only records which entities appear in each sequence; it doesn't care about the order or the number of occurrences of the labels.
labels = ['Date','Place','Org','Person','Event']
print(classification_report(transformed_labels, transformed_preds, target_names=labels))
              precision    recall  f1-score   support

        Date       0.69      0.92      0.82       122
       Place       0.90      0.94      0.93       195
         Org       0.76      0.85      0.78        79
      Person       0.99      0.98      0.98       434
       Event       0.81      0.69      0.73        55
In the end, the metrics are quite high, but I don't think they are accurate, since they were calculated only based on whether the predicted sequence contained at least one occurrence of each entity present in the true sequence.
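A quick, made-up check confirms this: two sequences that contain the same set of label ids, but in a different order and with different counts, binarize to identical rows, so the per-token differences disappear before classification_report ever sees them.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
print(mlb.fit_transform([[1, 5, 5, 5], [5, 5, 1, 1, 1]]))
#output: [[1 1],[1 1]] - only the set {1, 5} is encoded for each sequence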
Am I reading it wrong?
Another approach I saw:
def calc_precision(pred, true):
    # true positives / total predicted
    precision = len([x for x in pred if x in true]) / (len(pred) + 1e-20)
    return precision
Here, we are calculating the precision of the pred list against the true list. The function only checks whether each predicted label appears somewhere in the true label list; again, the number of occurrences and the order are not taken into account.
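To make the flaw concrete, here is a made-up example run against the (corrected) function above: a prediction that is wrong at every position still gets a perfect score, because the same label values occur somewhere in the true list.
true = [1, 5, 5, 5]
pred = [5, 1, 1, 1]  # wrong at every position
print(calc_precision(pred, true))
#output: 1.0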
What would be the correct way to calculate the F1-score in NER?
CodePudding user response:
You can use the F1 score for validation. You can call the function directly unless you are doing multilabel classification. It would be easier to help if you explained why there are two lists inside preds. You can refer to this documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
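A minimal sketch of that suggestion, assuming each preds[i] is token-aligned with labels[i] and has the same length (variable names follow the question):
from sklearn.metrics import f1_score
# flatten the per-sequence token ids into one long list each
flat_preds = [tag for seq in preds for tag in seq]
flat_labels = [tag for seq in labels for tag in seq]
# token-level, macro-averaged F1 over all label ids
print(f1_score(flat_labels, flat_preds, average='macro'))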
CodePudding user response:
I assume that you use the IOB2 tagging scheme for your dataset, and will answer with that scheme in mind. Since strict matching of entity types and boundaries is important in NER, you may arrange your prediction outputs and true labels as lists of lists and use the seqeval library.
A toy example of its use:
from seqeval.metrics import f1_score
y_true = [['B-PER', 'I-PER', 'O'], ['O', 'O', 'B-LOC']]
y_pred = [['B-PER', 'O', 'O'], ['O', 'O', 'B-LOC']]
f1_score(y_true, y_pred)
Considering the toy example, the F1 score should be 0.5: the PER entity is not predicted with the correct boundary, while the LOC entity is predicted correctly.
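To apply this to the numeric label ids from the question, you would first map the ids back to their IOB2 tag strings. The id2tag mapping below is purely illustrative and has to match your own label encoding, and each predicted sequence must be token-aligned with its true sequence:
from seqeval.metrics import classification_report, f1_score
id2tag = {1: 'B-Date', 5: 'I-Date', 6: 'B-Place', 2: 'I-Place'}  # illustrative mapping only
y_true = [[id2tag[i] for i in seq] for seq in labels]
y_pred = [[id2tag[i] for i in seq] for seq in preds]
print(classification_report(y_true, y_pred))
print(f1_score(y_true, y_pred))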