I was trying to train a model with spaCy. I have strings and their token offsets saved in a JSON file. I read that file using utf-8 encoding and there are no special characters in it, but it raises TypeError: object of type 'NoneType' has no len().
# code for reading the file
import json

with open("data/results.json", "r", encoding="utf-8") as file:
    training_data = json.loads(file.read())
I have also tried changing alignment_mode from strict to contract and expand. The expand mode runs but produces incorrect spans.
span = doc.char_span(start, end, label, alignment_mode="contract")
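Here is a minimal reproduction for the failing "Cuba" annotation (characters 87-91 of the example string below):

import spacy

nlp = spacy.blank("en")
text = ("Department of Chemistry,Central University of Las Villas,"
        "Santa Clara,Villa Clara,54830,Cuba.")
doc = nlp(text)

for mode in ("strict", "contract", "expand"):
    span = doc.char_span(87, 91, "country_name", alignment_mode=mode)
    print(mode, "->", span)

# On my setup this prints:
# strict -> None
# contract -> None
# expand -> Clara,54830,Cuba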
The code that I'm using:
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()

training_dataset = [[
    "Department of Chemistry,Central University of Las Villas,Santa Clara,Villa Clara,54830,Cuba.",
    [
        [57, 68, "city_name"],
        [87, 91, "country_name"],
    ],
]]

for text, annotations in training_dataset:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label)
        ents.append(span)
    doc.ents = ents
    db.add(doc)
I have pasted the JSON object read from the file directly into the program for debugging purposes. When I remove the 54830, part, the program runs successfully.
I have also referred to this issue, but that one involves a special character, and this string doesn't contain any. Does anyone know why this happens with every string that contains a number?
CodePudding user response:
The error TypeError: object of type 'NoneType' has no len() occurs in the line doc.ents = ents when one of the entries in ents is None.
The reason for the None in the list is that doc.char_span(start, end, label) returns None when the start and end provided don't align with token boundaries.
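One way to see which annotation is misaligned is to check for None before assigning doc.ents. A minimal sketch using the data from the question:

import spacy

nlp = spacy.blank("en")
text = "Department of Chemistry,Central University of Las Villas,Santa Clara,Villa Clara,54830,Cuba."
annotations = [[57, 68, "city_name"], [87, 91, "country_name"]]

doc = nlp(text)
ents = []
for start, end, label in annotations:
    span = doc.char_span(start, end, label)
    if span is None:
        # Report the offending annotation instead of appending None
        print(f"Misaligned entity {label!r}: chars [{start}:{end}] = {text[start:end]!r}")
    else:
        ents.append(span)
doc.ents = ents  # safe now: no None entries

This prints: Misaligned entity 'country_name': chars [87:91] = 'Cuba'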
The tokenizer of the model (spacy.blank("en")) doesn't behave as needed for this use case. It seems it doesn't produce a token boundary at a comma that follows a number when there is no space after the comma.
Examples:
Tokenizing a number with decimals:
>>> import spacy
>>> nlp = spacy.blank("en")
>>> nlp.tokenizer.explain("5,1")
[('TOKEN', '5,1')]
One single token.
Tokenizing a number comma letter:
>>> nlp.tokenizer.explain("5,a")
[('TOKEN', '5,a')]
One single token.
Tokenizing a letter comma letter:
>>> nlp.tokenizer.explain("a,a")
[('TOKEN', 'a'), ('INFIX', ','), ('TOKEN', 'a')]
Three tokens.
Tokenizing a number comma space letter:
>>> nlp.tokenizer.explain("5, a")
[('TOKEN', '5'), ('SUFFIX', ','), ('TOKEN', 'a')]
Three tokens.
Tokenizing a number comma space number:
>>> nlp.tokenizer.explain("5, 1")
[('TOKEN', '5'), ('SUFFIX', ','), ('TOKEN', '1')]
Three tokens.
Therefore, with the default tokenizer, a space is needed after a comma that follows a number for the comma to create a token boundary.
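Applying the same check to the problematic chunk of the question's text confirms this: everything between the spaces stays a single token (plus the trailing period as a suffix), so the offsets for Cuba (87-91) can't align with a token start:
>>> nlp.tokenizer.explain("Clara,54830,Cuba.")
[('TOKEN', 'Clara,54830,Cuba'), ('SUFFIX', '.')]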
Workarounds:
- Preprocess your text to add a space after the commas you want tokens to split on. This also requires updating the start and end values of the annotations.
- Create a custom tokenizer as described in the spaCy documentation: https://spacy.io/usage/linguistic-features#native-tokenizers. A lighter-weight variant that only extends the default infix rules is sketched below.
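For the second workaround, a full custom tokenizer isn't strictly necessary: spaCy also lets you extend the default infix rules. A minimal sketch, where the two extra regexes (splitting on a comma between a digit and a letter) are my own additions, not part of the spaCy defaults:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Extra infix patterns: split on a comma between a digit and a letter,
# in either direction. These regexes are custom, not spaCy defaults.
extra_infixes = [r"(?<=[0-9]),(?=[A-Za-z])", r"(?<=[A-Za-z]),(?=[0-9])"]
infixes = list(nlp.Defaults.infixes) + extra_infixes
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

text = ("Department of Chemistry,Central University of Las Villas,"
        "Santa Clara,Villa Clara,54830,Cuba.")
doc = nlp(text)
print(doc.char_span(87, 91, "country_name"))  # strict alignment now succeeds: Cuba

This approach has the advantage that the text itself is unchanged, so the character offsets in the annotations from the JSON file remain valid as-is.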