I was trying to train a model with spaCy. I have strings and their token offsets saved in a JSON file. I read that file using utf-8 encoding and there are no special characters in it, but it raises TypeError: object of type 'NoneType' has no len().
# code for reading the file
import json

with open("data/results.json", "r", encoding="utf-8") as file:
    training_data = json.loads(file.read())
I have also tried changing alignment_mode from strict to contract and expand. The expand mode runs but produces incorrect spans.
span = doc.char_span(start, end, label, alignment_mode="contract")
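Here is a minimal reproduction for the failing "Cuba" annotation (characters 87-91 of the example string below):

import spacy

nlp = spacy.blank("en")
text = ("Department of Chemistry,Central University of Las Villas,"
        "Santa Clara,Villa Clara,54830,Cuba.")
doc = nlp(text)

for mode in ("strict", "contract", "expand"):
    span = doc.char_span(87, 91, "country_name", alignment_mode=mode)
    print(mode, "->", span)

# On my setup this prints:
# strict -> None
# contract -> None
# expand -> Clara,54830,Cuba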
The code that I'm using:
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()

training_dataset = [[
    "Department of Chemistry,Central University of Las Villas,Santa Clara,Villa Clara,54830,Cuba.",
    [
        [57, 68, "city_name"],
        [87, 91, "country_name"],
    ],
]]

for text, annotations in training_dataset:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label)
        ents.append(span)
    doc.ents = ents
    db.add(doc)
I have pasted the JSON object read from the file directly into the program for debugging purposes. When I remove the 54830, part, the program runs successfully.
I have also referred to this issue, but that one involves a special character, and this string doesn't contain any. Does anyone know why this happens with every string that contains a number?
CodePudding user response:
The error TypeError: object of type 'NoneType' has no len() occurs in the line doc.ents = ents when one of the entries in ents is None.
The reason for the None in the list is that doc.char_span(start, end, label) returns None when the start and end provided don't align with token boundaries.
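One way to see which annotation is misaligned is to check for None before assigning doc.ents. A minimal sketch using the data from the question:

import spacy

nlp = spacy.blank("en")
text = "Department of Chemistry,Central University of Las Villas,Santa Clara,Villa Clara,54830,Cuba."
annotations = [[57, 68, "city_name"], [87, 91, "country_name"]]

doc = nlp(text)
ents = []
for start, end, label in annotations:
    span = doc.char_span(start, end, label)
    if span is None:
        # Report the offending annotation instead of appending None
        print(f"Misaligned entity {label!r}: chars [{start}:{end}] = {text[start:end]!r}")
    else:
        ents.append(span)
doc.ents = ents  # safe now: no None entries

This prints: Misaligned entity 'country_name': chars [87:91] = 'Cuba'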
The tokenizer of the model (spacy.blank("en")) doesn't behave as needed for this use case. It seems it doesn't produce a token boundary at a comma that follows a number when there is no space after the comma.
Examples:
Tokenizing a number with decimals:
>>> import spacy
>>> nlp = spacy.blank("en")
>>> nlp.tokenizer.explain("5,1")
[('TOKEN', '5,1')]
One single token.
Tokenizing a number comma letter:
>>> nlp.tokenizer.explain("5,a")
[('TOKEN', '5,a')]
One single token.
Tokenizing a letter comma letter:
>>> nlp.tokenizer.explain("a,a")
[('TOKEN', 'a'), ('INFIX', ','), ('TOKEN', 'a')]
Three tokens.
Tokenizing a number comma space letter:
>>> nlp.tokenizer.explain("5, a")
[('TOKEN', '5'), ('SUFFIX', ','), ('TOKEN', 'a')]
Three tokens.
Tokenizing a number comma space number:
>>> nlp.tokenizer.explain("5, 1")
[('TOKEN', '5'), ('SUFFIX', ','), ('TOKEN', '1')]
Three tokens.
Therefore, with the default tokenizer, a space is needed after a comma that follows a number for the comma to create a token boundary.
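Applying the same check to the problematic chunk of the question's text confirms this: everything between the spaces stays a single token (plus the trailing period as a suffix), so the offsets for Cuba (87-91) can't align with a token start:
>>> nlp.tokenizer.explain("Clara,54830,Cuba.")
[('TOKEN', 'Clara,54830,Cuba'), ('SUFFIX', '.')]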
Workarounds:
- Preprocess your text to add a space after the commas you want tokens to split on. This also requires updating the start and end values of the annotations.
- Create a custom tokenizer as described in the spaCy documentation: https://spacy.io/usage/linguistic-features#native-tokenizers. A lighter-weight variant that only extends the default infix rules is sketched below.
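For the second workaround, a full custom tokenizer isn't strictly necessary: spaCy also lets you extend the default infix rules. A minimal sketch, where the two extra regexes (splitting on a comma between a digit and a letter) are my own additions, not part of the spaCy defaults:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Extra infix patterns: split on a comma between a digit and a letter,
# in either direction. These regexes are custom, not spaCy defaults.
extra_infixes = [r"(?<=[0-9]),(?=[A-Za-z])", r"(?<=[A-Za-z]),(?=[0-9])"]
infixes = list(nlp.Defaults.infixes) + extra_infixes
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

text = ("Department of Chemistry,Central University of Las Villas,"
        "Santa Clara,Villa Clara,54830,Cuba.")
doc = nlp(text)
print(doc.char_span(87, 91, "country_name"))  # strict alignment now succeeds: Cuba

This approach has the advantage that the text itself is unchanged, so the character offsets in the annotations from the JSON file remain valid as-is.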