I am dealing with fairly large corpora, and my DocBin object gets killed when I try to save it. Both to_disk and to_bytes end with the process printing "Killed".
My Python knowledge is limited, so it isn't obvious to me how to work around the issue. Can you help?
Here is my code (very straightforward and basic):
nlp = spacy.blank("en")
for text, annotations in train_data:
doc = nlp(text)
ents = []
for start, end, label in eval(annotations)['entities']:
span = doc.char_span(start, end, label=label)
if (span is None):
continue
ents.append(span)
doc.ents = ents
db.add(doc)
db.to_disk("../Spacy/train.spacy")```
CodePudding user response:
You are probably running out of RAM. Instead of building one huge DocBin, split your annotations across multiple DocBin files. You can pass a directory to `--paths.train` with `spacy train` instead of a single `.spacy` file if you have multiple `.spacy` files; a sketch of this follows.
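
For example, here is a minimal sketch of that idea, assuming `train_data` is the same list of `(text, annotations)` pairs as in your code; the chunk size of 1000 and the `../Spacy/train/` output directory are arbitrary choices you can adjust:

```
import os
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
os.makedirs("../Spacy/train", exist_ok=True)

chunk_size = 1000  # docs per .spacy file; tune to your memory budget
db = DocBin()
file_index = 0

for i, (text, annotations) in enumerate(train_data, start=1):
    doc = nlp(text)
    ents = []
    for start, end, label in eval(annotations)['entities']:
        span = doc.char_span(start, end, label=label)
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)

    # Flush the current DocBin to disk every chunk_size docs and start a new one,
    # so no single DocBin ever holds the whole corpus in memory.
    if i % chunk_size == 0:
        db.to_disk(f"../Spacy/train/train_{file_index}.spacy")
        db = DocBin()
        file_index += 1

# Write any remaining docs
if len(db) > 0:
    db.to_disk(f"../Spacy/train/train_{file_index}.spacy")
```

Then point training at the directory instead of a single file, e.g. `python -m spacy train config.cfg --paths.train ../Spacy/train --paths.dev ...`, and spaCy's corpus reader will load all the `.spacy` files under that path.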