How to make spaCy train faster on NER for Persian language


I have a blank spaCy model. To create the config file, I used the Training Pipelines & Models widget with these settings:

Language = Arabic
Components = ner
Hardware = CPU
Optimize for = accuracy

Then in the config file I changed:

[nlp]
lang = "ar"

to

[nlp]
lang = "fa"

because the widget does not support Persian, so I can't use the GPU (transformer) option.

And as you know, the accuracy preset is very slow, and I have 400,000 records.

This is my config file:

[paths]
train = null
dev = null
vectors = null

[system]
gpu_allocator = null

[nlp]
lang = "fa"
pipeline = ["tok2vec","ner"]
batch_size = 1000

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

[initialize]
vectors = ${paths.vectors}

How can I make the training process faster?

CodePudding user response:

You might just be using only one core of your CPU, as that is the default for Python if I recall correctly. I would look into parallelizing the job with joblib and increasing your chunk size.

See: https://prrao87.github.io/blog/spacy/nlp/performance/2020/05/02/spacy-multiprocess.html#Option-3:-Parallelize-the-work-using-joblib
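For illustration, here is a minimal sketch along the lines of that post: split the texts into chunks and run nlp.pipe in several worker processes via joblib. The model path "my_ner_model", the chunk size, and the texts list are placeholders, and note this parallelizes applying a trained pipeline with nlp.pipe rather than spacy train itself.

from joblib import Parallel, delayed
import spacy

def process_chunk(texts):
    # Load the pipeline inside each worker process to avoid pickling the model
    nlp = spacy.load("my_ner_model")  # placeholder: path to your trained pipeline
    return [[(ent.text, ent.label_) for ent in doc.ents] for doc in nlp.pipe(texts)]

def chunker(items, chunk_size):
    # Yield slices of roughly chunk_size texts each
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]

texts = ["..."]  # replace with your own list of documents
results = Parallel(n_jobs=4)(
    delayed(process_chunk)(chunk) for chunk in chunker(texts, 1000)
)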

CodePudding user response:

To speed up training you have a few options.

Change the evaluation frequency. It's not in the config the widget generates, but there's an eval_frequency option - it should be filled in if you use fill-config as recommended. The default value is relatively low, and evaluation is slow. You should increase this value a lot if you have a large amount of training data.
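For example, after running fill-config the [training] block will contain an eval_frequency key you can raise. Only the relevant lines are shown here, and the value is just an illustration to tune for your data size:

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
eval_frequency = 4000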

Use the efficiency presets instead of accuracy. If speed is an issue then you should try this. For your pipeline, the relevant options are whether to include static vectors or not, and the width or depth of your tok2vec. Note this alone won't affect speed that much, but because it definitely reduces memory usage it can be usefully combined with the next option.
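Concretely, that means something like the following in your tok2vec section. The values roughly mirror what the widget's efficiency preset produces and are only indicative:

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3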

Increase batch size. In training the time to process a batch is relatively constant, so larger batches means fewer batches for the same data, which means faster training. How large a batch you can handle depends on the size of your documents and your hardware.
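With your config that means raising the limits of the word-based batcher, for example as below. The numbers are illustrative; how far you can push them depends on your document lengths and available memory:

[training.batcher.size]
@schedules = "compounding.v1"
start = 500
stop = 3000
compound = 1.001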

Use less training data. This is very rarely something that I'd recommend, but if you have 400,000 records you shouldn't need that many to get a good NER model. (How many classes do you have?) Try 10,000 to start with and see how your model performs, and scale up until you get the accuracy/speed tradeoff you want. This will also help you figure out if there is some kind of issue with your data more quickly.
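If your training data is stored as a .spacy DocBin file, a quick way to try a smaller sample is something like this sketch (the paths and the sample size of 10,000 are placeholders):

import random
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("fa")
# Load the full training set (placeholder path)
docs = list(DocBin().from_disk("train.spacy").get_docs(nlp.vocab))
# Keep a random sample of 10,000 annotated docs and save it for training
sample = random.sample(docs, 10000)
DocBin(docs=sample).to_disk("train_sample.spacy")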

For tips on faster inference (not training), see the spaCy speed FAQ.
