I have a massive corpus: about 11 billion sentences, roughly 10 words each, split across more than 12,000 files ending in .txt.gz. I want to train a skip-gram model on it with Gensim's Word2Vec, and I'm using Gensim's multi-file streaming class PathLineSentences to read the data.
sentences = PathLineSentences('path')
w2vModel = Word2Vec(sentences,
                    vector_size=128,
                    window=5,
                    min_count=2,
                    workers=24,
                    epochs=200,
                    sg=1,
                    hs=1,
                    batch_words=100000,
                    compute_loss=True)
The problem is that the vocabulary-scan phase before training is very slow (because it can only run in a single thread?). Judging by the top command output, it has been running like that for almost 12 hours. Can the vocabulary scan at this stage be multithreaded, or is there any other way to speed it up? Thank you all.
CodePudding user response:
The initial vocabulary-scan is unfortunately single-threaded: it must read all data once, tallying up all words, to determine which rare words will be ignored, and rank all other words in frequency order.
You can have more control over the process if you refrain from passing the sentences corpus to the initial constructor. Instead, leave it off and call the later steps .build_vocab() and .train() yourself. (The .build_vocab() step will be the long single-threaded step.) You then have the option of saving the model after .build_vocab() has completed. (Potentially, you could then re-load it, tinker with some settings, and run other training sessions without requiring a full repeated scan.)
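A minimal sketch of that build_vocab/train split is below; the corpus directory and the specific parameter values are placeholders, not recommendations:

from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

sentences = PathLineSentences('path')  # placeholder corpus directory

# Construct the model without a corpus, so nothing is scanned or trained yet.
model = Word2Vec(vector_size=128, window=5, min_count=5, workers=8, sg=1)

# The long, single-threaded vocabulary survey.
model.build_vocab(sentences)

# Checkpoint the surveyed model so the scan never has to be repeated.
model.save('w2v_after_vocab.model')

# Multi-threaded training; epochs must be passed explicitly here.
model.train(sentences,
            total_examples=model.corpus_count,
            epochs=model.epochs)

model.save('w2v_trained.model')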
Also, if you're just starting out, I'd recommend doing initial trials with a smaller dataset (perhaps a subsampled 1/10th or 1/20th of your whole corpus) so that you can get your process working, and somewhat optimized, before attempting the full training.
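One way to do that subsampling, assuming the .txt.gz files sit in a single directory and hold one space-delimited sentence per line (the directory name and keep-every-Nth choice below are just illustrative), is a small restartable iterable that only reads a fraction of the files:

import gzip
import os

class SubsampledCorpus:
    """Stream tokenized sentences from every Nth .txt.gz file in a directory."""
    def __init__(self, dirname, keep_every=10):
        self.dirname = dirname
        self.keep_every = keep_every

    def __iter__(self):
        files = sorted(f for f in os.listdir(self.dirname) if f.endswith('.txt.gz'))
        for i, fname in enumerate(files):
            if i % self.keep_every != 0:
                continue
            with gzip.open(os.path.join(self.dirname, fname), 'rt', encoding='utf-8') as fh:
                for line in fh:
                    yield line.split()

# Roughly 1/10th of the corpus, for quick trial runs:
small_corpus = SubsampledCorpus('path', keep_every=10)

Word2Vec accepts any restartable iterable of token lists, so small_corpus can be passed anywhere the full PathLineSentences corpus would be.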
Separately, regarding your implied setup:
- Using a value as low as min_count=2 is usually a bad idea with Word2Vec & related algorithms. The model can only achieve useful vectors for words with a variety of multiple usages - so the class's default of min_count=5 is a good minimum value, and when using a larger corpus (like yours) it makes more sense to increase this floor than lower it. (While increasing min_count won't speed the vocabulary-survey, it will speed training & typically improves the quality of the remaining words' vectors, because without the 'noise' of rare words, other words' training goes better.)
- Even if your machine has 24 CPU cores, in the traditional (corpus-iterator) mode, Word2Vec training throughput usually maxes out somewhere in the range of 6-12 workers (largely due to Python GIL bottlenecks). Higher values slow things down. (Unfortunately, the best value can only be found via trial & error - starting training & observing the logged rate over a few minutes - and the optimal number of workers will change with other settings like window or negative.)
- With such a large dataset, epochs=200 is overkill (and likely to take a very long time). With a large dataset, you're more likely to be able to use less than the default epochs=5 than you'll need to use more.
- By setting hs=1 without also setting negative=0, you've enabled hierarchical-softmax training while leaving the default negative-sampling active. That's likely to at least double your training time, & make your model much larger, for no benefit. With a large dataset, it's odd to consider hs=1 mode at all - it becomes less performant as models grow larger. (You should probably just avoid touching the hs value unless you're sure you need to.)
- Similarly, it's unclear why you'd want to change the defaults for batch_words & compute_loss. (The loss-tallying will slow things down, but also doesn't work very well yet - so it's rare to need.) In general, your setup changes a lot of things best left untouched, unless/until you're sure you can measure the net effects of the changes; a configuration sketch reflecting these points follows after this list.
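Putting those points together, a hedged sketch of a more conventional setup follows. The specific workers, epochs and min_count values are illustrative guesses you'd still tune for your own corpus, not fixed recommendations:

import logging
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

# INFO-level logging shows progress and the effective words/sec rate, which is
# how you'd compare different workers values in short trial runs.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

sentences = PathLineSentences('path')  # placeholder corpus directory

model = Word2Vec(sentences,
                 vector_size=128,
                 window=5,
                 min_count=5,   # or higher, given the corpus size
                 workers=8,     # tune in the 6-12 range by watching the logged rate
                 epochs=5,      # the default; rarely more with this much data
                 sg=1)          # skip-gram, as in the question; hs/negative left at defaults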
CodePudding user response:
This may not be an ideal source-code snippet, but there is a good article discussing some of the implications of scaling Word2Vec training with Gensim.