I have a massive corpus: about 11 billion sentences, roughly 10 words each, split across more than 12,000 files ending in .txt.gz. I want to train a skip-gram model on it with Gensim's Word2Vec, and I'm using Gensim's multi-file streaming class PathLineSentences to read the data.
sentences = PathLineSentences('path')
w2vModel = Word2Vec(sentences,
                    vector_size=128,
                    window=5,
                    min_count=2,
                    workers=24,
                    epochs=200,
                    sg=1,
                    hs=1,
                    batch_words=100000,
                    compute_loss=True)
The problem is that the vocabulary-scan phase before training is very slow (because it can only run in a single thread?). Judging by the top command output, it has been running like that for almost 12 hours. Can the vocabulary scan at this stage be multithreaded, or is there any other way to speed it up? Thank you all.
CodePudding user response:
The initial vocabulary-scan is unfortunately single-threaded: it must read all data once, tallying up all words, to determine which rare words will be ignored, and rank all other words in frequency order.
You can have more control over the process if you refrain from passing the sentences corpus to the initial constructor. Instead, leave it off and call the later steps .build_vocab() and .train() yourself. (The .build_vocab() step will be the long single-threaded step.) You then have the option of saving the model after .build_vocab() has completed. (Potentially, you could then re-load it, tinker with some settings, and run other training sessions without requiring a full repeated scan.)
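A minimal sketch of that build_vocab/train split is below; the corpus directory and the specific parameter values are placeholders, not recommendations:

from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

sentences = PathLineSentences('path')  # placeholder corpus directory

# Construct the model without a corpus, so nothing is scanned or trained yet.
model = Word2Vec(vector_size=128, window=5, min_count=5, workers=8, sg=1)

# The long, single-threaded vocabulary survey.
model.build_vocab(sentences)

# Checkpoint the surveyed model so the scan never has to be repeated.
model.save('w2v_after_vocab.model')

# Multi-threaded training; epochs must be passed explicitly here.
model.train(sentences,
            total_examples=model.corpus_count,
            epochs=model.epochs)

model.save('w2v_trained.model')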
Also, if you're just starting out, I'd recommend doing initial trials with a smaller dataset (perhaps a subsampled 1/10th or 1/20th of your whole corpus) so that you can get your process working, and somewhat optimized, before attempting the full training.
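One way to do that subsampling, assuming the .txt.gz files sit in a single directory and hold one space-delimited sentence per line (the directory name and keep-every-Nth choice below are just illustrative), is a small restartable iterable that only reads a fraction of the files:

import gzip
import os

class SubsampledCorpus:
    """Stream tokenized sentences from every Nth .txt.gz file in a directory."""
    def __init__(self, dirname, keep_every=10):
        self.dirname = dirname
        self.keep_every = keep_every

    def __iter__(self):
        files = sorted(f for f in os.listdir(self.dirname) if f.endswith('.txt.gz'))
        for i, fname in enumerate(files):
            if i % self.keep_every != 0:
                continue
            with gzip.open(os.path.join(self.dirname, fname), 'rt', encoding='utf-8') as fh:
                for line in fh:
                    yield line.split()

# Roughly 1/10th of the corpus, for quick trial runs:
small_corpus = SubsampledCorpus('path', keep_every=10)

Word2Vec accepts any restartable iterable of token lists, so small_corpus can be passed anywhere the full PathLineSentences corpus would be.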
Separately, regarding your implied setup:
- Using a value as low as min_count=2 is usually a bad idea with Word2Vec & related algorithms. The model can only achieve useful vectors for words with a variety of multiple usages - so the class's default of min_count=5 is a good minimum value, and when using a larger corpus (like yours) it makes more sense to increase this floor than lower it. (While increasing min_count won't speed the vocabulary-survey, it will speed training & typically improves the quality of the remaining words' vectors, because without the 'noise' of rare words, other words' training goes better.)
- Even if your machine has 24 CPU cores, in the traditional (corpus-iterator) mode, Word2Vec training throughput usually maxes out somewhere in the range of 6-12 workers (largely due to Python GIL bottlenecks). Higher values slow things down. (Unfortunately, the best value can only be found via trial & error - starting training & observing the logged rate over a few minutes - and the optimal number of workers will change with other settings like window or negative.)
- With such a large dataset, epochs=200 is overkill (and likely to take a very long time). With a large dataset, you're more likely to be able to use less than the default epochs=5 than you'll need to use more.
- By setting hs=1 without also setting negative=0, you've enabled hierarchical-softmax training while leaving the default negative-sampling active. That's likely to at least double your training time, & make your model much larger, for no benefit. With a large dataset, it's odd to consider hs=1 mode at all - it becomes less performant as models grow larger. (You should probably just avoid touching the hs value unless you're sure you need to.)
- Similarly, it's unclear why you'd want to change the defaults for batch_words & compute_loss. (The loss-tallying will slow things down, but also doesn't work very well yet - so it's rare to need.) In general, your setup changes a lot of things best left untouched, unless/until you're sure you can measure the net effects of the changes; a configuration sketch reflecting these points follows after this list.
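Putting those points together, a hedged sketch of a more conventional setup follows. The specific workers, epochs and min_count values are illustrative guesses you'd still tune for your own corpus, not fixed recommendations:

import logging
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

# INFO-level logging shows progress and the effective words/sec rate, which is
# how you'd compare different workers values in short trial runs.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

sentences = PathLineSentences('path')  # placeholder corpus directory

model = Word2Vec(sentences,
                 vector_size=128,
                 window=5,
                 min_count=5,   # or higher, given the corpus size
                 workers=8,     # tune in the 6-12 range by watching the logged rate
                 epochs=5,      # the default; rarely more with this much data
                 sg=1)          # skip-gram, as in the question; hs/negative left at defaults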
CodePudding user response:
This may not be an ideal source-code snippet, but there is a good article discussing some of the implications of scaling Word2Vec training with Gensim.