training a Word2Vec model with a lot of data


I am using gensim to train a word2vec model. The problem is that my data is very large (about 10 million documents), so my session crashes when I try to train the model.

Note that I am able to load all the data into RAM at once in a pandas DataFrame df, which looks like:

text               id
long long text      1
another long one    2
...                 ...

My simple approach is to do the following:

from gensim.models import Word2Vec

tokens = df['text'].str.split()
model = Word2Vec(tokens, min_count=50)

However, my session crashes when it tries to create all the tokens at once. Is there a better way to proceed in gensim, such as feeding the data line by line?

Thanks!

CodePudding user response:

Iterate over your dataframe row by row, tokenizing just one row at a time. Write each tokenized text to a file in turn, with spaces between the tokens, and a line-end at the end of each text.

You can then use the LineSentence utility class in Gensim to provide a read-from-disk iterable corpus to the Word2Vec model.
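A minimal sketch of that pipeline, assuming the DataFrame from the question (df with a text column), plain whitespace tokenization, and a hypothetical output file name corpus.txt:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

corpus_path = 'corpus.txt'  # hypothetical file name; pick any writable path

# Write one tokenized text per line, tokens separated by spaces.
# itertuples processes one row at a time, so only a single text is in memory.
with open(corpus_path, 'w', encoding='utf-8') as fout:
    for row in df.itertuples(index=False):
        tokens = row.text.split()  # swap in your own tokenizer here if needed
        fout.write(' '.join(tokens) + '\n')

# LineSentence streams the file from disk, yielding one token list per line.
model = Word2Vec(LineSentence(corpus_path), min_count=50)

Writing the corpus to disk once also lets Word2Vec make the multiple passes it needs (one to build the vocabulary, then the training epochs) without re-tokenizing everything in memory.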
