I am using gensim to train a word2vec model. The problem is that my data is very large (about 10 million documents), so my session crashes when I try to estimate the model.
Note that I am able to load all the data into RAM at once in a Pandas dataframe df, which looks like:
text                id
long long text      1
another long one    2
...                 ...
My simple approach is to do the following:
from gensim.models import Word2Vec

tokens = df['text'].str.split(r'\s+')
model = Word2Vec(tokens, min_count=50)
However, my session crashes when it tries to create all the tokens at once. Is there a better way to proceed in gensim, such as feeding the data line by line?
Thanks!
CodePudding user response:
Iterate over your dataframe row by row, tokenizing just one row at a time. Write each tokenized text to a file in turn, with spaces between the tokens, and a line-end at the end of each text.
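A minimal sketch of that first step, assuming your dataframe has a 'text' column as in the question, that a plain whitespace split is an acceptable tokenizer, and that corpus.txt is just a placeholder file name:

from pandas import DataFrame  # df is assumed to already exist as in the question

corpus_path = 'corpus.txt'  # hypothetical output file

with open(corpus_path, 'w', encoding='utf-8') as f:
    for text in df['text']:
        tokens = str(text).split()          # tokenize one row at a time
        f.write(' '.join(tokens) + '\n')    # one space-separated text per line

Only one row is in memory at a time here, so the tokenization step no longer needs to hold 10 million tokenized documents at once.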
You can then use the LineSentence utility class in Gensim to provide a read-from-disk iterable corpus to the Word2Vec model.
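Continuing the sketch above (corpus_path is the file written in the previous step):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams one space-delimited text per line from disk,
# so the full corpus is never loaded into memory at once.
sentences = LineSentence(corpus_path)
model = Word2Vec(sentences, min_count=50)

Because Word2Vec makes multiple passes over the corpus, a restartable iterable like LineSentence is preferable to a one-shot generator.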