I'm currently using the gensim 4.0 library to write this code, but I don't know why it keeps failing to find a similar word. At first, when I set min_count=5, the error said I need to build a vocabulary first; after I reduced it to min_count=1, it raised a KeyError saying the key is not present. Full code with datasets is over here: https://github.com/JYjunyang/FYPDEMO Am I writing something wrong or missing some important steps? Everything works fine except this word2vec implementation. I'd appreciate any guidance. Take note: LemmaColumn is a dataframe after lemmatization
import time
from gensim.models import Word2Vec

def FeaturesExtraction():
    b1 = time.time()
    word2vec = Word2Vec(sentences=LemmaColumn, vector_size=100, window=5, min_count=1, workers=8, sg=1)
    train_time = time.time() - b1  # time the training itself
    print(word2vec.wv.most_similar('virus', topn=10))
And I'm not sure why, after training on 10k records, the vocabulary only has 7 unique words:
word #0/7 is t
word #1/7 is l
word #2/7 is x
word #3/7 is e
word #4/7 is _
word #5/7 is u
word #6/7 is f
CodePudding user response:
Your LemmaColumn variable probably isn't in the format Word2Vec needs for the sentences argument. It needs a Python sequence: something that can be iterated over multiple times, like a list or another re-iterable object. And in that sequence, every individual item must itself be a list-of-string-tokens (words).
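For illustration, a correctly-shaped corpus (these tokens are made up, not from your dataset) would look something like:

expected_sentences = [
    ['virus', 'spread', 'quickly', 'through', 'the', 'population'],    # one document, already tokenized
    ['new', 'virus', 'variant', 'detect', 'in', 'patient', 'sample'],  # another document
]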
Your tiny vocabulary is what I'd expect to see if instead:
LemmaColumn = [
['f', 'u', 'l', 'l', '-', 't', 'e', 'x', 't'],
]
…or even…
LemmaColumn = [
['full-text'],
]
…because Python will happily treat a plain string (like 'full-text') as if it were a list of 1-character strings. Thus your entire training vocabulary is only the characters of that single string, likely a column name rather than the column data you want to be using.
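You can see that character-by-character behaviour directly in a Python shell:

>>> list('full-text')
['f', 'u', 'l', 'l', '-', 't', 'e', 'x', 't']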
Double-check what's in LemmaColumn. Perform the necessary transformations on the column's data to make it the kind of sequence Word2Vec expects, and confirm it looks sensible before trying Word2Vec again.
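As a rough sketch of that transformation (the column name 'Lemma' and the assumption that each row holds either a whitespace-separated string or a list of tokens are guesses on my part, not taken from your repo):

import pandas as pd

def to_sentences(df, column='Lemma'):
    """Turn a dataframe column of lemmatized text into a list of token-lists."""
    sentences = []
    for value in df[column]:
        if isinstance(value, str):
            sentences.append(value.split())  # plain string -> list of word tokens
        else:
            sentences.append(list(value))    # already a list/iterable of tokens
    return sentences

# Made-up rows in the assumed layout, just to show the shape of the output:
df = pd.DataFrame({'Lemma': ['virus spread quickly', 'new virus variant detect']})
print(to_sentences(df))
# [['virus', 'spread', 'quickly'], ['new', 'virus', 'variant', 'detect']]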
Also: running with logging enabled at least at the INFO level will show a lot more of the model's progress. As you learn to read the reported steps, problems such as weirdly low counts of texts/words, or steps that should take real time on the right amount (lots) of data but finish instantly, will become evident much sooner.
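Turning that on is just the usual Python logging setup:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)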
Finally, note that min_count=1 is essentially always a bad idea with an algorithm like word2vec. Good vectors only come from multiple varied examples of the same word's usage, hence the default min_count=5. Keeping rare words not only tends to produce poor vectors for those rare words; because natural-language text tends to have lots of such rare words, so much of the model's time and space goes to the (nearly hopeless) task of improving those junk words' vectors that other nearby words' vectors suffer as well.
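Once the input is fixed and min_count is back at a saner value, a quick sanity check before querying (and a guard against the KeyError you hit) might look like this, reusing the sentences list from the sketch above:

from gensim.models import Word2Vec

model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, workers=8, sg=1)
print(len(model.wv))  # vocabulary size: expect far more than the 7 keys you saw
if 'virus' in model.wv:
    print(model.wv.most_similar('virus', topn=10))
else:
    print("'virus' is not in the vocabulary; check the corpus and min_count")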