Word2vec raise KeyError(f"Key '{key}' not present")

I'm currently using the gensim 4.0 library. I don't know why it keeps failing to find similar words. At first, with min_count=5, the error said I had to build a vocabulary first; after I reduced it to min_count=1, it raises a KeyError saying the key is not present. Full code with datasets is here: https://github.com/JYjunyang/FYPDEMO Am I writing something wrong or missing some important steps? Everything else works fine, just not this word2vec implementation. I'd appreciate any guidance. Note: LemmaColumn is a dataframe after lemmatization.

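import time
from gensim.models import Word2Vec

def FeaturesExtraction():
    b1 = time.time()  # start the timer before training so train_time is meaningful
    word2vec = Word2Vec(sentences=LemmaColumn, vector_size=100, window=5, min_count=1, workers=8, sg=1)
    train_time = time.time() - b1
    print(word2vec.wv.most_similar('virus', topn=10))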

I'm also not sure why, after training on 10k records, the vocabulary has only 7 unique words:
word #0/7 is t
word #1/7 is l
word #2/7 is x
word #3/7 is e
word #4/7 is _
word #5/7 is u
word #6/7 is f

CodePudding user response：

Your LemmaColumn variable probably isn't in the format Word2Vec needs for the sentences argument. It needs a Python sequence: something that can be iterated over multiple times, like a list or another re-iterable object. And in that sequence, every individual item must itself be a list of string tokens (words).
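
To make the expected shape concrete, here is a minimal sketch of a corpus in that format, plus a small check you could run before training. The helper name is_token_corpus and the sample sentences are made up for illustration and are not part of gensim. The check is also stricter than Word2Vec itself, since it only accepts plain lists.

```python
def is_token_corpus(corpus):
    """Return True if corpus is a list whose items are all lists of str tokens."""
    if not isinstance(corpus, list):
        return False
    return all(
        isinstance(sentence, list)
        and all(isinstance(token, str) for token in sentence)
        for sentence in corpus
    )

# The shape the `sentences` argument is meant to receive (invented sample data).
good_corpus = [
    ["the", "virus", "spread", "quickly"],
    ["vaccine", "trial", "start", "today"],
]

# A list of plain strings is NOT that shape: each item is a str, not a list.
bad_corpus = ["the virus spread quickly", "vaccine trial start today"]

print(is_token_corpus(good_corpus))
print(is_token_corpus(bad_corpus))
```

The first call should report a valid corpus and the second should not, because its items are untokenized strings.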

Your tiny vocabulary is what I'd expect to see if you instead had:

LemmaColumn = [
    ['f', 'u', 'l', 'l', '_', 't', 'e', 'x', 't'],
]

…or even…

LemmaColumn = [
    'full_text',
]

…because Python will happily treat a plain string (like 'full_text') as if it were a list filled with 1-character strings. Thus your entire training vocabulary is only the characters of that single string, likely a column name rather than the column data you want to be using.
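
The effect is easy to reproduce without gensim. The sketch below imitates only the token-counting step, looping over sentences and then over each sentence's items. count_tokens is a made-up helper, and 'full_text' is assumed as the column name because it matches the seven characters in the reported vocabulary.

```python
from collections import Counter

def count_tokens(sentences):
    """Count tokens the way a nested loop over sentences and tokens sees them."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence:  # a str "sentence" yields single characters here
            counts[token] += 1
    return counts

# One plain string where a list of word tokens was expected.
as_string = count_tokens(["full_text"])
print(sorted(as_string))  # ['_', 'e', 'f', 'l', 't', 'u', 'x']
print(len(as_string))     # 7

# Properly tokenized input keeps whole words.
as_tokens = count_tokens([["full", "text"]])
print(sorted(as_tokens))  # ['full', 'text']
```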

Double-check what's in LemmaColumn. Perform the necessary transformations on the column's data to make it the kind of sequence Word2Vec expects, and confirm it looks sensible before trying Word2Vec.
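
One possible transformation, assuming each row of the lemmatized column holds either a whitespace-separated string or an already-tokenized list. The helper to_token_lists and the sample rows are hypothetical. With pandas you would pass one column's values rather than the whole DataFrame, since iterating over a DataFrame yields its column labels.

```python
def to_token_lists(rows):
    """Turn an iterable of texts into a list of token lists."""
    sentences = []
    for row in rows:
        if isinstance(row, str):
            tokens = row.split()                # naive whitespace tokenization
        elif isinstance(row, (list, tuple)):
            tokens = [str(tok) for tok in row]  # already tokenized
        else:
            continue                            # skip missing values such as NaN
        if tokens:                              # skip empty rows
            sentences.append(tokens)
    return sentences

# Hypothetical sample standing in for the column's values.
rows = ["virus spread fast", ["vaccine", "trial", "begin"], ""]
sentences = to_token_lists(rows)

# Eyeball a few items before handing them to Word2Vec.
for sentence in sentences[:3]:
    print(sentence)
```

If the printed items are lists of whole words, the data is in the shape the sentences argument expects.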

Also: running with logging on to at least the INFO level will show a lot more of the model's progress. As you learn to read the reported steps, problems become evident sooner: weirdly low counts of texts or words, for example, or steps completing instantly that would take real time if they were working on the right (large) amount of data.
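
Turning that logging on takes a few lines with Python's standard logging module, run before the model is created. The format string is just one reasonable choice, and force=True (Python 3.8+) is added here on the assumption that some other library may already have configured logging.

```python
import logging

logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,  # replace any handlers configured earlier (Python 3.8+)
)

# gensim logs through loggers named "gensim.*", which inherit the root level.
gensim_logger = logging.getLogger("gensim")
print(gensim_logger.isEnabledFor(logging.INFO))
```

With this in place, the vocabulary-building and training steps should report the number of sentences and unique words they processed.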

Finally, note that min_count=1 is essentially always a bad idea with an algorithm like word2vec. Good vectors only come from multiple varied examples of the same word's usage, hence the default min_count=5. Keeping rare words tends to produce poor vectors for those words. Worse, natural-language text has so many rare words that much of the model's time and space goes to the (nearly hopeless) task of improving those junk vectors, and other nearby words' vectors suffer as well.
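
To get a feel for how many words a cutoff would discard, you can count token frequencies yourself before training. This is only an illustration of frequency-based pruning, not gensim's internal code. surviving_vocab and the toy corpus are invented for the example.

```python
from collections import Counter

def surviving_vocab(sentences, min_count):
    """Return the set of words that appear at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Toy corpus: "virus" appears five times, every other word only once.
sentences = [
    ["virus", "spread"],
    ["virus", "mutate"],
    ["virus", "vaccine"],
    ["virus", "outbreak"],
    ["virus", "lockdown"],
]

print(len(surviving_vocab(sentences, min_count=1)))     # 6
print(sorted(surviving_vocab(sentences, min_count=5)))  # ['virus']
```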