How to find bigram similarity in a gensim word2vec model


Here I have a word2vec model, suppose I use the google-news-300 model

import gensim.downloader as api
word2vec_model300 = api.load('word2vec-google-news-300')

I want to find similar words for "AI" or "artificial intelligence", so I want to write

word2vec_model300.most_similar("artifical intelligence")

but I get this error:

KeyError: "word 'artificial intelligence' not in vocabulary"

So what is the right way to get similar words for a bigram?

Thanks in advance!

CodePudding user response:

At one level, when a word-token isn't in a fixed set of word-vectors, it's because the creators of that set chose not to train/model that token. So, anything you do will only be a crude workaround for its absence.

Note, though, that when Google prepared those vectors – based on a dataset of news articles from before 2012 – they also ran a statistical multigram-combination pass over it, creating multigram tokens joined by _ characters. So, first check whether a vector for 'artificial_intelligence' is present.
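For instance, a minimal membership check (assuming gensim 4.x, where a KeyedVectors vocabulary is the .key_to_index dict; older 3.x versions expose .vocab instead):

# look up the pre-combined multigram token rather than the space-separated phrase
if 'artificial_intelligence' in word2vec_model300.key_to_index:
    print(word2vec_model300.most_similar('artificial_intelligence'))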

If it isn't, you could try other rough workarounds, like averaging together the vectors for 'artificial' and 'intelligence' – though of course that won't really capture what people mean by the distinct combination of those words, just a blend of the meanings of the independent words.
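For instance, a rough sketch of such averaging with numpy – average_vector is just an illustrative name, reused in the example below:

import numpy as np

# average the two individual word-vectors as a crude stand-in for the bigram
average_vector = np.mean(
    [word2vec_model300['artificial'], word2vec_model300['intelligence']],
    axis=0,
)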

Gensim's .most_similar() method can take either a raw vector you've created by operations such as averaging (like average_vector above), or a list of multiple words which it will average for you, as arguments via its explicit positive keyword parameter. For example:

word2vec_model300.most_similar(positive=[average_vector])

...or...

word2vec_model300.most_similar(positive=['artificial', 'intelligence'])
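(Note the two forms aren't guaranteed to return identical neighbors: when given word keys, .most_similar() averages the unit-normalized vectors, which can differ slightly from an average of the raw vectors.)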

Finally, though Google's vectors are handy, they're somewhat dated now, & from a particular domain (popular news articles) where word senses may not match those used in other domains (or more recently). So you may want to seek alternate vectors, or train your own if you have sufficient data from your area of interest, to have appropriate meanings – including vectors for any particular multigrams you choose to tokenize in your data.
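A rough sketch of that training path, using Gensim's Phrases model to statistically combine frequent bigrams before training (gensim 4.x API; my_corpus is a placeholder for your own re-iterable sequence of token lists, & the min_count/threshold values are only illustrative):

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

# my_corpus: a re-iterable sequence of tokenized sentences,
# e.g. [['artificial', 'intelligence', 'research'], ...]
phrases = Phrases(my_corpus, min_count=5, threshold=10.0)  # detect frequent bigrams
bigram_transformer = phrases.freeze()  # cheaper, read-only phrase-detector

# train on the transformed corpus, where detected bigrams appear as single _-joined tokens
model = Word2Vec(bigram_transformer[my_corpus], vector_size=300, window=5, min_count=5)

model.wv.most_similar('artificial_intelligence')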
