Word2Vec for network embedding ignores words (nodes) in corpus (walks)


I am trying to run Word2Vec (skip-gram) on a set of walks to train a network embedding model. My graph has 169343 nodes, i.e., words in the Word2Vec sense, and for each node I run a random walk of length 80. Therefore I have (169343, 80) walks, i.e., sentences in the Word2Vec sense. After running skip-gram for 3 epochs I only get 28015 vectors instead of 169343. Here is the code for my network embedding.

import numpy as np
from gensim.models import Word2Vec


def run_skipgram(walk_path):
    # each row of the saved array is one walk (sentence) of 80 node ids
    walks = np.load(walk_path).tolist()

    skipgram = Word2Vec(sentences=walks, vector_size=128, negative=5, window=8,
                        sg=1, workers=6, epochs=3)

    # collect the learned vectors in ascending node-id order
    keys = list(map(int, skipgram.wv.index_to_key))
    keys.sort()

    vectors = [skipgram.wv[key] for key in keys]

    return np.array(vectors)

CodePudding user response:

Are you sure your walks corpus is what you expect, and what Gensim Word2Vec expects?

For example, is len(walks) equal to 169343? Is len(walks[0]) equal to 80? Is walks[0] a list of 80 string-tokens?
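A quick sanity check along those lines might look like this sketch (walk_path here is a hypothetical filename standing in for whatever you pass to run_skipgram):

import numpy as np

walk_path = "walks.npy"  # hypothetical path; use the file from your own pipeline
walks = np.load(walk_path).tolist()

print(len(walks))         # expect 169343 (one walk per node)
print(len(walks[0]))      # expect 80 (steps per walk)
print(type(walks[0][0]))  # the token type gensim will use as a vocabulary key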

Note also: by default Word2Vec uses min_count=5, so any token that appears fewer than 5 times is ignored during training. In most cases this minimum – or an even higher one! – makes sense, because tokens with only one, or a few, usage examples in typical natural-language training data can't get good word-vectors (but can, in aggregate, function as dilutive noise that worsens other vectors).

Depending on your graph, one walk from each node might not ensure that node appears at least 5 times in all the walks. So you could try min_count=1.
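For example, the same call from the question with only that default changed (a sketch, not tested on your data):

skipgram = Word2Vec(sentences=walks, vector_size=128, negative=5, window=8,
                    sg=1, workers=6, epochs=3,
                    min_count=1)  # keep every node, even ones seen fewer than 5 times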

But it'd probably be better to do 5 walks from every starting point, or enough walks to ensure all nodes appear at least 5 times. 169,343 * 80 * 5 is still only 67,737,200 training words, with a manageable vocabulary of 169,343 tokens. (If there's a problem holding the whole training set in memory as one list, you could make an iterable that generates the walks as needed, one by one, rather than all up-front.)
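A rough sketch of that streaming idea, assuming you have a list of node ids and your own walk routine – random_walk(start_node, length) is a hypothetical name standing in for it:

from gensim.models import Word2Vec

class WalkCorpus:
    """Re-iterable corpus that generates walks on the fly, so the full
    (nodes x walks_per_node) training set never has to sit in memory."""
    def __init__(self, nodes, walks_per_node=5, walk_length=80):
        self.nodes = nodes
        self.walks_per_node = walks_per_node
        self.walk_length = walk_length

    def __iter__(self):
        for _ in range(self.walks_per_node):
            for node in self.nodes:
                # random_walk() stands in for your own walk code; gensim just
                # needs each sentence as a list of tokens (strings are safest)
                yield [str(n) for n in random_walk(node, self.walk_length)]

skipgram = Word2Vec(sentences=WalkCorpus(nodes), vector_size=128, negative=5,
                    window=8, sg=1, workers=6, epochs=3)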

Alternatively, something like 5 walks per starting-node, but of only 20 steps each, would keep the corpus about the same size but guarantee each node appears at least 5 times.

Or even: adaptively keep adding walks until you're sure every node is represented enough times. For example, pick a random node, do a walk, keep a running tally of each node's appearances so far – and keep adding walks until every node's running total reaches the minimum you want.

Conceivably, it could take quite a while for the walks to happen upon some remote nodes, so another refinement might be: do some initial walk or walks, then tally how many visits each node got, & while the least-visited node is below the target min_count, start another walk from it – guaranteeing it at least one more visit.
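As a hedged sketch of that least-visited-node refinement (again with a hypothetical random_walk(start_node, length) that includes its starting node in the returned walk):

from collections import Counter

def adaptive_walks(nodes, walk_length=80, target_count=5):
    # start every node at zero so never-visited nodes are still tracked
    visits = Counter({node: 0 for node in nodes})
    walks = []

    def add_walk(start):
        walk = random_walk(start, walk_length)  # your own walk routine
        visits.update(walk)
        walks.append([str(n) for n in walk])

    # initial pass: one walk from every node
    for node in nodes:
        add_walk(node)

    # top-up pass: while the rarest node is under-represented, start a walk
    # from it, which guarantees it at least one more visit per iteration
    while True:
        rarest, count = min(visits.items(), key=lambda kv: kv[1])
        if count >= target_count:
            break
        add_walk(rarest)

    return walks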

This could help oversample less-connected regions, which might be good or bad. Notably, with natural-language text, the Word2Vec sample parameter is quite helpful for discarding certain overrepresented words, preventing them from redundantly monopolizing training time and ensuring less-frequent words also get good representations. (It's a parameter which can sometimes provide the double whammy of less training time and better results!) Ensuring your walks spend more time in less-connected areas might provide a similar advantage, especially if your downstream use of the vectors is just as interested in the less-visited regions.
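For reference, that downsampling is controlled directly on the model; a minimal sketch with the gensim default value made explicit (the exact value is something to tune for your data):

skipgram = Word2Vec(sentences=walks, vector_size=128, negative=5, window=8,
                    sg=1, workers=6, epochs=3,
                    sample=0.001)  # larger values downsample frequent tokens less aggressively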
