I am trying to create an emotion recognition model and for that I am using Word2Vec. I have a tokenized pandas data frame x_train['Utterance']
and I have used
model = gensim.models.Word2Vec(x_train['Utterance'], min_count = 1, vector_size = 100)
to create a vocabulary. Then, I created a dictionary embeddings_index that maps each word to its embedding vector, and a new column in my data frame where every word is replaced by its vector:
x_train['vector'] = x_train['Utterance'].explode().map(embeddings_index).groupby(level=0).agg(list)
Finally, I used pad_sequences so that each instance of the data set is padded to the length of the longest instance (since the sentences in the data set originally have different lengths):
x_train['vector'] = tf.keras.utils.pad_sequences(x_train.vector, maxlen = 30, dtype='float64', padding='post', truncating='post', value=0).tolist()
If min_count = 1 (one of the parameters of Word2Vec), everything is alright and x_train['vector'] is what I intend: a column of the embedding vectors of the tokenized sentences in x_train['Utterance']. However, when min_count != 1, the created vocabulary only contains the words that appear at least min_count times in x_train['Utterance']. Because of this, when creating x_train['vector'] by mapping the dictionary embeddings_index, the new column contains lists like [nan, [0.20900646, 0.76452744, 2.3117824], [0...., where nan corresponds to words that are not in the dictionary. Because of these nan values, tf.keras.utils.pad_sequences gives the following error: ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
I would like to remove the nan from each list but haven't been able to. I tried fillna(''), but it just removes the nan while leaving an empty entry in the list. Any idea?
CodePudding user response:
It seems the problem may be that x_train['Utterance'] includes a bunch of words that (after min_count trimming) aren't in the model. As a result you may be both miscalculating the true longest text (because you're counting unknown words) and getting some nonsense values (where no word-vector was available for a low-frequency word).
The simplest fix would be to stop using the original x_train['Utterance'] as your texts for steps that are limited to the smaller vocabulary of only those words with word-vectors. Instead, pre-filter those texts to eliminate words not present in the word-vector model. For example:
cleaned_texts = [[word for word in text if word in model.wv]
for text in x_train['Utterance']]
Then, only use cleaned_texts
for anything driving word-vector lookups, including your calculation of the longest text.
Other notes:
- You probably don't need to create your own embeddings_index dict-like object: the Word2Vec model already offers a dict-like interface, returning a word-vector per lookup key, via the instance of KeyedVectors in its .wv property.
- If your other libraries or hardware considerations don't require float64 values, you might just want to stick with float32-width values – that's what the Word2Vec model will train into word-vectors, they take half as much memory, and results from these kinds of models are rarely improved, and sometimes slowed, by using higher precisions.
- You could also consider creating a FastText model instead of plain Word2Vec – such a model will always return a vector, even for unknown words, synthesized from word-fragment vectors that it learns while training.