I trained a word2vec model on the list of all product names from a grocery store. Then I built the vocabulary from the common phrases & words in this list:
from gensim.models.phrases import Phrases, Phraser
from gensim.models import Word2Vec
phrases = Phrases(product_list, min_count=30, progress_per=10000)
bigram = Phraser(phrases)
common_texts = bigram[product_list]
w2v_model = Word2Vec(min_count=5,
                     window=2,
                     size=300,
                     sample=6e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=20,
                     workers=1,
                     sorted_vocab=1)
w2v_model.build_vocab(sentences=common_texts, progress_per=10000)
w2v_model.train(sentences=common_texts, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
My overall goal is to map products to categories. I created a function which takes products of type Product and categories of type Category as parameters. It finds the similarity between each pair of product name and category name, and assigns the category with the highest similarity to the product.
def assign_category(products, categories):
    for product in products:
        max_similarity = 0
        for category in categories:
            current_similarity = w2v_model.wv.similarity(Product.get_name(product), Category.get_name(category))
            if current_similarity > max_similarity:
                max_similarity = current_similarity
                product.set_category(category)
                product.set_match_rate(max_similarity)
My problem is that the function w2v_model.wv.similarity does not accept words or phrases that are not in the model's vocabulary. So, when I test the function with the following product name and category
product = Product("salty potato chips with cheese and oninon")
category = Category("potato chips")
I get the error
word 'salty potato chips with cheese and oninon' not in vocabulary
Examples of words in the vocabulary: chicken, milk, ice cream, potato chips, etc.
Note: When I build the vocabulary with the whole list of product names instead of the common phrases & words, the vocabulary of course still doesn't extend far enough to include every product name. I understand that there is no way for the vocabulary to include every single product name. But since my model is good at finding similar products/categories when they are in the vocabulary (e.g. "bread" and "baguette"), I would like to find a way to use this model to reach my goal.
How can I use the Word2Vec model and check the similarity of any two strings, without the limitations of the vocabulary? If this is not possible, what would you recommend for me to reach my goal?
CodePudding user response:
I'm not sure if Word2Vec is the best option in this case, but I could be wrong. A text-classification approach seems more suitable, given that you have a list of products and the respective categories they fall under.
However, to answer your question as asked: you could simply filter out tokens that aren't in the model's vocabulary (OOV tokens) before computing the similarity.
# pip install gensim==4.2.0
from gensim.models import Word2Vec
from typing import Text, List

def preprocess_sentence(tokens: Text) -> List[Text]:
    """
    preprocesses a given sentence and returns its tokens
    """
    token_list = [token.lower() for token in tokens.split()]
    # apply other preprocessing steps here
    return token_list

def filter_tokens(tokens: Text) -> List[Text]:
    """
    generates a list of tokens and removes any token that
    is not present in the w2v_model's vocabulary
    """
    return [token for token in preprocess_sentence(tokens) if token in vocabulary]

def assign_categories(product_list: List[Text], category_list: List[Text]) -> None:
    """
    loops through the lists of products and categories and
    dumps the assigned category details into a dict
    """
    product_categories = dict()
    for product in product_list:
        filtered_product_tokens = filter_tokens(product)
        product_categories.update({product: {'category': None, 'similarity': 0}})
        if not filtered_product_tokens:
            continue
        for category in category_list:
            filtered_category_tokens = filter_tokens(category)
            if not filtered_category_tokens:
                continue
            similarity = w2v_model.wv.n_similarity(filtered_product_tokens, filtered_category_tokens)
            if similarity > product_categories[product]['similarity']:
                product_categories[product] = {'category': category, 'similarity': similarity}
    for product, details in product_categories.items():
        print(f"Product: {product}\nCategory: {details['category']}\n")

# lists of products and categories
products = ["strawberry milk", "milk powder", "vAnilla yoghurt", "mozzarella cheese", "happy cow cheese"]
categories = ["milk", "yoghurt", "cheese"]

# preprocessing
list_of_product_tokens = [preprocess_sentence(product) for product in products]
list_of_category_tokens = [preprocess_sentence(category) for category in categories]
list_of_tokens = list_of_product_tokens + list_of_category_tokens

# model building and training
w2v_model = Word2Vec(sentences=list_of_tokens, min_count=1)
vocabulary = w2v_model.wv.index_to_key

# assigning categories
products.append("pineapple juice")
assign_categories(
    product_list=products,
    category_list=categories
)
output:
Product: strawberry milk
Category: milk
Product: milk powder
Category: milk
Product: vAnilla yoghurt
Category: yoghurt
Product: mozzarella cheese
Category: cheese
Product: happy cow cheese
Category: cheese
Product: pineapple juice
Category: None
Note that instead of objects, I used lists to store products and categories. Also, I did not use phrases; using phrases might slightly improve the results I obtained, but I couldn't try that.
CodePudding user response:
First & foremost: a word2vec model that doesn't know a word can't give you a vector for that word. So you should expect errors when passing a model words it wasn't trained to know. You can't use a hammer to tighten a screw; you can't use a word2vec model to tell you vectors for words that are, as far as it knows, nonsense.
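For example, a minimal check (assuming w2v_model is the trained model from the question) shows the difference between a known and an unknown key:
phrase = "salty potato chips with cheese and oninon"
if phrase in w2v_model.wv:            # membership test against the learned vocabulary
    vec = w2v_model.wv[phrase]        # only valid for keys the model actually learned
else:
    print(f"'{phrase}' not in vocabulary")   # a direct lookup would raise KeyError here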
Second: it looks like you're trying to get vectors for multiword strings from your Word2Vec model. Such models also don't have any inherent idea of what the vector for a multiple-word string should be. They only learn individual words during training, so you can only look up individual words directly. In some cases, it may make sense to consider the vector for a run of words to be the average of all the individual words' vectors. You could try that.
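For instance, a minimal sketch of that averaging idea (assuming w2v_model is the trained model from the question; average_vector is just an illustrative helper name):
import numpy as np

def average_vector(text, model):
    # keep only the individual words the model actually knows,
    # then average their vectors; return None if nothing is known
    known = [w for w in text.lower().split() if w in model.wv]
    if not known:
        return None
    return np.mean([model.wv[w] for w in known], axis=0)

product_vec = average_vector("salty potato chips with cheese and onion", w2v_model)
category_vec = average_vector("potato chips", w2v_model)
if product_vec is not None and category_vec is not None:
    # cosine similarity between the two averaged vectors
    similarity = np.dot(product_vec, category_vec) / (
        np.linalg.norm(product_vec) * np.linalg.norm(category_vec))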
But also: if you've added a Phrases step, you've actually modified your training data to combine some pairs of words into bigrams, based on statistical rules of thumb. For example, the 3 tokens ['salty', 'potato', 'chips'] might become (depending on the actual relative prevalences in your corpus) ['salty', 'potato_chips']. So before looking up vectors for post-training texts, you should apply the same bigram-combination, using the same trained Phraser from before, for self-consistency (a small sketch follows below). Really, though: using Phrases is a more-advanced technique with other tradeoffs & considerations, so I'd recommend not using it unless/until you've gotten much further with more-basic approaches. Only then, if you have a reasonable theory that it might help, and a way to evaluate whether it's helping or not, add it back in.
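For completeness, a rough sketch of re-applying the trained Phraser before any lookups (assuming bigram and w2v_model are the objects from the question; which tokens actually get merged depends on your corpus statistics):
tokens = "salty potato chips with cheese and onion".lower().split()
phrased = bigram[tokens]                            # e.g. could yield ['salty', 'potato_chips', ...]
known = [t for t in phrased if t in w2v_model.wv]   # keep only tokens the model knows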
More generally & foundationally, though: The task you are actually trying to achieve is "classification": taking items of unknown category/label/class, then recommending a best-fit category/label/class, from a fixed known set. There are many, many ways to attempt that, which vary by (among other things) how items are quantified into 'features' that might give hints as to their category (feature extraction), & what rules are possible to deduce categories (choice & parameterization of algorithms), & how existing knowledge – typically known examples of items labeled with correct classes – gets collected & trained into the chosen models.
You've improvised an ad-hoc classification system based on the idea of summarizing your categories as single vectors, then summarizing new products as single vectors, then assigning each product to the category with the nearest vector.
That's a common idea, and an intuitive approach – but also very simplistic compared to the full range of trainable classification algorithms available in free/basic libraries (like scikit-learn). It's both highly sensitive to, and limited by, whatever processes turn prior knowledge into singular category vectors, and new products into singular product vectors. It resembles in some ways a "K Nearest Neighbors" classifier, but even more constrained: it's limited to finding the one nearest known example, and it reduces that one known example to a single vector (rather than a set of all prior known members of that category).
So if your true aim is to perform this classification well, I'd suggest putting your current effort aside, and working through some introductory Python machine-learning 'classification' examples – ideally both some where the items to be classified already have quantitative features, and some where the items are texts that need to become various kinds of vectors (most often as a baseline 'bags-of-words'). That will upgrade your general approach, and in the end, you might not even wind up using word2vec at all. (It's just one way to help turn texts into quantitative features for classification algorithms – often helpful for specific situations where a fuzzy sense of texts is valuable, but also often overkill when just starting out.)
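As an illustration only (not the poster's approach), a bare-bones scikit-learn baseline of that kind might look like the following, assuming you already have some products labeled with their correct categories:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical labeled examples; replace with your own known product -> category pairs
train_products = ["strawberry milk", "milk powder", "mozzarella cheese", "vanilla yoghurt"]
train_labels = ["milk", "milk", "cheese", "yoghurt"]

# bag-of-words features feeding a simple linear classifier
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_products, train_labels)

print(clf.predict(["happy cow cheese", "salty potato chips with cheese and onion"]))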
If your aim is instead to just learn & play with word2vec, you could continue your existing approach but:
- keep in mind the points above about how the model only knows words, won't have any ability to do things with your multi-word strings, but could give you each word's vector individually (which you might then want to average together or do other things with)
- don't necessarily trust whatever examples you've already been copying from – use of Phrases seems premature, and your peculiar model parameters alpha=0.03, min_alpha=0.0007 tend to appear in bad online tutorials & unthinking recopying-without-understanding. (Most users, & especially beginners, never need to specify such non-default values at all.)
- take a look at the word2vec variant FastText as well: it can offer guess-vectors for words it's never seen, constructed from word-fragments, and thus often better than nothing when the new words are variant forms (typos/changed-tenses/shared-word-roots) of other known words. (Still, this is far from the main issue with your approach – so really only when everything else is working well, and you see a problem with almost-but-not-quite-known words in your post-training data, would you want to consider adding FastText for some residual improvement on corner cases. A minimal sketch follows below.)
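If you do reach that point, a minimal FastText sketch with gensim 4.x (illustrative only; common_texts stands for the same tokenized corpus used for Word2Vec in the question):
from gensim.models import FastText

ft_model = FastText(vector_size=100, window=2, min_count=5)
ft_model.build_vocab(corpus_iterable=common_texts)
ft_model.train(corpus_iterable=common_texts, total_examples=ft_model.corpus_count, epochs=30)

# FastText builds vectors from character n-grams, so it can synthesize
# a guess-vector even for a token it never saw during training (e.g. a typo)
vector_for_typo = ft_model.wv["oninon"]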