How to do string semantic matching using gensim in Python?


How can we determine, in Python, whether a string has a semantic relation to a given phrase?

Example:

our phrase is:

'Fruit and Vegetables'

and the strings we want to check for a semantic relation are:

'I have an apple in my basket', 'I have a car in my house'

result:

As we know, the first item, 'I have an apple in my basket', is related to our phrase.

CodePudding user response:

You can use the gensim library to implement MatchSemantic and write it as a function like this (full code below):

Initialization


  1. Install gensim and numpy:
pip install numpy
pip install gensim
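
To confirm the installation worked, a quick sanity check (the version numbers printed will be whatever your environment installed):

import gensim
import numpy

# Should print the installed versions without raising ImportError
print("gensim", gensim.__version__)
print("numpy", numpy.__version__)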

Code


  1. First of all, import the requirements:
from re import sub
import numpy as np
from gensim.utils import simple_preprocess
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity
  2. Use this function to check whether the strings and sentences match the phrase you want:
def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']

    # SoftCosineSimilarity needs more than one document, so pad with an empty one
    if len(documents) == 1:
        documents.append('')

    def preprocess(doc):
        # Tokenize and clean up the input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary and the TF-IDF model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    query_tf = tfidf[dictionary.doc2bow(query)]

    # Index the documents and compute the soft cosine similarity against the query
    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)

    return index[query_tf]

Attention: the first time you run the code, a progress bar will go from 0% to 100% while gensim downloads glove-wiki-gigaword-50; after that the model is cached locally and you can simply run the code.
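
If you prefer to fetch the model once up front (for example in a setup script), here is a minimal sketch using gensim's downloader API, assuming the default cache location under ~/gensim-data:

import gensim.downloader as api

# Downloads on first call, then loads from the local cache (~/gensim-data by default)
glove = api.load("glove-wiki-gigaword-50")

# Quick sanity check that the embeddings behave as expected
print(glove.most_similar("fruit", topn=3))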

Usage


For example, we want to see whether 'Fruit and Vegetables' matches any of the sentences or items inside documents.

Test:

query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']
print(MatchSemantic(query_string, documents))

The first item, 'I have an apple in my basket', has a semantic relation with 'Fruit and Vegetables', so its score will be about 0.189; no relation is found for the second item, so its score will be 0.

output:

0.189    # I have an apple in my basket
0.000    # I have a car in my house
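
In practice you usually want a yes/no decision rather than raw scores. Here is a minimal sketch that applies a cutoff; the 0.1 threshold is a hypothetical value you would tune for your own data:

query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']

scores = MatchSemantic(query_string, documents)

THRESHOLD = 0.1  # hypothetical cutoff, tune for your data
for doc, score in zip(documents, scores):
    verdict = 'related' if score >= THRESHOLD else 'not related'
    print(f"{score:.3f}  {verdict}  {doc}")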