How can we determine, in Python, whether a string has a semantic relation to a given phrase?
Example:
our phrase is:
'Fruit and Vegetables'
and the strings we want to check for a semantic relation are:
'I have an apple in my basket', 'I have a car in my house'
result:
as we know, the first item, I have an apple in my basket,
has a relation to our phrase.
CodePudding user response:
You can use the gensim library to implement a MatchSemantic function like the one below (see the full code here):
Initialization
- install gensim and numpy:
pip install numpy
pip install gensim
Code
- first, import the requirements
from re import sub
import numpy as np
from gensim.utils import simple_preprocess
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity
- use this function to check if the strings and sentences match the phrase you want.
def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']

    # SoftCosineSimilarity needs at least two documents in the index
    if len(documents) == 1:
        documents.append('')

    def preprocess(doc):
        # Tokenize and clean up the input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary and the TF-IDF model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    # Compute the soft cosine similarity between the query and every document
    query_tf = tfidf[dictionary.doc2bow(query)]
    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)
    return index[query_tf]
Attention:
the first time you run the code, a progress bar will go from 0% to 100% while gensim downloads the glove-wiki-gigaword-50 model; after that the model is cached and you can simply run the code.
Usage
for example, we want to see if Fruit and Vegetables
matches any of the sentences or items inside documents
Test:
query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']
MatchSemantic(query_string, documents)
so we know that the first item, I have an apple in my basket,
has a semantic relation with Fruit and Vegetables,
so its score is about 0.189,
while no relation is found for the second item, so its score is 0.
output:
0.189 # I have an apple in my basket
0.000 # I have a car in my house
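Since MatchSemantic returns raw similarity scores rather than a yes/no answer, one way to get the "has a relation or not" decision the question asks for is to compare each score against a cutoff. A minimal sketch, assuming a hand-picked threshold (the 0.1 value and the is_related name are illustrative, not part of the original answer, and the cutoff should be tuned on your own data):

```python
def is_related(scores, threshold=0.1):
    """Map soft-cosine similarity scores to True/False relation flags.

    threshold is an assumed cutoff; tune it for your own data.
    """
    return [float(score) > threshold for score in scores]

# Scores from the example above: apple/basket vs. car/house
print(is_related([0.189, 0.000]))  # [True, False]
```

With the example scores, only the first document clears the threshold, matching the expected result for 'Fruit and Vegetables'.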