Python Sentence Transformer - Get matching Sentence by index order

Time:12-22

I have a database table with a lot of records, and I am comparing a sentence against them to find the best match.

Let's say the table contains 4 columns: id, sentence, info, updated_date. The data looks like this:

id  sentence                                  info                updated_date
1   What is the name of your company          some distinct info  19/12/2022
2   Company Name                              some distinct info  18/12/2022
3   What is the name of your company          some distinct info  17/12/2022
4   What is the name of your company          some distinct info  16/12/2022
5   What is the name of your company          some distinct info  15/12/2022
6   What is the name of your company          some distinct info  14/12/2022
7   What is the name of your company          some distinct info  13/12/2022
8   What is the phone number of your company  some distinct info  12/12/2022
9   What is the name of your company          some distinct info  11/12/2022
10  What is the name of your company          some distinct info  10/12/2022

I have encoded these sentences as tensors.

As an example, I pass the sentence "What is the name of your company" (also encoded as a tensor) as the query to match.

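import numpy as np
import torch
from sentence_transformers import util

sentence = "What is the name of your company"  # already encoded as a tensor
cos_scores = util.pytorch_cos_sim(sentence, all_sentences_tensors)[0]

top_results = torch.topk(cos_scores, k=5)
# or (scores are negated because argpartition orders from smallest to largest)
top_results = np.argpartition(-cos_scores.cpu().numpy(), range(5))[0:5]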

top_results does not come back in index order.
Because the matching sentences are identical, they all get a score of 1, and the order among them is arbitrary.

What I want is the top 5 matches ordered by the latest updated_date, or failing that by index.

Is this possible to achieve? Any suggestions?

CodePudding user response:

What I would do is as follows:

  1. Get the cosine similarity score of the query against each sentence and store the scores in an array.
  2. Sort the indices by score in descending order, breaking ties by updated_date (latest first) or by index.
  3. Take the first 5 indices from the sorted array.
  4. Get the corresponding sentences from the database table using those indices.

This should give you the top 5 matches, with tied scores in latest updated_date or index order. Something like this should work:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# sentences: list of the sentences from the table, in id order
# query: the sentence to match against them
query = "What is the name of your company"

# Create a CountVectorizer object to transform the sentences into vectors
vectorizer = CountVectorizer()

# Transform the sentences and the query into vectors with the same vocabulary
vectors = vectorizer.fit_transform(sentences)
query_vector = vectorizer.transform([query])

# Calculate the cosine similarity of the query against every sentence
# and flatten the 1 x N result to a 1-dimensional numpy array
cosine_scores = cosine_similarity(query_vector, vectors).flatten()

# Sort the indices by descending score; a stable sort keeps tied scores in index order
sorted_indices = np.argsort(-cosine_scores, kind="stable")

# Get the top 5 indices from the sorted array
top_5_indices = sorted_indices[:5]

# Get the corresponding sentences from the database table using the indices
top_5_sentences = [sentences[i] for i in top_5_indices]
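
If the tie-break should follow updated_date rather than index order, one option is a two-key sort with np.lexsort. The following is only a minimal sketch under some assumptions: the helper name top_k_latest_first, the decimals parameter and the sample scores are made up for illustration, the dates are assumed to be dd/mm/YYYY strings as in the table, and the scores are assumed to be a plain 1-D array (a torch tensor could be converted first, for example with .cpu().numpy()). Rounding is there because embedding scores for identical sentences can differ by floating-point noise, which would otherwise hide the tie.

```python
from datetime import datetime

import numpy as np


def top_k_latest_first(scores, updated_dates, k=5, decimals=6):
    """Return indices of the k best scores; ties go to the newest date, then the lowest index."""
    # Round so that scores differing only by floating-point noise count as ties.
    rounded = np.round(np.asarray(scores, dtype=float), decimals)
    # Convert dd/mm/YYYY strings to ordinals so they can be compared numerically.
    ordinals = np.array(
        [datetime.strptime(d, "%d/%m/%Y").toordinal() for d in updated_dates]
    )
    # np.lexsort uses the LAST key as the primary key; negating gives descending order.
    # lexsort is stable, so rows tied on both keys stay in index order.
    order = np.lexsort((-ordinals, -rounded))
    return order[:k]


# Illustrative scores for the ten rows in the question (made up, not real model output).
scores = [1.0, 0.62, 1.0, 0.99999997, 1.0, 1.0, 1.0, 0.81, 1.0, 1.0]
dates = [f"{day}/12/2022" for day in range(19, 9, -1)]  # 19/12/2022 down to 10/12/2022

top_5 = top_k_latest_first(scores, dates)
print(top_5.tolist())  # [0, 2, 3, 4, 5]
```

The returned positions refer to the order in which the rows were loaded, so they can be mapped back to the id column to fetch the matching records.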