Python Sentence Transformer - Get matching Sentence by index order

I have a database table with a lot of records, and I am comparing a sentence against them to find the best match.

Let's say the table contains 4 columns: id, sentence, info, and updated_date. The data looks like this:

id	sentence	info	updated_date
1	What is the name of your company	some distinct info	19/12/2022
2	Company Name	some distinct info	18/12/2022
3	What is the name of your company	some distinct info	17/12/2022
4	What is the name of your company	some distinct info	16/12/2022
5	What is the name of your company	some distinct info	15/12/2022
6	What is the name of your company	some distinct info	14/12/2022
7	What is the name of your company	some distinct info	13/12/2022
8	What is the phone number of your company	some distinct info	12/12/2022
9	What is the name of your company	some distinct info	11/12/2022
10	What is the name of your company	some distinct info	10/12/2022

I have converted these sentences to tensors.

As an example, I am passing "What is the name of your company" (as a tensor) as the sentence to match.

import numpy as np
import torch
from sentence_transformers import util

# `sentence` holds the embedding (tensor) of "What is the name of your company"
cos_scores = util.pytorch_cos_sim(sentence, all_sentences_tensors)[0]

top_results = torch.topk(cos_scores, k=5)
or
top_results = np.argpartition(-cos_scores.cpu().numpy(), range(5))[0:5]

top_results does not return the top results in index order.
Because the sentences are identical, they all have a score of 1, so the tied results come back in arbitrary order.

What I want is the top 5 matches ordered by the latest updated_date, or by index order.

Is this possible to achieve?

Any suggestions?

Answer:

What I would do is as follows:

1. Get the cosine similarity score for each sentence and store the scores in an array.
2. Sort the indices by score, highest first, breaking ties by updated_date (newest first) or by index.
3. Take the first 5 indices from the sorted order.
4. Get the corresponding sentences from the database table using those indices.

This should give you the top 5 matches in latest updated_date or index order. Something like this should work:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed inputs (not defined here): `sentences` (list of str) and
# `updated_dates` (list of datetime.date), both in table order,
# plus `query`, the sentence to match.

# Create a CountVectorizer object to transform the sentences into vectors
vectorizer = CountVectorizer()

# Transform the sentences and the query into vectors with the same vocabulary
vectors = vectorizer.fit_transform(sentences)
query_vector = vectorizer.transform([query])

# Cosine similarity between the query and every sentence, as a 1-dimensional array
cosine_scores = cosine_similarity(query_vector, vectors).flatten()

# Sort by score (highest first); ties are broken by updated_date (newest first).
# np.lexsort treats the LAST key as the primary one; rounding guards against
# floating-point noise between scores that should be equal.
date_ordinals = np.array([d.toordinal() for d in updated_dates])
sorted_indices = np.lexsort((-date_ordinals, -np.round(cosine_scores, 6)))

# Get the top 5 indices from the sorted order
top_5_indices = sorted_indices[:5]

# Get the corresponding sentences from the database table using the indices
top_5_sentences = [sentences[i] for i in top_5_indices]
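
The ordering asked for in the question (best score first, then latest updated_date, then index order) can also be sketched in plain Python, without scikit-learn. This is only a minimal sketch under assumptions: the function name top_k_latest_first, the ndigits rounding, and the score values are made up for illustration, and a tensor such as cos_scores would first need converting to a list with cos_scores.tolist().

```python
from datetime import date


def top_k_latest_first(scores, updated_dates, k=5, ndigits=6):
    """Return the positions of the k best scores.

    Ties in score (after rounding to `ndigits`) are broken by the newest
    updated_date first, then by the lowest position in the table.
    """
    def sort_key(i):
        # Negate score and date so that an ascending sort puts the
        # highest score and the newest date first.
        return (-round(scores[i], ndigits), -updated_dates[i].toordinal(), i)

    return sorted(range(len(scores)), key=sort_key)[:k]


# Made-up scores for the ten rows in the question (illustrative values only):
# rows 2 and 8 are near matches, every other row is an exact match.
scores = [1.0, 0.62, 1.0, 1.0, 1.0, 1.0, 1.0, 0.71, 1.0, 1.0]
# updated_date column: 19/12/2022 for id 1 down to 10/12/2022 for id 10.
updated_dates = [date(2022, 12, 19 - i) for i in range(10)]

top_positions = top_k_latest_first(scores, updated_dates, k=5)
top_ids = [p + 1 for p in top_positions]  # ids in the table start at 1
print(top_ids)  # [1, 3, 4, 5, 6]
```

Putting the position last in the sort key means that rows with the same score and the same date still come back in index order, so the result is deterministic.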