I have a database table with a lot of records, and I am comparing sentences to find the best match.
Let's say the table contains 4 columns: id, sentence, info, updated_date. The data looks like this:
id | sentence | info | updated_date |
---|---|---|---|
1 | What is the name of your company | some distinct info | 19/12/2022 |
2 | Company Name | some distinct info | 18/12/2022 |
3 | What is the name of your company | some distinct info | 17/12/2022 |
4 | What is the name of your company | some distinct info | 16/12/2022 |
5 | What is the name of your company | some distinct info | 15/12/2022 |
6 | What is the name of your company | some distinct info | 14/12/2022 |
7 | What is the name of your company | some distinct info | 13/12/2022 |
8 | What is the phone number of your company | some distinct info | 12/12/2022 |
9 | What is the name of your company | some distinct info | 11/12/2022 |
10 | What is the name of your company | some distinct info | 10/12/2022 |
I have converted these sentences to tensors.
As an example, I am passing the sentence "What is the name of your company" (as a tensor) to match against them.
sentence = "What is the name of your company" # in tensor format
cos_scores = util.pytorch_cos_sim(sentence, all_sentences_tensors)[0]
top_results = torch.topk(cos_scores, k=5)
or
top_results = np.argpartition(cos_scores, range(5))[0:5]
top_results does not return the tied results in index order.
Because the sentences are identical, they all get a score of 1, and the results come back in arbitrary order.
What I want is the top 5 matches ordered by the latest updated_date, or by index order.
Is this possible to achieve?
Any suggestions?
CodePudding user response:
What I would do is as follows:
- Get the cosine similarity score between the query and each sentence and store the scores in an array.
- Sort the indices by score in descending order, breaking ties by updated_date (most recent first).
- Get the top 5 indices from the sorted array.
- Get the corresponding sentences from the database table using the indices.

This should give you the top 5 matches, with ties ordered by the latest updated_date. Something like this should work:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Assumed inputs, all in table row order: sentences (list of str),
# updated_dates (list of datetime.date), and query (the sentence to match)
# Create a CountVectorizer object to transform the sentences into vectors
vectorizer = CountVectorizer()
# Fit on the table sentences, then transform the query with the same vocabulary
vectors = vectorizer.fit_transform(sentences)
query_vector = vectorizer.transform([query])
# Calculate the cosine similarity between the query and every sentence (1-D array)
cosine_scores = cosine_similarity(query_vector, vectors)[0]
# Round so that near-identical scores compare as exact ties
cosine_scores = np.round(cosine_scores, 6)
# Sort by score (descending), breaking ties by updated_date (most recent first);
# np.lexsort uses the LAST key as the primary sort key
date_ordinals = np.array([d.toordinal() for d in updated_dates])
sorted_indices = np.lexsort((-date_ordinals, -cosine_scores))
# Get the top 5 indices from the sorted array
top_5_indices = sorted_indices[:5]
# Get the corresponding sentences from the database table using the indices
top_5_sentences = [sentences[i] for i in top_5_indices]
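
If it helps, here is a small self-contained sketch of the tie-breaking idea, using the rows from the question's table. It is only an illustration under some assumptions: `bow_cosine` is a hypothetical word-count stand-in for the real sentence embeddings, `top_k` is a helper name I made up, and the dates are assumed to be in `DD/MM/YYYY` format. The date tie-break uses `np.lexsort`, and the index-order tie-break uses a stable `np.argsort`.

```python
from collections import Counter
from datetime import datetime
import math

import numpy as np

# Rows copied from the question's table: (id, sentence, updated_date)
SAME = "What is the name of your company"
rows = [
    (1, SAME, "19/12/2022"),
    (2, "Company Name", "18/12/2022"),
    (3, SAME, "17/12/2022"),
    (4, SAME, "16/12/2022"),
    (5, SAME, "15/12/2022"),
    (6, SAME, "14/12/2022"),
    (7, SAME, "13/12/2022"),
    (8, "What is the phone number of your company", "12/12/2022"),
    (9, SAME, "11/12/2022"),
    (10, SAME, "10/12/2022"),
]


def bow_cosine(a, b):
    """Cosine similarity of word-count vectors (stand-in for real embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def top_k(scores, dates, k=5, tie_break="date"):
    """Return the indices of the k best scores with a deterministic tie-break."""
    # Round so that scores differing only by float noise count as ties
    scores = np.round(np.asarray(scores, dtype=float), 6)
    if tie_break == "date":
        # lexsort treats the LAST key as primary: score desc, then date desc
        ordinals = np.array([d.toordinal() for d in dates])
        order = np.lexsort((-ordinals, -scores))
    else:
        # A stable sort keeps tied scores in their original index order
        order = np.argsort(-scores, kind="stable")
    return order[:k]


query = "What is the name of your company"
scores = [bow_cosine(query, s) for _, s, _ in rows]
dates = [datetime.strptime(d, "%d/%m/%Y").date() for _, _, d in rows]
ids = [r[0] for r in rows]

by_date = [ids[i] for i in top_k(scores, dates, 5, "date")]
by_index = [ids[i] for i in top_k(scores, dates, 5, "index")]
print(by_date)   # [1, 3, 4, 5, 6]
print(by_index)  # [1, 3, 4, 5, 6]
```

In this particular table the dates happen to decrease with the id, so both tie-breaks give the same rows. With real embeddings you would replace `scores` with your `cos_scores` converted to a NumPy array; the sorting part stays the same.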