I have the following code and would like to collect its results in a pandas DataFrame instead of printing them.
import torch
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('all-MiniLM-L6-v2')
corpus = ['About us. About Us · Our Coffees · Starbucks Stories & News · Starbucks® Ready to Drink · Foodservice Coffee · Customer Service · Tax Strategy 2022 · Careers.',
'Costa is the Nation Favourite coffee shop and the largest and fastest growing coffee shop chain in the UK.',
'Leading UK speciality coffee roaster with a focus on sustainability. B Corp certified. Become a wholesale partner or buy coffee beans online today.',
'Kick-start your morning with our amazing range of speciality coffee and equipment. World-class coffee, direct from the farmer, delivered free every time.',
'Coffee Direct - Freshly roasted coffee beans delivered to your door. Origin coffee, coffee blends and flavoured coffee for bean-to-cup',
'Whether you prefer whole coffee beans or freshly ground coffee, Whittard of Chelsea selection of light, medium and dark roast luxury coffees has something',
'Coffee beans are the seeds of a fruit called a coffee cherry. Coffee cherries grow on coffee trees from a genus of plants called Coffea.',
'On these coffee plants, bunches of cherries grow and inside these you will find two coffee beans, Arabica and Robusta coffee.',
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
# Query sentences:
queries = ['coffee', 'coffee near me', 'coffee bean', 'coffee house', 'coffee jelly','coffee order nyt crossword clue','coffee quotes', 'coffee shops near me']
# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in snippet:")
    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
Here is an excerpt from the current output:
======================
Query: coffee
Top 5 most similar sentences in snippet:
Coffee Direct - Freshly roasted coffee beans delivered to your door. Origin coffee, coffee blends and flavoured coffee for bean-to-cup (Score: 0.6477)
Whether you prefer whole coffee beans or freshly ground coffee, Whittard of Chelsea selection of light, medium and dark roast luxury coffees has something (Score: 0.5873)
Kick-start your morning with our amazing range of speciality coffee and equipment. World-class coffee, direct from the farmer, delivered free every time. (Score: 0.5739)
Coffee beans are the seeds of a fruit called a coffee cherry. Coffee cherries grow on coffee trees from a genus of plants called Coffea. (Score: 0.4985)
Costa is the Nation Favourite coffee shop and the largest and fastest growing coffee shop chain in the UK. (Score: 0.4374)
======================
Instead, I'm looking at converting the output into something like
query | corpus | score |
---|---|---|
--- | --- | --- |
CodePudding user response:
You can save all the results in a Python list and then convert it to a pandas DataFrame. There are plenty of examples of this pattern available.
Your code should look something like this:
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('all-MiniLM-L6-v2')
corpus = [
'About us. About Us · Our Coffees · Starbucks Stories & News · Starbucks® Ready to Drink · Foodservice Coffee · Customer Service · Tax Strategy 2022 · Careers.',
'Costa is the Nation Favourite coffee shop and the largest and fastest growing coffee shop chain in the UK.',
'Leading UK speciality coffee roaster with a focus on sustainability. B Corp certified. Become a wholesale partner or buy coffee beans online today.',
'Kick-start your morning with our amazing range of speciality coffee and equipment. World-class coffee, direct from the farmer, delivered free every time.',
'Coffee Direct - Freshly roasted coffee beans delivered to your door. Origin coffee, coffee blends and flavoured coffee for bean-to-cup',
'Whether you prefer whole coffee beans or freshly ground coffee, Whittard of Chelsea selection of light, medium and dark roast luxury coffees has something',
'Coffee beans are the seeds of a fruit called a coffee cherry. Coffee cherries grow on coffee trees from a genus of plants called Coffea.',
'On these coffee plants, bunches of cherries grow and inside these you will find two coffee beans, Arabica and Robusta coffee.',
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
# Query sentences:
queries = ['coffee', 'coffee near me', 'coffee bean', 'coffee house', 'coffee jelly', 'coffee order nyt crossword clue',
'coffee quotes', 'coffee shops near me']
# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
query_result = list() # <=== New code
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in snippet:")
    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        # score is a 0-d tensor; .item() stores it as a plain Python float
        query_result.append([query, corpus[idx], score.item()])  # <=== New code
# Build the DataFrame once, after all queries have been processed
df = pd.DataFrame(query_result, columns=['query', 'corpus', 'score'])  # <=== New code
CodePudding user response:
If you have control over the code that generates the output, building a list or dict and converting it to a DataFrame is clearly the best approach. However, you can also convert directly from a string to a DataFrame.
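As a rough sketch of the string route, assuming the captured text has exactly the layout printed by the loop in the question (a "Query:" line followed by result lines ending in "(Score: x.xxxx)"), a regular expression can recover the rows. The names parse_printed_results, to_dataframe and sample_output are made up for this illustration, and the sample text is abridged from the excerpt in the question.

```python
import re

# One printed result line looks like: "<sentence> (Score: 0.1234)"
RESULT_LINE = re.compile(r"^(?P<corpus>.+?)\s*\(Score: (?P<score>-?\d+\.\d+)\)$")


def parse_printed_results(text):
    """Turn the printed search output into a list of row dicts."""
    rows = []
    current_query = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Query:"):
            # Remember which query the following result lines belong to
            current_query = line[len("Query:"):].strip()
            continue
        match = RESULT_LINE.match(line)
        if match and current_query is not None:
            rows.append({
                "query": current_query,
                "corpus": match.group("corpus"),
                "score": float(match.group("score")),
            })
    return rows


def to_dataframe(rows):
    """Build the query | corpus | score table from the parsed rows."""
    # Imported here so the parser itself has no third-party dependency
    import pandas as pd
    return pd.DataFrame(rows, columns=["query", "corpus", "score"])


# Abridged sample of the printed output (assumed layout)
sample_output = """
======================

Query: coffee

Top 5 most similar sentences in snippet:
Coffee Direct - Freshly roasted coffee beans delivered to your door. Origin coffee, coffee blends and flavoured coffee for bean-to-cup (Score: 0.6477)
Costa is the Nation Favourite coffee shop and the largest and fastest growing coffee shop chain in the UK. (Score: 0.4374)
"""

rows = parse_printed_results(sample_output)
```

Calling to_dataframe(rows) then gives one row per (query, sentence) pair. This approach breaks as soon as the print format changes, which is why collecting the values inside the loop remains the more robust option.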