How do you convert a print output into a Pandas data frame in Python?

I've got the following lines of code and I'd like to collect the results in a pandas DataFrame instead of printing them.

import torch
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('all-MiniLM-L6-v2')


corpus = ['About us. About Us · Our Coffees · Starbucks Stories & News · Starbucks® Ready to Drink · Foodservice Coffee · Customer Service · Tax Strategy 2022 · Careers.',
          'Costa is the Nation Favourite coffee shop and the largest and fastest growing coffee shop chain in the UK.',
          'Leading UK speciality coffee roaster with a focus on sustainability. B Corp certified. Become a wholesale partner or buy coffee beans online today.',
          'Kick-start your morning with our amazing range of speciality coffee and equipment. World-class coffee, direct from the farmer, delivered free every time.',
          'Coffee Direct - Freshly roasted coffee beans delivered to your door. Origin coffee, coffee blends and flavoured coffee for bean-to-cup',
          'Whether you prefer whole coffee beans or freshly ground coffee, Whittard of Chelsea selection of light, medium and dark roast luxury coffees has something',
          'Coffee beans are the seeds of a fruit called a coffee cherry. Coffee cherries grow on coffee trees from a genus of plants called Coffea.',
          'On these coffee plants, bunches of cherries grow and inside these you will find two coffee beans, Arabica and Robusta coffee.',
          ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['coffee', 'coffee near me', 'coffee bean', 'coffee house', 'coffee jelly','coffee order nyt crossword clue','coffee quotes', 'coffee shops near me']


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in snippet:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

Here is an excerpt from the current output:

======================

Query: coffee

Top 5 most similar sentences in snippet:
Coffee Direct - Freshly roasted coffee beans delivered to your door. Origin coffee, coffee blends and flavoured coffee for bean-to-cup (Score: 0.6477)
Whether you prefer whole coffee beans or freshly ground coffee, Whittard of Chelsea selection of light, medium and dark roast luxury coffees has something (Score: 0.5873)
Kick-start your morning with our amazing range of speciality coffee and equipment. World-class coffee, direct from the farmer, delivered free every time. (Score: 0.5739)
Coffee beans are the seeds of a fruit called a coffee cherry. Coffee cherries grow on coffee trees from a genus of plants called Coffea. (Score: 0.4985)
Costa is the Nation Favourite coffee shop and the largest and fastest growing coffee shop chain in the UK. (Score: 0.4374)

======================

Instead, I'd like to convert the output into a table with one row per result, something like:

query	corpus	score
---	---	---

CodePudding user response：

You can save all the results in a Python list and then convert it to a pandas DataFrame. There are plenty of examples of this pattern out there.

So your code should be something like this:

import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('all-MiniLM-L6-v2')

corpus = [
    'About us. About Us · Our Coffees · Starbucks Stories & News · Starbucks® Ready to Drink · Foodservice Coffee · Customer Service · Tax Strategy 2022 · Careers.',
    'Costa is the Nation Favourite coffee shop and the largest and fastest growing coffee shop chain in the UK.',
    'Leading UK speciality coffee roaster with a focus on sustainability. B Corp certified. Become a wholesale partner or buy coffee beans online today.',
    'Kick-start your morning with our amazing range of speciality coffee and equipment. World-class coffee, direct from the farmer, delivered free every time.',
    'Coffee Direct - Freshly roasted coffee beans delivered to your door. Origin coffee, coffee blends and flavoured coffee for bean-to-cup',
    'Whether you prefer whole coffee beans or freshly ground coffee, Whittard of Chelsea selection of light, medium and dark roast luxury coffees has something',
    'Coffee beans are the seeds of a fruit called a coffee cherry. Coffee cherries grow on coffee trees from a genus of plants called Coffea.',
    'On these coffee plants, bunches of cherries grow and inside these you will find two coffee beans, Arabica and Robusta coffee.',
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['coffee', 'coffee near me', 'coffee bean', 'coffee house', 'coffee jelly', 'coffee order nyt crossword clue',
           'coffee quotes', 'coffee shops near me']

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
query_result = list()  # <=== New code
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in snippet:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
        query_result.append([query, corpus[idx], score.item()])  # <=== New code (.item() stores a plain float, not a tensor)
df = pd.DataFrame(query_result, columns=['query', 'corpus', 'score'])  # <=== New code
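
As a minimal sketch of the same collect-then-convert pattern without the model, the rows can also be gathered as dicts so that the column names travel with the data. The two rows below are copied from the output excerpt in the question. Building each dict inside the real loop from query, corpus[idx] and the score converted to a float is an assumption about how you would adapt it, not part of the original answer.

```python
import pandas as pd

# Rows copied from the output excerpt in the question. In the real loop,
# each dict would be appended once per (query, sentence, score) result.
rows = [
    {
        "query": "coffee",
        "corpus": "Coffee Direct - Freshly roasted coffee beans delivered to your door. "
                  "Origin coffee, coffee blends and flavoured coffee for bean-to-cup",
        "score": 0.6477,
    },
    {
        "query": "coffee",
        "corpus": "Costa is the Nation Favourite coffee shop and the largest and "
                  "fastest growing coffee shop chain in the UK.",
        "score": 0.4374,
    },
]

# Passing columns= fixes the column order to the layout asked for in the question.
df = pd.DataFrame(rows, columns=["query", "corpus", "score"])
print(df)
```

With dicts, a missing key shows up as NaN in that column rather than shifting values into the wrong column, which can make mistakes easier to spot than with positional lists.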

CodePudding user response：

If you have control over the code that generates the output, building a list or dict and converting it to a DataFrame is clearly the best approach. However, you can also convert directly from a string to a DataFrame.
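
A rough sketch of the string route, assuming the text has exactly the layout printed by the loop in the question (a "Query:" line, then one result per line ending in "(Score: 0.1234)"). The names parse_output and RESULT_LINE are hypothetical helpers introduced here for illustration, and the sample text reuses two lines from the excerpt above.

```python
import re

import pandas as pd

# Sample text in the format printed by the question's loop (two results shown).
output = """
======================

Query: coffee

Top 5 most similar sentences in snippet:
Coffee Direct - Freshly roasted coffee beans delivered to your door. Origin coffee, coffee blends and flavoured coffee for bean-to-cup (Score: 0.6477)
Costa is the Nation Favourite coffee shop and the largest and fastest growing coffee shop chain in the UK. (Score: 0.4374)
"""

# A result line is any text followed by " (Score: <number>)" at the end of the line.
RESULT_LINE = re.compile(r"^(?P<corpus>.+) \(Score: (?P<score>-?\d+\.\d+)\)$")


def parse_output(text):
    """Turn the printed report into a DataFrame with query, corpus and score columns."""
    rows = []
    query = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Query:"):
            # Remember the current query; following result lines belong to it.
            query = line[len("Query:"):].strip()
            continue
        match = RESULT_LINE.match(line)
        if match and query is not None:
            rows.append({
                "query": query,
                "corpus": match["corpus"],
                "score": float(match["score"]),
            })
    return pd.DataFrame(rows, columns=["query", "corpus", "score"])


df = parse_output(output)
print(df)
```

This only works as long as the printed format stays stable and no sentence spans multiple lines, which is why collecting the rows at the source is usually the safer choice.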