Output to a pandas dataframe

Time:06-18

I am extracting quotes from text with textacy as follows; the output is shown in the triple-quoted block after the code:

import textacy

data = [
        ("\"Hello, nice to meet you,\" said John. Jane said, \"It is nice to meet you as well.\"", {"url": "example1.com", "date": "Jan 1"}),
        ("\"Hello, nice to meet you,\" said John", {"url": "example2.com", "date": "Jan 2"}),
        ]

for record in data:
    doc = textacy.make_spacy_doc(record, lang="en_core_web_sm")
    print(list(textacy.extract.triples.direct_quotations(doc)))
 
'''
[DQTriple(speaker=[John], cue=[said], content="Hello, nice to meet you,"), DQTriple(speaker=[Jane], cue=[said], content="It is nice to meet you as well.")]
[DQTriple(speaker=[John], cue=[said], content="Hello, nice to meet you,")]
'''

My goal is to convert the output into a pandas dataframe along with the metadata from the original dataset. Specifically, I would like it to look like this:

import pandas as pd

output = {"url": ["example1.com", "example1.com", "example2.com"],
          "date": ["Jan 1", "Jan 1", "Jan 2"],
          "speaker": ["John", "Jane", "John"],
          "cue": ["said", "said", "said"],
          "content": ["Hello, nice to meet you", "It is nice to meet you as well", "Hello, nice to meet you"]}

df = pd.DataFrame(output)

print(df)

'''
            url   date speaker   cue                         content
0  example1.com  Jan 1    John  said         Hello, nice to meet you
1  example1.com  Jan 1    Jane  said  It is nice to meet you as well
2  example2.com  Jan 2    John  said         Hello, nice to meet you

'''

Is there an efficient way to do this?

CodePudding user response:

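In your case, collect each record's quotations in a list, then explode that list so every quotation gets its own row: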

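import pandas as pd
import textacy

quotes = []
for record in data:
    doc = textacy.make_spacy_doc(record, lang="en_core_web_sm")
    # One list of DQTriple results per record.
    quotes.append(list(textacy.extract.triples.direct_quotations(doc)))

# explode() gives each triple its own row; apply(pd.Series) splits it into columns.
out = pd.Series(quotes).explode().apply(pd.Series)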
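
The snippet above keeps only the triples, so the url and date from the original records are not carried along. One possible way to include them, sketched here under some assumptions, is to build one dictionary per quotation and merge the record's metadata into it. The names Quote, as_text and quotes_to_frame are hypothetical helpers made up for this illustration, and Quote is only a stand-in for textacy's DQTriple so that the sketch runs without spaCy. Stripping the quotation marks and trailing punctuation is an assumption based on the desired output in the question.

```python
from collections import namedtuple

import pandas as pd

# Stand-in for textacy's DQTriple so the sketch runs without spaCy installed.
Quote = namedtuple("Quote", ["speaker", "cue", "content"])


def as_text(value):
    """Turn a token, span, plain string, or list of those into one string."""
    if isinstance(value, (list, tuple)):
        return " ".join(as_text(v) for v in value)
    # spaCy tokens and spans expose .text; plain strings are used as-is.
    return getattr(value, "text", value)


def quotes_to_frame(extracted):
    """Build one row per quotation from (metadata, triples) pairs."""
    rows = []
    for meta, triples in extracted:
        for triple in triples:
            rows.append({
                **meta,
                "speaker": as_text(triple.speaker),
                "cue": as_text(triple.cue),
                # Drop surrounding quotation marks and trailing punctuation.
                "content": as_text(triple.content).strip('"').rstrip(",."),
            })
    return pd.DataFrame(rows, columns=["url", "date", "speaker", "cue", "content"])


# Hand-written stand-in data mirroring the extraction output in the question.
extracted = [
    ({"url": "example1.com", "date": "Jan 1"},
     [Quote(["John"], ["said"], '"Hello, nice to meet you,"'),
      Quote(["Jane"], ["said"], '"It is nice to meet you as well."')]),
    ({"url": "example2.com", "date": "Jan 2"},
     [Quote(["John"], ["said"], '"Hello, nice to meet you,"')]),
]

df = quotes_to_frame(extracted)
print(df)
```

With the real data, the (metadata, triples) pairs could presumably be produced by looping over `for text, meta in data`, calling `textacy.make_spacy_doc(text, lang="en_core_web_sm")`, and pairing `meta` with `list(textacy.extract.triples.direct_quotations(doc))`. Building plain dictionaries first and creating the DataFrame once at the end avoids growing a DataFrame row by row.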