I am extracting information about chemical elements from Wikipedia. It contains sentences, and I want each sentence to be added as follows:
Molecule | Sentence1 | Sentence1 and sentence2 | All_sentence |
---|---|---|---|
MgO | this is s1. | this is s1. this is s2. | all_sentence |
CaO | this is s1. | this is s1. this is s2. | all_sentence |
What I've achieved so far
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
sumary = page_py.summary[0:]
nlp = spacy.load('en_core_web_sm')
text_sentences = nlp(sumary)
sent_list = []
for sentence in text_sentences.sents:
sent_list.append(sentence.text)
#print(sent_list)
df = pd.DataFrame(
{'Molecule': chemical,
'Description': sent_list})
print(df.head())
The output looks like:
Molecule | Description |
---|---|
MgO | All sentences are here |
Mgo |
The Molecule columns are shown repeatedly for each line of sentence which is not correct. Please suggest some solution
CodePudding user response:
It's not clear why you would want to repeat all sentences in each column but you can get to the form you want with pivot
:
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
sumary = page_py.summary[0:]
nlp = spacy.load('en_core_web_sm')
sent_list = [sent.text for sent in nlp(sumary).sents]
#cumul_sent_list = [' '.join(sent_list[:i]) for i in range(1, len(sent_list) 1)] # uncomment to cumulate sentences in columns
df = pd.DataFrame(
{'Molecule': chemical,
'Description': sent_list}) # replace sent_list with cumul_sent_list if cumulating
df["Sentences"] = pd.Series([f"Sentence{i 1}" for i in range(len(df))]) # replace "Sentence{i 1}" with "Sentence1-{i 1}" if cumulating
df = df.pivot(index="Molecule", columns="Sentences", values="Description")
print(df)
sent_list
can be created using a list comprehension. Create cumul_sent_list
if you want your sentences to be repeated in columns.
Output:
Sentences Sentence1 ... Sentence9
Molecule ...
MgO Magnesium oxide (MgO), or magnesia, is a white... ... According to evolutionary crystal structure pr...