I have a Pandas data frame that contains a column of numbers and a column of text. The text contains sentences. Some sentences have nation names.
I have managed to use spaCy to extract the nation names from the text column. What I cannot work out is how to create a third column that lines up, row for row, with the nation names and holds some placeholder text, such as 'NA', for rows that do not contain any nation name.
In short, my desired output is a data frame that looks like this:
Numbers | Text | Nation |
---|---|---|
10 | 'I went to Mexico' | 'mexico' |
20 | 'He ate a sandwich' | 'NA' |
30 | 'I went to Canada' | 'canada' |
I have been able to extract the nation names as a list, and I can later add that list to the data frame, but the list only contains found nations. I need to line up the names of the nations to the sentences that have them and line up some trivial text to the sentences that do not contain nation names. Here is my code so far:
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")
data = [[10, 'I went to Mexico'], [20, 'He ate a sandwich'], [30, 'I went to Canada']]
data = pd.DataFrame(data, columns = ['Numbers', 'Text'])
data['Text'] = data['Text'].astype(str).str.lower()
countries = []
for i in data['Text']:
    doc = nlp(i)
    for ent in doc.ents:
        countries.append(ent.text)
countries
['america', 'canada']
Any help would be appreciated.
CodePudding user response:
Using pycountry:
import pandas as pd
import pycountry
data = [[10, 'I went to Mexico'], [20, 'He ate a sandwich'], [30, 'I went to Canada']]
df = pd.DataFrame(data=data, columns=['Numbers', 'Text'])
countries = [x.name for x in pycountry.countries]
df["Nation"] = df["Text"].str.split(" ").apply(lambda x: ",".join([i for i in x if i in countries])).replace("", "NA")
print(df)
Output:
   Numbers               Text  Nation
0       10   I went to Mexico  Mexico
1       20  He ate a sandwich      NA
2       30   I went to Canada  Canada
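Note that splitting on spaces only finds single-word names written with exactly the same capitalisation as in pycountry, so lowercased text or a name like "United States" would come back as 'NA'. Below is a rough sketch of one way around that, using a case-insensitive regular expression. The short `country_names` list is a stand-in I made up for `[x.name for x in pycountry.countries]`, and `find_nations` is a hypothetical helper name, not part of either library.

```python
import re

# Assumption: shortened stand-in for [x.name for x in pycountry.countries].
country_names = ["Mexico", "Canada", "United States"]

# Try longer names first so a longer name wins over any shorter name it contains.
ordered = sorted(country_names, key=len, reverse=True)
pattern = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in ordered) + r")\b",
    flags=re.IGNORECASE,
)

def find_nations(text, default="NA"):
    """Return the matched names (lowercased, comma-separated) or the default."""
    matches = pattern.findall(text)
    return ",".join(m.lower() for m in matches) if matches else default

texts = [
    "I went to Mexico",
    "He ate a sandwich",
    "She flew from canada to the United States",
]
nations = [find_nations(t) for t in texts]
print(nations)  # ['mexico', 'NA', 'canada,united states']
```

With the data frame above, this could be applied as df["Nation"] = df["Text"].apply(find_nations).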
CodePudding user response:
You can try:
data['Nation'] = [s if (s := ''.join([ent.text for ent in nlp(i).ents])) else 'NA'
                  for i in data['Text']]
NB: this requires Python 3.8 or later, since it uses the assignment expression operator (:=).
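If you are on an older Python, or just find the one-liner hard to read, the same idea can be spelled out as a loop that appends exactly one value per row, which is what keeps the new column aligned with the text. This is only a sketch: `label_rows`, `toy_extract` and `KNOWN_NATIONS` are hypothetical names, and `toy_extract` is a simple stand-in for the spaCy call so that the snippet runs without a model installed.

```python
def label_rows(texts, extract, default="NA"):
    """Return one label per input text so the result lines up row for row."""
    labels = []
    for text in texts:
        found = extract(text)  # list of names found in this row (may be empty)
        labels.append(", ".join(found) if found else default)
    return labels

# Assumption: toy extractor standing in for the spaCy entity lookup.
KNOWN_NATIONS = {"mexico", "canada"}

def toy_extract(text):
    return [word for word in text.lower().split() if word in KNOWN_NATIONS]

texts = ["I went to Mexico", "He ate a sandwich", "I went to Canada"]
nations = label_rows(texts, toy_extract)
print(nations)  # ['mexico', 'NA', 'canada']
```

With spaCy, the extractor could be something like lambda t: [ent.text for ent in nlp(t).ents if ent.label_ == "GPE"], and the result assigned with data['Nation'] = label_rows(data['Text'], ...). Filtering on the GPE label is my own addition to skip entities that are not places; whether the model tags lowercased names at all is something you would need to check on your data.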