Python Create Column in Pandas Data Frame That Matches Row for Row Values Found / Not-Found in Exist

I have a Pandas data frame that contains a column of numbers and a column of text. The text contains sentences. Some sentences have nation names.

I've managed to extract the nation names from the text column with spaCy. My problem is creating a third column that lines up, row for row, with the names of the nations and that holds a placeholder, like 'NA', for rows that do not contain any nation names.

In short, my desired output is a data frame that looks like this:

Numbers	Text	Nation
10	'I went to Mexico'	'mexico'
20	'He ate a sandwich'	'NA'
30	'I went to Canada'	'canada'

I have been able to extract the nation names as a list, and I can later add that list to the data frame, but the list only contains the nations that were found, so it is shorter than the data frame. I need each nation name to line up with the sentence that contains it, and a placeholder to line up with each sentence that contains no nation name. Here is my code so far:

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

data = [[10, 'I went to Mexico'], [20, 'He ate a sandwich'], [30, 'I went to Canada']]
data = pd.DataFrame(data, columns = ['Numbers', 'Text'])
data['Text'] = data['Text'].astype(str).str.lower()
countries = []
for i in data['Text']:
  doc = nlp(i)
  for ent in doc.ents:
    countries.append(ent.text)

countries
['america', 'canada']

Any help would be appreciated.

CodePudding user response：

Using pycountry:

import pandas as pd
import pycountry


data = [[10, 'I went to Mexico'], [20, 'He ate a sandwich'], [30, 'I went to Canada']]
df = pd.DataFrame(data=data, columns=['Numbers', 'Text'])

countries = [x.name for x in pycountry.countries]
df["Nation"] = df["Text"].str.split(" ").apply(lambda x: ",".join([i for i in x if i in countries])).replace("", "NA")
print(df)

Output:

   Numbers               Text  Nation
0       10   I went to Mexico  Mexico
1       20  He ate a sandwich      NA
2       30   I went to Canada  Canada
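
One caveat with splitting on spaces: only single-word names that match exactly can be found, so 'United States' or a lowercased 'mexico' would come back as 'NA'. Below is a minimal sketch of one way around that, assuming a regular-expression match against the name list is acceptable. The `names` list is a short hypothetical stand-in for the pycountry list above, and the third sentence is made up for illustration.

```python
import re
import pandas as pd

# Hypothetical stand-in for [x.name for x in pycountry.countries]
names = ["Mexico", "Canada", "United States"]

df = pd.DataFrame(
    [[10, "I went to Mexico"],
     [20, "He ate a sandwich"],
     [30, "she lives in the united states"]],
    columns=["Numbers", "Text"],
)

# Build one alternation pattern. Longer names go first so multi-word names
# are preferred, and \b stops a name from matching inside a longer word.
ordered = sorted(names, key=len, reverse=True)
pattern = r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"

# str.findall returns one list per row, so the result stays row-aligned.
found = df["Text"].str.findall(pattern, flags=re.IGNORECASE)

# Join each row's matches; rows with no match become "" and then "NA".
df["Nation"] = found.str.join(",").replace("", "NA")
print(df)
```

The matched text is returned as it appears in the sentence, so casing follows the input rather than the name list.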

CodePudding user response：

You can try:

data['Nation'] = [s if (s := ', '.join([ent.text for ent in nlp(i).ents])) else 'NA'
                  for i in data['Text']]

NB. Requires Python 3.8 or later (for the := assignment expression).
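
On older Python versions, or if only place names should count, the same row-for-row idea can be written as a small helper passed to apply. This is a sketch rather than a drop-in: `nation_or_na` and `toy_extract` are hypothetical names, and `toy_extract` stands in for the spaCy call so the snippet runs without a model installed.

```python
import pandas as pd

def nation_or_na(text, extract, placeholder="NA"):
    """Return the names found in text joined by ', ', or placeholder if none."""
    names = extract(text)
    return ", ".join(names) if names else placeholder

# Toy extractor standing in for spaCy (an assumption for this sketch).
# With a loaded model one could instead pass
#   lambda t: [ent.text for ent in nlp(t).ents if ent.label_ == "GPE"]
# to keep only geopolitical entities (countries, cities, states).
def toy_extract(text):
    known = {"mexico", "canada"}
    return [w for w in text.lower().split() if w in known]

data = pd.DataFrame(
    [[10, "I went to Mexico"],
     [20, "He ate a sandwich"],
     [30, "I went to Canada"]],
    columns=["Numbers", "Text"],
)

# apply calls the helper once per row, so every row gets exactly one value
# and the new column always has the same length as the frame.
data["Nation"] = data["Text"].apply(nation_or_na, extract=toy_extract)
print(data)
```

Because the helper returns a value for every row, found or not, there is no shorter list to line up with the data frame afterwards.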