How to count the number of nouns with spaCy in a dataframe column?

I have a dataframe like this (as an example):

text
I left the country.
Andrew is from America and he loves apples.

I want to add a new column, number of nouns, in which spaCy counts the tokens tagged with the NOUN part-of-speech (POS) tag. How do I do that in Python?

import pandas as pd
import spacy

# the dataframe

# NLP Spacy with POS tags
nlp = spacy.load("en_core_web_sm")

My question is: how do I apply nlp to the "text" column, check whether each token's POS is NOUN, count those tokens, and store the count as a feature?

Thanks!

CodePudding user response：

First, create a demo dataframe:

import spacy
import pandas as pd
nlp = spacy.load("en_core_web_sm")
df = pd.DataFrame([["I left the country"],["Andrew is from America and he loves apples."]],columns=["text"])

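It looks like this:

                                          text
0                           I left the country
1  Andrew is from America and he loves apples.

Then apply nlp to each row and collect the results: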

m = []  # collects the text of every token tagged NOUN
for text in df['text']:  # works for any number of rows in the dataframe
  doc = nlp(text)  # apply nlp to each row of the text column
  for token in doc:
    if token.pos_ == 'NOUN':  # doc.noun_chunks would also return pronouns and proper nouns
      m.append(token.text)
print(m)
print(len(m))  # total number of nouns across all rows, not a per-row count
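
The loop above produces a single total for the whole column, while the question asks for one count per row. A minimal sketch of a per-row version follows. The helper name noun_counts and its parameters are made up for illustration, and nlp is assumed to be a loaded spaCy pipeline such as the one created with spacy.load("en_core_web_sm") above. It uses nlp.pipe, which processes an iterable of texts and yields the parsed docs in input order.

```python
def noun_counts(texts, nlp, pos="NOUN"):
    """Return one count per text: the number of tokens whose coarse POS tag is `pos`.

    `nlp` is assumed to be a loaded spaCy pipeline (hypothetical parameter name).
    """
    # nlp.pipe streams the texts through the pipeline and yields Docs in input order
    return [sum(token.pos_ == pos for token in doc) for doc in nlp.pipe(texts)]
```

With the demo dataframe from this answer, df['n_nouns'] = noun_counts(df['text'], nlp) should add the counts as a new column. For large columns, nlp.pipe is usually faster than calling nlp on each row separately.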

CodePudding user response：

You can use pandas apply as shown below:

import spacy
import pandas as pd
import collections

sp = spacy.load("en_core_web_sm")
df = pd.DataFrame({'text':['I left the country and city', 
                           'Andrew is from America and he loves apples and bananas']})

# >>> df
#     text
# 0   I left the country and city
# 1   Andrew is from America and he loves apples and bananas

def count_noun(x):
    res = [token.pos_ for token in sp(x)]
    return collections.Counter(res)['NOUN']

df['C_NOUN'] = df['text'].apply(count_noun)
print(df)

Output:

                                                     text     C_NOUN
0                             I left the country and city     2
1  Andrew is from America and he loves apples and bananas     2

If you want both the list of nouns and their count, you can try this:

def count_noun(x):
    nouns = [token.text for token in sp(x) if token.pos_=='NOUN']
    return [nouns, len(nouns)]

df[['list_NOUN','C_NOUN']] = pd.DataFrame(df['text'].apply(count_noun).tolist())
print(df)

Output:

                             text          list_NOUN    C_NOUN
0     I left the country and city    [country, city]    2
1   Andrew ... apples and bananas  [apples, bananas]    2
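
Note that the second row gets a count of 2 even though "Andrew" and "America" are nouns in the everyday sense. In spaCy's coarse tag set, names like these are most likely tagged PROPN (proper noun) rather than NOUN, so they are skipped by a check for 'NOUN'. If proper nouns should count as well, something along these lines may work. This is only a sketch: the helper name pos_counts is hypothetical, and it expects an already parsed doc such as sp(x).

```python
from collections import Counter


def pos_counts(doc, tags=("NOUN", "PROPN")):
    """Count the tokens of a parsed spaCy Doc for each coarse POS tag in `tags`."""
    counts = Counter(token.pos_ for token in doc)
    # A Counter returns 0 for missing keys, so tags that never occur are reported as 0
    return {tag: counts[tag] for tag in tags}
```

For a single row this would be called as pos_counts(sp(df['text'][1])), and summing the returned values gives the combined number of common and proper nouns.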