Using spaCy to remove names from a data frame in Python 3.9

I am working with the spaCy package v3.2.1 in Python 3.9 and want to understand how I can use it to remove names from a data frame. I followed the spaCy documentation and can identify names correctly, but I do not understand how to remove them. My goal is to remove all names from a specific column of the data frame.

Actual

ID	Comment
A123	I am five years old, and my name is John
X907	Today I met with Dr. Jacob

What I am trying to accomplish

ID	Comment
A123	I am five years old, and my name is
X907	Today I met with Dr.

Code:

#loading packages
import spacy
import pandas as pd
from spacy import displacy


#loading CSV
df = pd.read_csv('names.csv')

#loading spacy large model
nlp = spacy.load("en_core_web_lg")

#checking/testing if spacy large is identifying named entities
df['test_col'] = df['Comment'].apply(lambda x: list(nlp(x).ents))

What my code does

ID	Comment	test_col
A123	I am five years old, and my name is John	[(John)]
X907	Today I met with Dr. Jacob	[(Jacob)]

But how do I go about removing those names from the Comment column? I think I need some sort of function that iterates over each row of the data frame and removes the identified entities. I would appreciate your help.

Thank you

CodePudding user response：

Here's an idea using the string replace method:

EDIT: Stripping parens off to see if that helps.

df['test_col'] = df['Comment'].apply(lambda x: str(x).replace(str(nlp(x).ents).strip('(),'), ''))

I cast the values to str to help with the match, since I am not sure whether they are already strings. You may need to use an index and loop over the entities if multiple names are found in a single comment, but that's the gist of it.
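The loop mentioned above could look roughly like the sketch below. The helper name remove_substrings is hypothetical, and the list of names is hard-coded here; in practice it would presumably be built from the entities, for example [ent.text for ent in nlp(text).ents]. Note that str.replace removes every occurrence of the string, including inside longer words, so this is a rough approach rather than a precise one.

```python
def remove_substrings(text, names):
    """Remove every occurrence of each name from text (hypothetical helper)."""
    for name in names:
        # str.replace removes all occurrences, even inside longer words
        text = text.replace(name, '')
    return text


# Names are hard-coded for illustration; with spaCy they would come
# from the entity texts of the processed comment.
cleaned = remove_substrings("Today John met Jacob", ["John", "Jacob"])
print(repr(cleaned))
```

Because it matches on text rather than position, this version cannot tell a detected name apart from the same characters elsewhere in the comment.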

CodePudding user response：

You can use

import spacy
import pandas as pd

# Test dataframe
df = pd.DataFrame({'ID':['A123','X907'], 'Comment':['I am five years old, and my name is John', 'Today I met with Dr. Jacob']})

# Initialize the model
nlp = spacy.load('en_core_web_trf')

def remove_names(text):
    doc = nlp(text)
    newString = text
    for e in reversed(doc.ents):
        if e.label_ == "PERSON": # Only if the entity is a PERSON
            newString = newString[:e.start_char] + newString[e.start_char + len(e.text):]
    return newString

df['Comment'] = df['Comment'].apply(remove_names)
print(df.to_string())

Output:

     ID                               Comment
0  A123  I am five years old, and my name is
1  X907                 Today I met with Dr.
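The reason remove_names walks the entities in reverse is that cutting a span out of the string shifts every character offset to its right. A minimal sketch of that idea without spaCy, where the (start, end) tuples are hypothetical stand-ins for each entity's start_char and end_char:

```python
def remove_spans(text, spans):
    """Delete each (start, end) character span from text (hypothetical helper)."""
    # Work from right to left so that removing one span does not
    # shift the offsets of the spans still waiting to be removed.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return text


sample = "John met Jacob today"
# Offsets are written by hand here: "John" is 0-4 and "Jacob" is 9-14.
spans = [(0, 4), (9, 14)]
print(repr(remove_spans(sample, spans)))
```

Going left to right with the same offsets would delete the wrong characters for the second span, because the string is already four characters shorter by then.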