I am working with spaCy v3.2.1 in Python 3.9 and want to understand how I can use it to remove names from a data frame. Following the spaCy documentation, I am able to identify names correctly, but I don't understand how to remove them. My goal is to remove all names from a specific column of the data frame.
Actual
ID | Comment |
---|---|
A123 | I am five years old, and my name is John |
X907 | Today I met with Dr. Jacob |
What I am trying to accomplish
ID | Comment |
---|---|
A123 | I am five years old, and my name is |
X907 | Today I met with Dr. |
Code:
#loading packages
import spacy
import pandas as pd
from spacy import displacy
#loading CSV
df = pd.read_csv('names.csv')
#loading spacy large model
nlp = spacy.load("en_core_web_lg")
#checking/testing whether the spacy large model is identifying named entities
df['test_col'] = df['Comment'].apply(lambda x: list(nlp(x).ents))
What my code does
ID | Comment | test_col |
---|---|---|
A123 | I am five years old, and my name is John | [(John)] |
X907 | Today I met with Dr. Jacob | [(Jacob)] |
But how do I go from identifying those names to removing them from the Comment column? I think I need some sort of function that iterates over each row of the data frame and removes the identified entities. I would appreciate your help.
Thank you
CodePudding user response:
Here's an idea using the string replace method:
EDIT: Stripping the parentheses (and the trailing comma of a one-element tuple) off to see if that helps.
df['test_col'] = df['Comment'].apply(lambda x: str(x).replace(str(nlp(x).ents).strip('(),'), ''))
I cast the values to str to help with the match, since I'm not sure whether they are already strings. This only works when a single name is found in a comment; if there are multiple names, you will need to loop over the entities and replace each one, but that's the gist of it.
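The loop mentioned above for comments with several names could look roughly like the sketch below. `remove_substrings` is a hypothetical helper name, not part of spaCy or pandas, and it takes plain strings so it can be tried without loading a model. Note that `str.replace` removes every occurrence of each string, including occurrences inside longer words.

```python
def remove_substrings(text, names):
    """Remove every occurrence of each string in names from text.

    Hypothetical helper for illustration; names is any iterable of strings.
    """
    for name in names:
        text = text.replace(name, "")
    return text


# Plain-string example, no spaCy model needed
print(repr(remove_substrings("John met Jacob", ["John", "Jacob"])))
```

With spaCy, the list of names could be built from the entities, for example `df['Comment'].apply(lambda x: remove_substrings(x, [e.text for e in nlp(x).ents]))`, assuming `nlp` and `df` are defined as in the question.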
CodePudding user response:
You can use
import spacy
import pandas as pd
# Test dataframe
df = pd.DataFrame({'ID':['A123','X907'], 'Comment':['I am five years old, and my name is John', 'Today I met with Dr. Jacob']})
# Initialize the model
nlp = spacy.load('en_core_web_trf')
def remove_names(text):
    doc = nlp(text)
    newString = text
    # Go through the entities in reverse so earlier character offsets stay valid
    for e in reversed(doc.ents):
        if e.label_ == "PERSON": # Only if the entity is a PERSON
            newString = newString[:e.start_char] + newString[e.start_char + len(e.text):]
    return newString
df['Comment'] = df['Comment'].apply(remove_names)
print(df.to_string())
Output:
ID Comment
0 A123 I am five years old, and my name is
1 X907 Today I met with Dr.
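To see why the entities are processed in reverse, here is a small sketch of the same slicing idea on its own, separated from spaCy. `remove_spans` is a hypothetical helper that takes `(start, end)` character offsets and assumes the spans do not overlap. Deleting from the end of the string first means the offsets of the spans still to be removed do not shift.

```python
def remove_spans(text, spans):
    """Delete each (start, end) character span from text.

    Hypothetical helper for illustration; spans must not overlap.
    """
    # Work from the last span to the first so earlier offsets stay valid
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return text


sample = "Today I met with Dr. Jacob"
start = sample.index("Jacob")
print(repr(remove_spans(sample, [(start, start + len("Jacob"))])))
```

With spaCy, the spans could be collected as `[(e.start_char, e.end_char) for e in doc.ents if e.label_ == "PERSON"]`. As in the output above, the space before a removed name is left in place, so you may want to call `.strip()` on the result.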