Using spaCy to remove names from a data frame in Python 3.9

Time:03-04

I am working with the spaCy package v3.2.1 in Python 3.9 and want to understand how I can use it to remove names from a data frame. I followed the spaCy documentation and am able to identify names correctly, but I do not understand how to remove them. My goal is to remove all names from a specific column of the data frame.

Actual

ID Comment
A123 I am five years old, and my name is John
X907 Today I met with Dr. Jacob

What I am trying to accomplish

ID Comment
A123 I am five years old, and my name is
X907 Today I met with Dr.

Code:

#loading packages
import spacy
import pandas as pd
from spacy import displacy


#loading CSV
df = pd.read_csv('names.csv')

#loading spacy large model
nlp = spacy.load("en_core_web_lg")

#checking/testing if spacy large is identifying named entities
df['test_col'] = df['Comment'].apply(lambda x: list(nlp(x).ents)) 

What my code does

ID Comment test_col
A123 I am five years old, and my name is John [(John)]
X907 Today I met with Dr. Jacob [(Jacob)]

But how do I go about removing those names from the Comment column? I think I need some sort of function that iterates over each row of the data frame and removes the identified entities. I would appreciate your help.

Thank you

CodePudding user response:

Here's an idea using the string replace method:

EDIT: Stripping the parentheses (and the trailing comma of a one-element tuple) to see if that helps.

df['test_col'] = df['Comment'].apply(lambda x: str(x).replace(str(nlp(x).ents).strip('(),'), ''))

I cast the values to str to help with the match, since I am not sure whether they are already strings. This only works when a comment contains a single entity; you will need to index into the entities and loop if multiple names are found in a single comment, but that's the gist of it.
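
The loop mentioned above could look something like the following sketch. It assumes each entity has been reduced to a (start, end) pair of character offsets (with spaCy these would come from ent.start_char and ent.end_char); the helper name remove_spans and the sample sentence are made up for illustration.

```python
def remove_spans(text, spans):
    """Return text with every (start, end) character span cut out.

    Hypothetical helper: spans are assumed not to overlap, which is
    the case for the entities in spaCy's doc.ents.
    """
    # Work from the end of the string so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return text


# With spaCy, the spans would be built from the entities, e.g.
#   spans = [(e.start_char, e.end_char) for e in nlp(comment).ents]
comment = "Today John met with Dr. Jacob"
spans = [(6, 10), (24, 29)]  # offsets of "John" and "Jacob", typed by hand
cleaned = remove_spans(comment, spans)
```

Cutting by offset instead of calling str.replace avoids accidentally deleting the same letters where they appear inside another word.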

CodePudding user response:

You can use

import spacy
import pandas as pd

# Test dataframe
df = pd.DataFrame({'ID':['A123','X907'], 'Comment':['I am five years old, and my name is John', 'Today I met with Dr. Jacob']})

# Initialize the model
nlp = spacy.load('en_core_web_trf')

def remove_names(text):
    doc = nlp(text)
    newString = text
    for e in reversed(doc.ents):
        if e.label_ == "PERSON": # Only if the entity is a PERSON
            newString = newString[:e.start_char] + newString[e.start_char + len(e.text):]
    return newString

df['Comment'] = df['Comment'].apply(remove_names)
print(df.to_string())

Output:

     ID                               Comment
0  A123  I am five years old, and my name is
1  X907                 Today I met with Dr.
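
For a larger data frame, a variation on the function above might process the whole column in one pass. This is only a sketch: remove_person_names is a hypothetical name, nlp is assumed to be an already loaded spaCy pipeline with an NER component, and the whitespace cleanup at the end is an extra step that is not part of the answer above.

```python
import re


def remove_person_names(texts, nlp):
    """Strip PERSON entities from each string in texts.

    nlp is assumed to be a loaded spaCy pipeline; only nlp.pipe,
    doc.text, doc.ents, ent.label_, ent.start_char and ent.end_char
    are used here.
    """
    cleaned = []
    # nlp.pipe processes the texts as a stream, which is usually
    # faster than calling nlp() once per row.
    for doc in nlp.pipe(texts):
        text = doc.text
        # Go backwards so earlier character offsets stay valid.
        for ent in reversed(doc.ents):
            if ent.label_ == "PERSON":
                text = text[:ent.start_char] + text[ent.end_char:]
        # Collapse whitespace runs left behind and trim the ends.
        cleaned.append(re.sub(r"\s{2,}", " ", text).strip())
    return cleaned


# Assumed usage with the data frame from the answer above:
#   df['Comment'] = remove_person_names(df['Comment'].tolist(), nlp)
```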