I have a column in a dataframe like this.
Text
"Lorum Ipsum Rotterdam dolor sit."
"ed ut perspiciatis Boekarest, New York, consectetur adipiscing elit, sed "
"Excepteur sint occaecat Glasgow cupidatat non proident, sunt in culpa"
I want every geographical location to be replaced by "GPE".
I am using spaCy to detect the entities. This works fine, as shown below.
import spacy
nlp = spacy.load('en_core_web_lg')
for value in df['text']:
    doc = nlp(value)
    for ent in doc.ents:
        print(ent.text, ent.label_)
Output:
Rotterdam GPE
Boekarest GPE
New York GPE
Glasgow GPE
I tried the code below to replace the city names in the column, but it doesn't work.
for value in df['text']:
    doc = nlp(value)
    for ent in doc.ents:
        for word in value.split():
            if ent.label_ == "GPE":
                word.replace(ent.label, "_GPE_")
Does anyone see what I am doing wrong?
Answer:
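The loop in the question cannot change the column, for a few reasons. Python strings are immutable, so `word.replace(...)` returns a new string that is immediately thrown away. Nothing is ever written back to the dataframe. Also, `ent.label` is spaCy's integer ID for the label (the string is `ent.label_`), and the text to search for is `ent.text`. The first point can be seen with plain strings; this is only an illustration without spaCy, using a sample sentence from the question:

```python
# Plain-Python illustration; no spaCy required.
text = "Lorum Ipsum Rotterdam dolor sit."

# str.replace returns a new string; the result is discarded here,
# so `text` itself is left untouched.
text.replace("Rotterdam", "GPE")
unchanged = text

# The returned string has to be assigned to be kept.
redacted = text.replace("Rotterdam", "GPE")

print(unchanged)
print(redacted)
```

The same applies to the dataframe: the new string has to be stored back in the column, for example with `apply`.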
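You can use each entity's character offsets to splice the placeholder into the text, then apply that function to the column: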
import warnings
import pandas as pd
import spacy

# Ignore the warning about CUDA not being available.
warnings.filterwarnings("ignore", "User provided device_type of 'cuda', but CUDA is not available. Disabling")

df = pd.DataFrame({'Text': [
    "Lorum Ipsum Rotterdam dolor sit.",
    "ed ut perspiciatis Boekarest, New York, consectetur adipiscing elit, sed ",
    "Excepteur sint occaecat Glasgow cupidatat non proident, sunt in culpa",
]})
nlp = spacy.load('en_core_web_lg')

def redact_gpe(text):
    doc = nlp(text)
    newString = text
    # Walk the entities from last to first so earlier character offsets stay valid.
    for e in reversed(doc.ents):
        if e.label_ == "GPE":
            start = e.start_char
            end = start + len(e.text)
            # Splice the placeholder in place of the entity text.
            newString = f'{newString[:start]}GPE{newString[end:]}'
    return newString

df['Text'] = df['Text'].apply(redact_gpe)
Output:
Text
0 Lorum Ipsum GPE dolor sit.
1 ed ut perspiciatis GPE, GPE, consectetur adipiscing elit, sed
2 Excepteur sint occaecat GPE cupidatat non proident, sunt in culpa
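The reverse iteration in `redact_gpe` matters because every replacement changes the length of the string. Working from the last entity to the first keeps the offsets of the remaining entities valid. Below is a minimal sketch of that idea without spaCy. `replace_spans` is a hypothetical helper, and the offsets are found with `str.find` only to keep the example self-contained; with spaCy they would come from `ent.start_char` and `ent.end_char`.

```python
def replace_spans(text, spans, placeholder="GPE"):
    """Replace each (start, end) character span in text with placeholder.

    Hypothetical helper for illustration; spans are assumed not to overlap.
    """
    # Process the rightmost span first so earlier offsets are not shifted.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


sample = "ed ut perspiciatis Boekarest, New York, consectetur"

# Stand-in for entity offsets (assumption: each name occurs once in the sample).
spans = []
for name in ("Boekarest", "New York"):
    start = sample.find(name)
    spans.append((start, start + len(name)))

result = replace_spans(sample, spans)
print(result)

# For contrast: replacing left to right with the original offsets
# corrupts the text, because the first replacement shortens the string.
forward = sample
for start, end in sorted(spans):
    forward = forward[:start] + "GPE" + forward[end:]
print(forward)
```

An alternative that avoids the ordering issue is to build the result from pieces in a single left-to-right pass, but the reverse loop is the smallest change to the code above.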