I'm working on an NLP project using spaCy. I have identified the entities in a text with spaCy's NER, and now I want to remove the ORG entities (those identified as organisations) from the original input string.
doc = "I'm here with the three of Nikkei Asia's stalwart editors, three Brits in Tokyo. First off, we have Michael Peel, who is executive editor, a journalist from our affiliate, The Financial Times . He is now in Tokyo but has previously reported from the likes of Brussels, Bangkok, Abu Dhabi and Lagos. Welcome, Michael.MICHAEL PEEL, EXECUTIVE EDITOR: Welcome Waj. Thank you very much.KHAN: All right. And we have Stephen Foley, our business editor who, like Michael, is on secondment from the FT, where he was deputy U.S. News Editor. Prior to the FT, he was a reporter at The Independent and like Michael, he's a fresh-off-the-boat arrival in Tokyo and has left some pretty big shoes to fill in the New York bureau, where we miss him. Welcome, Stephen.STEPHEN FOLEY, BUSINESS EDITOR: Thanks for having me, Waj.KHAN: Alright, and last but certainly not least, my brother in arms when it comes to cricket commentary across the high seas is Andy Sharp, or deputy editor who joined Nikkei Asia nearly four years ago, after a long stint at Bloomberg in Tokyo and other esteemed Japanese publications. Welcome, Andy.ANDREW SHARP"
text = NER(doc)
org_stopwords = [ent.text for ent in text.ents if ent.label_ == 'ORG']
Output of org_stopwords:
['The Financial Times ', 'Abu Dhabi and Lagos', 'Bloomberg ']
This is my code so far. I have built a list of everything spaCy labels as ORG, but I don't know how to remove those entries from the string. I can't simply split the string into words and drop the ones in org_stopwords, because the entries in org_stopwords are n-grams (multi-word phrases). Please help with a coded example of how to tackle this.
CodePudding user response:
Use a regex substitution instead of str.replace:
import re
org_stopwords = ['The Financial Times',
'Abu Dhabi ',
'U.S. News Editor',
'Independent',
'ANDREW']
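# Escape each phrase so that characters such as '.' in 'U.S.' match literally
regex = re.compile('|'.join(re.escape(phrase) for phrase in org_stopwords))
# Replace every match with an empty string
new_doc = regex.sub('', doc)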
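An alternative that avoids matching on the entity text at all is to cut the entities out by character offset. The sketch below is only an illustration: remove_spans is a hypothetical helper (not part of spaCy), and it assumes you pass it (start, end) pairs such as (ent.start_char, ent.end_char) collected from the ents whose label_ is 'ORG'. The sample sentence and its offsets are written by hand so the snippet runs without a spaCy model.

```python
import re

def remove_spans(text, spans):
    """Return text with every (start, end) character span cut out.

    Hypothetical helper for illustration; spans use end-exclusive offsets,
    the same convention as spaCy's ent.start_char / ent.end_char.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            # Skip a span that starts inside one that was already removed
            continue
        pieces.append(text[cursor:start])
        cursor = end
    pieces.append(text[cursor:])
    # Collapse the runs of spaces left behind where a span was removed
    return re.sub(r" {2,}", " ", "".join(pieces))

sample = "He worked at Bloomberg in Tokyo."
# (13, 22) covers "Bloomberg"; with spaCy, the list would be built as
# [(ent.start_char, ent.end_char) for ent in text.ents if ent.label_ == "ORG"]
cleaned = remove_spans(sample, [(13, 22)])
print(cleaned)
```

Because this works on offsets rather than on the phrase itself, it removes only the occurrences spaCy actually tagged, and it needs no regex escaping of the entity text.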