I'm filtering a column with a regex that checks whether any phrase from a list occurs in the text field:
phrase = ["email was deleted", "click on link", etc.]
df['text'].str.contains(r'\b(?:{})\b'.format('|'.join(sorted(phrase, key=len, reverse=True))), case=False, regex=True)
However, now I'd like to add a condition that excludes any match preceded by one of a list of phrases/words:
neg_phrases = ["did not", "not", "no"]
So I would expect a row containing "Steve told Mary the email was deleted" anywhere in the text to be in the output, but a row containing "Steve told Mary no email was deleted" should not be. I'm just having trouble working in the negative lookbehind.
CodePudding user response:
Assuming there are no whitespace issues in your strings (no double spaces, and all spaces are regular \x20 spaces), you can use
pattern = r'\b(?<!{} )(?:{})\b'.format(' )(?<!'.join(neg_phrases),'|'.join(sorted(phrase, key=len, reverse=True)))
The \b(?<!did not )(?<!not )(?<!no )(?:email was deleted|click on link)\b pattern will only match "email was deleted" or "click on link" if it is not immediately preceded by "did not", "not" or "no" followed by a space.
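As a quick check of the behaviour described above, here is a minimal sketch that uses only the standard re module. The sample sentences are illustrative (the third one is made up), and re.IGNORECASE stands in for case=False from the question.

```python
import re

phrase = ["email was deleted", "click on link"]
neg_phrases = ["did not", "not", "no"]

# One fixed-width negative lookbehind per negative phrase,
# each ending in a literal space.
pattern = r'\b(?<!{} )(?:{})\b'.format(
    ' )(?<!'.join(neg_phrases),
    '|'.join(sorted(phrase, key=len, reverse=True)),
)
rx = re.compile(pattern, re.IGNORECASE)

samples = [
    "Steve told Mary the email was deleted",   # should match
    "Steve told Mary no email was deleted",    # blocked by "no "
    "She did not click on link",               # blocked by "did not " / "not "
]
matches = [bool(rx.search(s)) for s in samples]
print(matches)
```

The same pattern string can be passed to df['text'].str.contains(pattern, case=False, regex=True) as in the question.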
You may also replace the literal space with \s to match any whitespace:
pattern = r'\b(?<!{}\s)(?:{})\b'.format(r'\s)(?<!'.join(neg_phrases),'|'.join(sorted(phrase, key=len, reverse=True)))
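To illustrate the difference, the sketch below compares the literal-space variant with the \s variant on a string where a tab separates the negative word from the phrase. The names pattern_space, pattern_ws and tabbed are made up for this example. The joiner is written as a raw string so that \s is not treated as a string escape.

```python
import re

phrase = ["email was deleted", "click on link"]
neg_phrases = ["did not", "not", "no"]
alternation = '|'.join(sorted(phrase, key=len, reverse=True))

# Literal-space variant: only a regular space after the negative word counts.
pattern_space = r'\b(?<!{} )(?:{})\b'.format(' )(?<!'.join(neg_phrases), alternation)
# \s variant: any single whitespace character after the negative word counts.
pattern_ws = r'\b(?<!{}\s)(?:{})\b'.format(r'\s)(?<!'.join(neg_phrases), alternation)

tabbed = "Steve told Mary no\temail was deleted"
space_hit = bool(re.search(pattern_space, tabbed, re.IGNORECASE))
ws_hit = bool(re.search(pattern_ws, tabbed, re.IGNORECASE))
print(space_hit, ws_hit)
```

Each lookbehind still checks exactly one whitespace character, so two consecutive spaces between the negative word and the phrase are not covered by either variant.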
If your phrases can contain special characters, they need to be escaped with re.escape: replace sorted(phrase, key=len, reverse=True) with sorted(map(re.escape, phrase), key=len, reverse=True), and replace the word boundaries with adaptive dynamic word boundaries:
pattern = r'(?!\B\w)(?<!{}\s)(?:{})(?<!\w\B)'.format(r'\s)(?<!'.join(map(re.escape, neg_phrases)),'|'.join(sorted(map(re.escape, phrase), key=len, reverse=True)))
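A small sketch of the escaped version with adaptive boundaries follows. The parenthesised phrase and the sample sentences are hypothetical, chosen only to show a phrase that starts and ends with a non-word character, where a plain \b would fail.

```python
import re

# Hypothetical phrase list: the first entry starts and ends with a special char.
phrase = ["(email was deleted)", "click on link"]
neg_phrases = ["did not", "not", "no"]

pattern_adaptive = r'(?!\B\w)(?<!{}\s)(?:{})(?<!\w\B)'.format(
    r'\s)(?<!'.join(map(re.escape, neg_phrases)),
    '|'.join(sorted(map(re.escape, phrase), key=len, reverse=True)),
)
rx = re.compile(pattern_adaptive, re.IGNORECASE)

samples = [
    "Steve said (email was deleted) yesterday",     # match despite "(" and ")"
    "Steve said no (email was deleted) yesterday",  # blocked by "no" + whitespace
    "Please click on link",                         # match as whole words
    "Please unclick on links",                      # no match inside longer words
]
adaptive_matches = [bool(rx.search(s)) for s in samples]
print(adaptive_matches)
```

The (?!\B\w) and (?<!\w\B) checks only require a word boundary when the phrase itself begins or ends with a word character, which is why the parenthesised phrase can still match.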