Regex - how to factor in 'unless preceded by certain phrases'

I'm filtering a column with a regex that checks whether any phrase from a list occurs in the text field:

phrase = ["email was deleted", "click on link"]  # etc.
df['text'].str.contains(r'\b(?:{})\b'.format('|'.join(sorted(phrase, key=len, reverse=True))), case=False, regex=True)

Now I'd like to add a condition that excludes any match immediately preceded by one of a list of phrases/words:

neg_phrases = ["did not", "not", "no"]

So I would expect a row with "Steve told Mary the email was deleted" anywhere in the text to be in the output, but a row with "Steve told Mary no email was deleted" should not be. I'm having trouble working out how to add the negative lookbehind.

CodePudding user response：

Assuming there are no spacing issues in your strings (no double spaces, and all spaces are regular \x20 spaces), you can use

pattern = r'\b(?<!{} )(?:{})\b'.format(' )(?<!'.join(neg_phrases),'|'.join(sorted(phrase, key=len, reverse=True)))


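The resulting \b(?<!did not )(?<!not )(?<!no )(?:email was deleted|click on link)\b pattern matches email was deleted or click on link only when the phrase is not immediately preceded by did not, not or no plus a single space. Each negative phrase gets its own lookbehind because Python's re module only accepts fixed-width lookbehinds, so alternatives of different lengths cannot share one.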
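To check the behaviour outside pandas, here is a minimal sketch that builds the same pattern and runs it against the two sentences from the question. It uses plain re.search with re.IGNORECASE as a stand-in for str.contains(..., case=False); the lookbehinds helper variable and the samples list are only illustrative names, and the lookbehinds are assembled with a generator instead of the join trick above, which produces the same string.

```python
import re

phrase = ["email was deleted", "click on link"]
neg_phrases = ["did not", "not", "no"]

# One fixed-width lookbehind per negative phrase, each ending in a literal space.
lookbehinds = "".join("(?<!{} )".format(p) for p in neg_phrases)
# Longest phrases first, as in the question.
alternation = "|".join(sorted(phrase, key=len, reverse=True))
pattern = r"\b{}(?:{})\b".format(lookbehinds, alternation)

samples = [
    "Steve told Mary the email was deleted",
    "Steve told Mary no email was deleted",
]
matches = [bool(re.search(pattern, s, flags=re.IGNORECASE)) for s in samples]

print(pattern)
print(matches)  # [True, False]
```

The same pattern string can then be passed to df['text'].str.contains(pattern, case=False, regex=True) as in the question.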

You may also replace a literal space with \s to match any whitespace:

pattern = r'\b(?<!{}\s)(?:{})\b'.format(r'\s)(?<!'.join(neg_phrases),'|'.join(sorted(phrase, key=len, reverse=True)))
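The difference between the two variants shows up when the separator is not a plain space. The sketch below is an illustration only: the tab-separated sample sentence is made up, and space_pattern and ws_pattern are hypothetical names for the literal-space and \s versions.

```python
import re

phrase = ["email was deleted", "click on link"]
neg_phrases = ["did not", "not", "no"]
alternation = "|".join(sorted(phrase, key=len, reverse=True))

# Variant 1: each lookbehind ends in a literal space.
space_pattern = r"\b{}(?:{})\b".format(
    "".join("(?<!{} )".format(p) for p in neg_phrases), alternation
)
# Variant 2: each lookbehind ends in \s (any single whitespace character).
ws_pattern = r"\b{}(?:{})\b".format(
    "".join(r"(?<!{}\s)".format(p) for p in neg_phrases), alternation
)

# Made-up sample where "no" is followed by a tab instead of a space.
text = "Steve told Mary no\temail was deleted"

space_hit = bool(re.search(space_pattern, text, flags=re.IGNORECASE))
ws_hit = bool(re.search(ws_pattern, text, flags=re.IGNORECASE))

print(space_hit)  # True: "no\t" is not "no " so the row is not excluded
print(ws_hit)     # False: \s also covers the tab
```

Note that \s still matches exactly one whitespace character, so doubled spaces between the negative word and the phrase are not covered by either variant.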

If your phrases can contain special characters, they need to be escaped with re.escape: replace sorted(phrase, key=len, reverse=True) with sorted(map(re.escape, phrase), key=len, reverse=True). Also replace the \b word boundaries with adaptive word boundaries, which only require a boundary when the phrase actually starts or ends with a word character:

pattern = r'(?!\B\w)(?<!{}\s)(?:{})(?<!\w\B)'.format(r'\s)(?<!'.join(map(re.escape, neg_phrases)),'|'.join(sorted(map(re.escape, phrase), key=len, reverse=True)))
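As a rough illustration of why the adaptive boundaries matter, the sketch below compares them with plain \b on a phrase that ends in a non-word character. The phrase "click on link (here)" and both sample sentences are invented for this example, and plain and adaptive are hypothetical variable names.

```python
import re

# Hypothetical phrase list: the second entry ends in ")", a non-word character.
phrase = ["email was deleted", "click on link (here)"]
neg_phrases = ["did not", "not", "no"]

lookbehinds = "".join(r"(?<!{}\s)".format(re.escape(p)) for p in neg_phrases)
alternation = "|".join(sorted(map(re.escape, phrase), key=len, reverse=True))

# Plain \b on both sides: the trailing \b demands a word character next to ")".
plain = r"\b{}(?:{})\b".format(lookbehinds, alternation)
# Adaptive boundaries: a boundary is only enforced next to a word character.
adaptive = r"(?!\B\w){}(?:{})(?<!\w\B)".format(lookbehinds, alternation)

samples = [
    "Please click on link (here) today",
    "Do not click on link (here) today",
]
plain_hits = [bool(re.search(plain, s, flags=re.IGNORECASE)) for s in samples]
adaptive_hits = [bool(re.search(adaptive, s, flags=re.IGNORECASE)) for s in samples]

print(plain_hits)     # [False, False]: \b after ")" fails before a space
print(adaptive_hits)  # [True, False]: the negated row is still excluded
```

With plain \b the first sentence is missed because there is no word boundary between ")" and the following space; the adaptive form accepts it while the lookbehinds still reject the sentence containing "not".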