I am trying to specify a non-greedy operator for a character that comes right before a negative lookahead. Nevertheless, when I specify the non-greedy operator, the word is captured even if the pattern present in the negative lookahead appears in the string.
The example is here
pattern = "\bcuerpos?(?!\W*de\sseguridad)"
strings = ["el cuerpo", "cuerpo a cuerpo",
"fuerzas y cuerpos de seguridad", "fuerzas y cuerpo de seguridad"]
I want this pattern not to match the third or the fourth strings, but it does capture the third one. Why does the non-greedy operator erase the negative lookahead functionality here and how would you solve this?
CodePudding user response:
You need to move s? inside the negative lookahead to make it:
\bcuerpo(?!s?\W+de\sseguridad)
Or, if you want to match the optional s as well, use:
\bcuerpo(?!s?\W+de\sseguridad)s?
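To see why the original pattern still hits the third string, here is a small sketch in Python (the helper name matched_text is mine, purely for illustration). The s? is an ordinary optional quantifier: when the lookahead fails right after 'cuerpos', the engine backtracks, lets s? match nothing, and retries the lookahead right after 'cuerpo'. From there the next character is 's', so \W*de cannot match and the negative lookahead succeeds.

```python
import re

strings = ["el cuerpo", "cuerpo a cuerpo",
           "fuerzas y cuerpos de seguridad", "fuerzas y cuerpo de seguridad"]

# Pattern from the question: the optional 's' sits outside the lookahead.
original = re.compile(r"\bcuerpos?(?!\W*de\sseguridad)")
# Suggested fix: the optional 's' is tested inside the lookahead first.
fixed = re.compile(r"\bcuerpo(?!s?\W+de\sseguridad)s?")

def matched_text(pattern, text):
    """Return the matched substring, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

original_hits = [matched_text(original, s) for s in strings]
fixed_hits = [matched_text(fixed, s) for s in strings]

print(original_hits)  # third entry is 'cuerpo': the 's' was given back
print(fixed_hits)     # third and fourth entries are None
```

Note that in the third string the original pattern captures 'cuerpo' rather than 'cuerpos', which is the sign that backtracking dropped the 's'.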
CodePudding user response:
I think you can still get false positives in two ways:
- A line like 'cuerpo a cuerpo de seguridad' contains 'cuerpo' without what should be negated, yet it's also present in the form you'd like to exclude;
- You are currently not using any boundary at the end of 'cuerpo', meaning it could be part of 'cuerpoblah'.
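A quick way to check both cases, assuming the word-level pattern with the optional 's' moved into the lookahead and using re.search (the variable names here are only illustrative):

```python
import re

# Word-level pattern: optional 's' checked inside the negative lookahead.
p = re.compile(r"\bcuerpo(?!s?\W+de\sseguridad)s?")

# Case 1: the first 'cuerpo' is harmless, so the line as a whole still matches,
# even though the second 'cuerpo' is followed by 'de seguridad'.
mixed = p.search("cuerpo a cuerpo de seguridad")

# Case 2: no trailing word boundary, so 'cuerpo' matches inside a longer word.
glued = p.search("cuerpoblah")

print(mixed.span())    # span of the first 'cuerpo'
print(glued.group(0))
```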
To counter both the above and the optional 's' and assuming you'd also want this to be case-insensitive I came up with:
import re
strings = ["el cuerpo", "cuerpo a cuerpo",
"fuerzas y cuerpos de seguridad",
"fuerzas y cuerpo de seguridad",
"cuerpo a cuerpo de seguridad"]
p = re.compile(r'^(?!.*\bcuerpos?\b\W+de\sseguridad).*\bcuerpos?\b.*$', re.I)
new_strings = [s for s in strings if p.match(s)]
print(new_strings)
Prints:
['el cuerpo', 'cuerpo a cuerpo']
^ - Start-line anchor;
(?!.* - Open a negative lookahead and match 0+ characters up to;
\bcuerpos?\b - Match 'cuerpo' literally with an optional 's' between word boundaries (a possessive 's?+' would also work where the engine supports it, but the '+' is not needed);
\W+de\sseguridad) - Match 1+ non-word characters up to 'de', a single whitespace character before 'seguridad', and close the lookahead. Note: this could even lead to a false negative, since I don't know if you'd still want to allow a match when it's 'seguridadblah'. A trailing word boundary could fix that too;
.*\bcuerpos?\b.* - Match 0+ characters up to the same pattern used for 'cuerpo' before, and again 0+ characters;
$ - End-line anchor.
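If you do want 'seguridadblah' to stay allowed, a rough sketch of the word-boundary idea could look like this. The extra \b after 'seguridad' and the sample strings are my own assumptions, not something taken from your data:

```python
import re

# Same line-level pattern, with an assumed trailing \b after 'seguridad'
# so that only the whole word 'seguridad' triggers the exclusion.
strict = re.compile(
    r'^(?!.*\bcuerpos?\b\W+de\sseguridad\b).*\bcuerpos?\b.*$', re.I)

# Hypothetical sample lines for illustration.
samples = ["cuerpo de seguridadblah", "cuerpo de seguridad",
           "cuerpoblah", "Cuerpos de Seguridad", "el cuerpo"]

kept = [s for s in samples if strict.match(s)]
print(kept)
```

Here 'cuerpo de seguridadblah' is kept because the lookahead no longer fires on a longer word, while 'cuerpoblah' is still dropped by the boundary after 'cuerpo'.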