Regex negative lookbehind isn't working as expected - CodePudding

In Python, I used this regex

(?<!\d\d\d)(\s?lt\.?\s?blue)

on this string

ltblue
500lt.blue
4009 lt blue
lt. blue
032 lt red

I expected it to capture this

ltblue
lt. blue

but instead it captured

ltblue
lt. blue
lt blue

From how I wrote it, I don't think it should have captured the 'lt blue' after 4009, but for some reason the \s? before 'lt' doesn't seem to work. Does anyone know how I could change the regex to get the expected output?

CodePudding user response：

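The regex engine tries every way it can to match your pattern, so because \s is optional, it tries both with and without the space and keeps whichever attempt succeeds. In the case of 4009 lt blue, it matches when the group starts at the l with no leading space. The space then sits before the group, so the three characters before the match are "09 " rather than three digits, and your lookbehind is fooled.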
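To see this for yourself, here is a small sketch that runs the pattern from the question with Python's built-in re module and prints the three characters the lookbehind inspects for each match. The variable names are only illustrative.

```python
import re

text = "ltblue\n500lt.blue\n4009 lt blue\nlt. blue\n032 lt red"
original = re.compile(r"(?<!\d\d\d)(\s?lt\.?\s?blue)")

for m in original.finditer(text):
    # The lookbehind only checks the three characters just before the match start.
    before = text[max(0, m.start() - 3):m.start()]
    print(repr(m.group(1)), "preceded by", repr(before))

matches = [m.group(1) for m in original.finditer(text)]
```

For the unwanted match, the match starts at the l and the preceding characters are '09 ', which is why the negative lookbehind does not reject it.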

Since lookbehinds must have a fixed width in Python's re module, you cannot add \s? to your negative lookbehind, but you can handle the digits-plus-space case with a second lookbehind:

(?<!\d{3})(?<!\d{3}\s)(lt\.?\s?blue)
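A quick check of the two-lookbehind pattern against the sample string from the question, as a minimal sketch using only the standard re module (variable names are just for illustration):

```python
import re

text = "ltblue\n500lt.blue\n4009 lt blue\nlt. blue\n032 lt red"

# First lookbehind rejects "500lt.blue"; the second rejects "4009 lt blue",
# where a single whitespace character separates the digits from "lt".
fixed = re.compile(r"(?<!\d{3})(?<!\d{3}\s)(lt\.?\s?blue)")

result = fixed.findall(text)
print(result)  # ['ltblue', 'lt. blue']
```

One thing to keep in mind: \s also matches a newline, so three digits at the very end of the previous line would block a match too. If that matters for your data, a literal space in the second lookbehind is probably the safer choice.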

CodePudding user response：

If the numbers always appear at the beginning of each line, with nothing before them, then you can use this:

^(?![\d ]+)(lt[ .]*blue)

Demo: https://regex101.com/r/sR18Rz/1

Your pattern matched the 'lt blue' in '4009 lt blue' because the \s? matched zero whitespace characters before the l, and that l is not immediately preceded by three digits (the three characters before it are '09 ').
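The anchored approach can be tried in Python roughly like this. This is a sketch under two assumptions: the lookahead is written with a + quantifier, and re.MULTILINE is passed so that ^ matches at the start of every line rather than only at the start of the string.

```python
import re

text = "ltblue\n500lt.blue\n4009 lt blue\nlt. blue\n032 lt red"

# re.MULTILINE (assumed here) makes ^ match at the beginning of each line.
anchored = re.compile(r"^(?![\d ]+)(lt[ .]*blue)", re.MULTILINE)

anchored_result = anchored.findall(text)
print(anchored_result)  # ['ltblue', 'lt. blue']
```

Because the match has to begin at the start of a line, any line that starts with digits can never match, which sidesteps the lookbehind problem entirely.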

CodePudding user response：

As an alternative, you can use the PyPI regex module, which supports variable-width lookbehinds, and add an optional \s? to the lookbehind. You can also omit the capture group if you only need the match.

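# Requires the third-party package: pip install regex
import regex as re

# The regex module allows a variable-width lookbehind, so \s? can go inside it
pattern = r"(?<!\d\d\d\s?)lt\.?\s?blue\b"

s = ("ltblue\n"
     "500lt.blue\n"
     "4009 lt blue\n"
     "lt. blue\n"
     "032 lt red")

print(re.findall(pattern, s))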

Output

['ltblue', 'lt. blue']