Regex - negative lookbehind for any character excluding pure whitespace

Time:04-08

I'm trying to write a regex that fails to match when the text preceding it on the same line contains anything other than whitespace. For example:

--hello (match)
--goodbye (match)
ROW_NUMBER() OVER (ORDER BY DATE) --date (fail)
  --comment with some indentation (match)
    --another comment with some indentation (match)

The closest I've got is this pattern, (?<!.)--.*\n, which gives me this result:

--hello (match)
--goodbye (match)
ROW_NUMBER() OVER (ORDER BY DATE) --date (fail)
  --comment with some indentation (fail)
    --another comment with some indentation (fail)
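
The result above can be reproduced with a short sketch, assuming Python's re module (the question itself does not name an engine, and the sample text is shortened for illustration). Since . matches any character except a line break, (?<!.) only succeeds at the very start of a line, so any indentation before -- makes it fail:

```python
import re

# Shortened sample; every line ends with \n because the pattern requires it.
text = (
    "--hello\n"
    "ROW_NUMBER() OVER (ORDER BY DATE) --date\n"
    "  --comment with some indentation\n"
)

# (?<!.) fails whenever the previous character is anything but a line break,
# and that includes the spaces used for indentation.
matches = re.findall(r'(?<!.)--.*\n', text)
print(matches)
```

Only the unindented comment is returned; the indented one is rejected for the same reason as the trailing --date.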

I've also tried (?<!\s)--.*\n and (?<=\S)--.*\n, but both return no matches at all.

EDIT: a regexr.com demo that illustrates the issue more clearly: regexr.com/6j0mt

Answer:

With the PyPI regex module, which supports variable-width lookbehinds, you can use

import regex

text = r"""--hello
--goodbye
ROW_NUMBER() OVER (ORDER BY DATE) --date
  --comment with some indentation
    --another comment with some indentation"""

print(regex.findall(r'(?<=^[^\S\r\n]*)--.*', text, regex.M))
# => ['--hello', '--goodbye', '--comment with some indentation', '--another comment with some indentation']


Or, with Python's built-in re module, consume the indentation and capture the comment in a group:

import re
 
text = r"""--hello
--goodbye
ROW_NUMBER() OVER (ORDER BY DATE) --date
  --comment with some indentation
    --another comment with some indentation"""
 
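print(re.findall(r'^[^\S\r\n]*(--.*)', text, re.M))
# => ['--hello', '--goodbye', '--comment with some indentation', '--another comment with some indentation']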

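
The capture group is needed because the built-in re module only accepts fixed-width lookbehinds, so the regex-module pattern above cannot be reused as is. A minimal sketch, assuming CPython's re; the sample string and the variable names are illustrative only:

```python
import re

sample = "--hello\nSELECT 1 --date\n  --indented comment"

# The built-in re rejects the variable-width lookbehind at compile time.
try:
    re.compile(r'(?<=^[^\S\r\n]*)--.*', re.M)
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Workaround: match the indentation, then capture only the comment.
# start(1) gives the offset of the comment itself, not of the indentation.
pattern = re.compile(r'^[^\S\r\n]*(--.*)', re.M)
comments = [(m.start(1), m.group(1)) for m in pattern.finditer(sample)]
print(lookbehind_supported, comments)
```

Using finditer with start(1) and group(1) recovers the same span that the lookbehind version would report, while the trailing --date comment is still skipped.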

Pattern details

  • (?<=^[^\S\r\n]*) - a positive lookbehind that matches a location immediately preceded by the start of a string/line and zero or more horizontal whitespace chars
  • ^ - start of a string (here, of a line, because the re.M / regex.M option is used)
  • [^\S\r\n]* - zero or more chars other than non-whitespace, CR and LF chars (that is, any whitespace except carriage returns and line feeds)
  • (--.*) - Group 1: -- and the rest of the line (.* matches zero or more chars other than line break chars, as many as possible)
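
The double negative in [^\S\r\n] can be checked in isolation. This is a small sketch; the helper name is_horizontal_whitespace is hypothetical and not part of the answer:

```python
import re

# "Not non-whitespace, not CR, not LF" = whitespace that stays on one line.
HORIZONTAL_WS = re.compile(r'[^\S\r\n]*')

def is_horizontal_whitespace(s: str) -> bool:
    """Return True if s is empty or contains only whitespace other than CR/LF."""
    return HORIZONTAL_WS.fullmatch(s) is not None

print(is_horizontal_whitespace(" \t"))  # spaces and tabs are accepted
print(is_horizontal_whitespace(" \n"))  # a line feed is rejected
print(is_horizontal_whitespace(" x"))   # a non-whitespace char is rejected
```

Excluding CR and LF keeps the indentation match from running across line boundaries, which plain \s* would be free to do.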