Regex Or condition matching order

Time:10-13

I have a question about the matching order of a regex whose alternatives are joined by the | operator. I have this regex: " ?\p{L}+|\s+". For a string like inputs = "    s", re.findall() splits it into "    " and "s". My question is: how is the order determined? " ?\p{L}+" should give " s", so why is the space missing from the second match? To clarify, I am using the Python regex module.

To reproduce:

import regex as re

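# First alternative: an optional space followed by letters; second: a run of whitespace
pat = re.compile(r" ?\p{L}+|\s+")
inputs = "    s"
print(pat.findall(inputs))  # the spaces and the letter come back as separate matches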

Many thanks for your help!

CodePudding user response:

How the regex " ?\p{L}+|\s+" is matched against the input "    s":

  • The regex engine scans the input from left to right and reports the leftmost match it can find.
  • At the start of the input it first tries the first alternative, " ?\p{L}+". The optional space must be followed immediately by a letter, but the first characters are spaces, so this alternative fails at that position.
  • Next it tries the second alternative, \s+, at the same position. That succeeds, and because + is greedy the first match is the whole run of spaces.
  • All the leading spaces have now been consumed, including the one that " ?" could have matched, and the pointer moves to the letter s.
  • The regex engine then tries the alternatives again, starting at s.
  • This time " ?\p{L}+" succeeds: " ?" matches nothing and \p{L}+ matches s, so the second match is "s".
  • The regex engine stops at this point since it has reached the end of the input.
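
The steps above can be checked with a short trace. This is only a sketch: it uses the standard-library re module rather than regex, so it assumes [^\W\d_] as a rough stand-in for \p{L}, and the trace helper and LETTER name are made up for illustration.

```python
import re

# Assumption: stdlib re has no \p{L}, so [^\W\d_] approximates "any letter".
LETTER = r"[^\W\d_]"
pat = re.compile(rf" ?{LETTER}+|\s+")


def trace(pattern, text):
    """Return (start, end, matched_text) for each successive match."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


inputs = "    s"  # four spaces, then a letter
steps = trace(pat, inputs)
for start, end, matched in steps:
    print(f"{start}-{end}: {matched!r}")

# With a single leading space the first alternative succeeds at position 0,
# so the space stays attached to the letter.
single = pat.findall(" s")
print(single)
```

The first match spans positions 0 to 4 (the spaces) and the second spans 4 to 5 (the letter), which is why the space never ends up in front of "s" unless it is the only one.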

CodePudding user response:

You can add a negative lookahead to the second alternative so that \s+ does not consume the space that " ?\p{L}+" would otherwise match:

pat = re.compile(r" ?\p{L}+|\s+(?!\p{L})")

Demo: https://replit.com/@blhsing/WeeklyVividInverse
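
A minimal sketch of the effect of the lookahead, again assuming the standard-library re module with [^\W\d_] standing in for \p{L} (the LETTER and pat_fixed names are illustrative only):

```python
import re

# Assumption: stdlib re has no \p{L}, so [^\W\d_] approximates "any letter".
LETTER = r"[^\W\d_]"
pat_fixed = re.compile(rf" ?{LETTER}+|\s+(?!{LETTER})")

# \s+ first grabs all four spaces, the lookahead sees a letter and fails,
# so \s+ backtracks by one space and leaves it for the first alternative.
result = pat_fixed.findall("    s")
print(result)

# The same thing happens between words separated by two spaces.
words = pat_fixed.findall("a  b")
print(words)
```

Here the whitespace match gives back its last space, so the letter is returned with one leading space.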
