I have a question about the matching order of a regex whose alternatives are joined by the | operator. I have this regex: " ?\p{L}+|\s+". For a string like inputs = "     s" (five spaces followed by s), when I run re.findall(), it is split into "     " and "s". My question is: how is the order determined? " ?\p{L}+" should give " s", so why is its leading space missing in the final result? To clarify, I am using the Python regex module.
To reproduce:
import regex as re
pat = re.compile(r" ?\p{L}+|\s+")
inputs = "     s"
print(re.findall(pat, inputs))
Many thanks for your help!
CodePudding user response:
How the regex " ?\p{L}+|\s+" matches against the input "     s":
- The regex engine scans the input from left to right, and at each position it tries the alternatives in the order they are written.
- At the start of the input it first tries the first alternative, " ?\p{L}+". The optional space can match, but the next character is another space rather than a letter, so this alternative fails at that position.
- Next it tries "\s+", which succeeds and greedily takes all the whitespace, so the first match is "     ".
- Now 5 spaces have been consumed by this match, and the pointer moves to the letter s.
- The engine then attempts to match at s, trying the alternatives again in order.
- This time " ?\p{L}+" succeeds in matching s (the optional space matches nothing), so the second match is "s".
- The engine stops at this point since it has reached the end of the input.
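The steps above can be checked with a small sketch that prints where each match starts and ends. It uses only the standard library, so the character class [^\W\d_] stands in for \p{L}; that substitution is an assumption made for illustration and only approximates "any letter". The names LETTER and spans are hypothetical.

```python
import re

# Assumption: [^\W\d_] is a stdlib stand-in for \p{L} (approximate only).
LETTER = r"[^\W\d_]"
pat = re.compile(rf" ?{LETTER}+|\s+")

inputs = "     s"  # five spaces followed by one letter

# Record (start, end, text) for every match, in the order the engine finds them.
spans = [(m.start(), m.end(), m.group()) for m in pat.finditer(inputs)]
for start, end, text in spans:
    print(start, end, repr(text))

# With a single leading space, the first alternative succeeds at position 0.
single = pat.findall(" s")
print(single)
```

The second match begins at index 5, exactly where the whitespace match ended, which is why the letter no longer has a space available in front of it.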
CodePudding user response:
You can use a negative lookahead to stop "\s+" from consuming the space that " ?\p{L}+" would otherwise match:
pat = re.compile(r" ?\p{L}+|\s+(?!\p{L})")
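As a rough check of what the lookahead changes, here is a minimal sketch using the standard library only. As an assumption for illustration, [^\W\d_] again approximates \p{L}, and the names LETTER, result, and tab_case are hypothetical.

```python
import re

# Assumption: [^\W\d_] is a stdlib stand-in for \p{L} (approximate only).
LETTER = r"[^\W\d_]"
pat = re.compile(rf" ?{LETTER}+|\s+(?!{LETTER})")

# The whitespace run backtracks by one character so that it is not
# followed by a letter, leaving the last space for the first alternative.
result = pat.findall("     s")
print(result)

# Side effect: a lone tab directly before a letter matches neither
# alternative, so it is skipped rather than returned.
tab_case = pat.findall("\ts")
print(tab_case)
```

Note the trade-off in the second call: whitespace other than a plain space that sits immediately before a letter is left out of the results, so this pattern may need a further fallback alternative if every character has to be covered.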