Guys, my question is not just about + or * in regex. I'm really interested in the effect a quantifier has on the preceding token. In my example, \w+ affects the result of \W?, and I want to understand this. Please don't close the question without answering.
I hope you won't close the question: I tried to find an answer, but no similar case was explained anywhere. I have read the documentation and watched a lot of videos about regex, but I still can't understand one simple issue.
Why do the two lines below return different output? I mean, if we use +, which means 1 or more letters or digits, the match stops at the last letter of abcdef, but if we use *, which means 0 or more, it returns "=" too. Why does \w+ or \w* affect the output of the preceding \W?. I mean, the character "=" should be returned because of "\W?", so why does it depend on the subsequent \w? Thanks!
import re
print(re.search(r"\w+\W?\w+", "abcdef==ncabcd"))
print(re.search(r"\w+\W?\w*", "abcdef==ncabcd"))
<re.Match object; span=(0, 6), match='abcdef'>
<re.Match object; span=(0, 7), match='abcdef='>
CodePudding user response:
This is a deceptively interesting case. Each quantified token in the pattern matches as much as it can, but the regex engine backtracks if a later token cannot match after what the earlier tokens consumed.
To explain, you will need to examine exactly what the engine does at each step:
(Hyphens indicate what the engine has matched up to that point.)
\w+\W?\w+
- Step 1: The \w+ matches as many word characters as it can, with a minimum of one.
abcdef==ncabcd
------
- Step 2: The \W? matches a non-word character, if one exists.
abcdef==ncabcd
-------
- Step 3: The second \w+ tries to match as many word characters as it can, with a minimum of one. However, the next character is "=", so no matching characters are found:
abcdef==ncabcd
-------X
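- Step 4: The previous step found no valid match, so the engine backtracks. Letting \W? match nothing does not help, because the following token would still face "=", so the first token gives back one character: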
abcdef==ncabcd
-----
- Step 5: Re-apply the check for \W?. The next character is "f", which is a word character, so nothing is matched, but the "?" marks the token as optional and the engine can safely continue:
abcdef==ncabcd
-----
- Step 6: Re-apply the check for the final \w+, and this time a word character ("f") is found:
abcdef==ncabcd
------
- Step 7: The expression is satisfied, resulting in a match of abcdef.
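One way to check this walk-through from Python is to wrap each piece of the pattern in a capturing group and look at what each group ended up holding. This is only a sketch; the groups are added here for illustration and are not part of the original pattern.

```python
import re

text = "abcdef==ncabcd"

# Same pattern as in the question, with each piece in its own capturing group.
m = re.search(r"(\w+)(\W?)(\w+)", text)

print(m.group(0))  # abcdef
print(m.groups())  # ('abcde', '', 'f')
```

The first group gave up the "f" so that the last group could satisfy its one-character minimum, and the optional middle group matched nothing.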
\w+\W?\w*
- Step 1: The \w+ matches as many word characters as it can, with a minimum of one.
abcdef==ncabcd
------
- Step 2: The \W? matches a non-word character, if one exists.
abcdef==ncabcd
-------
- Step 3: The \w* matches as many word characters as it can, with no minimum. The next character is the second "=", so it matches zero characters, which is allowed:
abcdef==ncabcd
-------
- Step 4: The expression is satisfied, resulting in a match of abcdef=.
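The same check with capturing groups (again added only for illustration) shows that no backtracking was needed for the \w* variant: the optional non-word token keeps the "=" and the trailing token is allowed to match nothing.

```python
import re

text = "abcdef==ncabcd"

# The star variant from the question, with each piece in its own capturing group.
m_star = re.search(r"(\w+)(\W?)(\w*)", text)

print(m_star.group(0))  # abcdef=
print(m_star.groups())  # ('abcdef', '=', '')
```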
To see this in action, you can go to https://regex101.com/r/d3ObCZ/1 and select the "Regex Debugger" on the left to see what the engine is doing step by step.
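If you would rather trace the steps locally, the following is a minimal toy backtracking matcher, not how Python's re module is actually implemented. It assumes a pattern is just a list of (label, predicate, min, max) tokens, only tries to match at position 0, and approximates \w with str.isalnum() plus underscore. All names in it (match_here, is_word, and so on) are hypothetical helpers written for this sketch.

```python
from typing import Callable, List, Optional, Tuple

# (label, predicate, minimum count, maximum count or None for unbounded)
Token = Tuple[str, Callable[[str], bool], int, Optional[int]]


def is_word(ch: str) -> bool:
    # Rough stand-in for \w; good enough for the ASCII input used here.
    return ch.isalnum() or ch == "_"


def is_non_word(ch: str) -> bool:
    return not is_word(ch)


def match_here(tokens: List[Token], text: str, pos: int,
               trace: List[Tuple[str, str]]) -> Optional[int]:
    """Try to match all tokens starting at pos; return the end index or None."""
    if not tokens:
        return pos
    label, pred, lo, hi = tokens[0]
    # Greedy phase: take as many characters as the token allows.
    count = 0
    while (pos + count < len(text)
           and (hi is None or count < hi)
           and pred(text[pos + count])):
        count += 1
    # Backtracking phase: give back one character at a time until the
    # remaining tokens can match, or the minimum would be violated.
    while count >= lo:
        trace.append((label, text[pos:pos + count]))
        end = match_here(tokens[1:], text, pos + count, trace)
        if end is not None:
            return end
        count -= 1
    return None


text = "abcdef==ncabcd"
plus_pattern: List[Token] = [(r"\w+", is_word, 1, None),
                             (r"\W?", is_non_word, 0, 1),
                             (r"\w+", is_word, 1, None)]
star_pattern: List[Token] = [(r"\w+", is_word, 1, None),
                             (r"\W?", is_non_word, 0, 1),
                             (r"\w*", is_word, 0, None)]

results = {}
for name, tokens in (("plus", plus_pattern), ("star", star_pattern)):
    trace: List[Tuple[str, str]] = []
    end = match_here(tokens, text, 0, trace)
    results[name] = (text[:end], trace)
    print(name, "matched", repr(text[:end]))
    for label, chunk in trace:
        print("   ", label, "tries", repr(chunk))
```

For the + variant the trace shows the first token trying "abcdef", then retrying with "abcde" after the tokens behind it fail; for the * variant every token succeeds on its first attempt.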
CodePudding user response:
Python regex quantifiers are greedy: they try to match as much as possible. In your example with \w+\W?\w+ this is abcdef, since \w doesn't match = (and the final \w+ needs to match at least one character).
In the second example, with \w+\W?\w*, the longest possible match is abcdef=, since \w* doesn't have to match anything.
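A quick way to see that the minimum of one character is what matters: run the + pattern on a hypothetical variant of the input with a single "=" (this second string is an assumption made up for the comparison, not from the question). There the "=" is kept, because a word character follows it.

```python
import re

pattern = r"\w+\W?\w+"

one_eq = re.search(pattern, "abcdef=ncabcd")    # hypothetical input with one "="
two_eq = re.search(pattern, "abcdef==ncabcd")   # input from the question

print(one_eq.group(0))  # abcdef=ncabcd
print(two_eq.group(0))  # abcdef
```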