How does + or * affect the previous character in a regular expression?

Guys, my question is not just about + or * in regex. I'm really interested in their effect on the preceding token. In my example, the trailing \w+ or \w* affects the result of the preceding \W?, and I want to understand why. Please don't close the question without answering.

I hope you won't close the question, because I tried to find an answer but found no similar case explained. I have read the documentation and watched a lot of videos about regex, but I still can't understand one simple issue.

Why do the two lines below return different output? If we use +, which means one or more letters or digits, the match stops at the last letter of abcdef, but if we use *, which means zero or more, it returns "=" too. Why does \w+ or \w* affect the output of the preceding "\W?"? The character "=" should be returned because of "\W?", so why does it depend on the subsequent \w? Thanks!

import re
print(re.search(r"\w+\W?\w+", "abcdef==ncabcd"))
print(re.search(r"\w+\W?\w*", "abcdef==ncabcd"))

<re.Match object; span=(0, 6), match='abcdef'>
<re.Match object; span=(0, 7), match='abcdef='>

CodePudding user response：

This is a deceptively interesting case. It comes down to how the regex engine matches as much as possible for each token, but backtracks if a subsequent token cannot match after that initial attempt.

To explain, you will need to examine exactly what the engine does at each step:

(Hyphens indicate what the engine has matched up to that point.)

\w+\W?\w+

Step 1: The \w+ matches as many word characters as it can with a minimum of one.

abcdef==ncabcd
------

Step 2: The \W? matches a non-word character, if one exists.

abcdef==ncabcd
-------

Step 3: The \w+ matches as many word characters as it can with a minimum of one. However, no matching characters are found:

abcdef==ncabcd
-------X

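Step 4: The previous step didn't find a valid match, so the engine backtracks. Letting \W? match nothing doesn't help, because \w+ still cannot match the "=" at that position, so the first \w+ gives back one character: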

abcdef==ncabcd
-----

Step 5: Re-apply the check for \W?. None is found, but the "?" marks it as optional so we can safely continue:

abcdef==ncabcd
-----

Step 6: Re-apply the check for \w+, and this time one is found (the "f"):

abcdef==ncabcd
------

Step 7: The expression is satisfied, resulting in a match of abcdef.
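
One way to check the walk-through above is to wrap each token in a capturing group and look at what each one ended up matching. This is only a small sketch; the parentheses are added here for inspection and are not part of the pattern in the question.

```python
import re

text = "abcdef==ncabcd"

# Same pattern as in the question, with each token in its own capturing group
# (the groups are an addition for illustration only).
plus_match = re.search(r"(\w+)(\W?)(\w+)", text)

print(plus_match.group(0))  # the whole match: abcdef
print(plus_match.groups())  # per-token pieces: ('abcde', '', 'f')
```

The middle group is empty and the last group holds only "f", which is the state described in Steps 4 to 6.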

\w+\W?\w*

Step 1: The \w+ matches as many word characters as it can with a minimum of one.

abcdef==ncabcd
------

Step 2: The \W? matches a non-word character, if one exists.

abcdef==ncabcd
-------

Step 3: The \w* matches as many word characters as it can with no minimum:

abcdef==ncabcd
-------

Step 4: The expression is satisfied, resulting in a match of abcdef=.
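
The same kind of check can be done for this pattern. Again, the capturing groups are added only to inspect what each token matched; this is a sketch, not part of the original code.

```python
import re

text = "abcdef==ncabcd"

# Each token of \w+\W?\w* in its own group (groups added for illustration).
star_match = re.search(r"(\w+)(\W?)(\w*)", text)

print(star_match.group(0))  # the whole match: abcdef=
print(star_match.groups())  # per-token pieces: ('abcdef', '=', '')
```

Here the last group is an empty string: \w* is satisfied by zero characters, so nothing forces the engine to backtrack and \W? keeps its "=".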

To see this in action, you can go to https://regex101.com/r/d3ObCZ/1 and select the "Regex Debugger" on the left to see what the engine is doing step by step.
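
If you would rather trace this locally, here is a rough sketch of a tiny greedy backtracking matcher. It is not how Python's re module is implemented. The helper names (is_word, match_here, run) are made up for this example, is_word is only an approximation of \w, and it only tries to match from position 0.

```python
def is_word(ch):
    # Rough stand-in for \w (letters, digits, underscore); an assumption of this sketch.
    return ch.isalnum() or ch == "_"

def is_non_word(ch):
    return not is_word(ch)

def match_here(tokens, text, pos, trace):
    """Greedily match tokens at text[pos:], backtracking on failure.

    Returns the end index of the match, or None if no match is possible.
    """
    if not tokens:
        return pos  # every token is satisfied
    name, pred, lo, hi = tokens[0]
    # Count how many characters this token could consume at most.
    n = 0
    while pos + n < len(text) and (hi is None or n < hi) and pred(text[pos + n]):
        n += 1
    # Try the longest option first, then give characters back one at a time.
    for take in range(n, lo - 1, -1):
        trace.append((name, text[pos:pos + take]))
        end = match_here(tokens[1:], text, pos + take, trace)
        if end is not None:
            return end
    return None  # this token cannot make the rest succeed from here

def run(tokens, text):
    trace = []
    end = match_here(tokens, text, 0, trace)
    return end, trace

# Each token: (label, character test, minimum count, maximum count or None).
PLUS = [(r"\w+", is_word, 1, None), (r"\W?", is_non_word, 0, 1), (r"\w+", is_word, 1, None)]
STAR = [(r"\w+", is_word, 1, None), (r"\W?", is_non_word, 0, 1), (r"\w*", is_word, 0, None)]

text = "abcdef==ncabcd"
for tokens in (PLUS, STAR):
    end, trace = run(tokens, text)
    print("match:", repr(text[:end]))
    for name, piece in trace:
        print("  tried", name, "=", repr(piece))
```

For the + pattern the trace shows six attempts, including \W? first taking "=" and then giving it up. For the * pattern it shows only three, with no backtracking at all.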

CodePudding user response：

Python regexes will try to match as much as possible. In your example with + this is abcdef, since \w doesn't match = (and \w+ needs to match at least one character).

In the second example the longest possible match is abcdef=, since \w* is allowed to match nothing.
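
As a quick contrast, here is the + pattern on a string with a single "=" (a made-up variation of the input, not from the question). It is only a sketch to show that \W? does keep its character when a word character follows it.

```python
import re

single_eq = "abcdef=ncabcd"  # hypothetical input with one "=" instead of two

one = re.search(r"\w+\W?\w+", single_eq)
two = re.search(r"\w+\W?\w+", "abcdef==ncabcd")

print(one.group(0))  # abcdef=ncabcd
print(two.group(0))  # abcdef
```

With one "=", the final \w+ can match "ncabcd" right after it, so the whole string is matched. With two, the second "=" blocks the final \w+ and the engine has to fall back to abcdef.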