How does + or * affect the previous character in a regular expression?

Time:06-23

Guys, my question is not just about + or * in regex. I'm really interested in its effect on the preceding token. In my example, \w+ affects the result of \W?, and I want to understand this. Please don't close the question without answering.

I hope you won't close the question: I tried to find an answer, but no similar case was explained anywhere. I have read the documentation and watched a lot of videos about regex, but I still can't understand one simple issue.

Why do the two lines below return different output? I mean, if we use +, which means 1 or more letters or digits, it stops on the last letter of abcdef, but if we use *, which means 0 or more, it returns "=" too. But why does \w+ or \w* affect what the previous \W? matches? I mean, the character "=" should be returned because of "\W?", so why does it depend on the subsequent \w? Thanks!

import re
print(re.search(r"\w+\W?\w+", "abcdef==ncabcd"))
print(re.search(r"\w+\W?\w*", "abcdef==ncabcd"))

<re.Match object; span=(0, 6), match='abcdef'>
<re.Match object; span=(0, 7), match='abcdef='>

CodePudding user response:

This is a deceptively interesting case. It comes down to how the regex engine matches as much as possible with each token, but backtracks if a subsequent token cannot match after that initial attempt.

To explain, you will need to examine exactly what the engine does at each step:

(Hyphens indicate what the engine has matched up to that point.)

\w+\W?\w+

  • Step 1: The \w+ matches as many word characters as it can with a minimum of one.
abcdef==ncabcd
------
  • Step 2: The \W? matches a non-word character, if one exists.
abcdef==ncabcd
-------
  • Step 3: The \w+ matches as many word characters as it can with a minimum of one. However, the next character is "=", so no matching characters are found:
abcdef==ncabcd
-------X
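  • Step 4: The previous step didn't find a valid match, so the engine backtracks. Dropping the optional \W? alone does not help, because \w+ still cannot match "=", so the first \w+ gives back one character: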
abcdef==ncabcd
-----
  • Step 5: Re-apply the check for \W?. The next character is "f", which is a word character, so none is found, but the "?" marks it as optional so we can safely continue:
abcdef==ncabcd
-----
  • Step 6: Re-apply the check for \w+, and this time one is found (the "f"):
abcdef==ncabcd
------
  • Step 7: The expression is satisfied, resulting in a match of abcdef.
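The steps above can be checked from Python by wrapping each token in a capture group. This is only a sketch: the parentheses and the variable names are added for illustration and are not part of the original pattern, but they do not change what the pattern matches.

```python
import re

text = "abcdef==ncabcd"

# Same pattern as in the question, with one capture group per token
m = re.search(r"(\w+)(\W?)(\w+)", text)

print(m.group(0))  # abcdef
print(m.groups())  # ('abcde', '', 'f')
```

The first group holds only abcde, the optional group is empty, and the final \w+ takes the f that the first \w+ gave back.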

\w+\W?\w*

  • Step 1: The \w+ matches as many word characters as it can with a minimum of one.
abcdef==ncabcd
------
  • Step 2: The \W? matches a non-word character, if one exists.
abcdef==ncabcd
-------
  • Step 3: The \w* matches as many word characters as it can with no minimum. The next character is "=", so it matches zero characters, which is allowed:
abcdef==ncabcd
-------
  • Step 4: The expression is satisfied, resulting in a match of abcdef=.
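The same capture-group check (again an illustrative sketch, with the groups and names added here) shows that no backtracking is needed for the second pattern:

```python
import re

text = "abcdef==ncabcd"

# One capture group per token; \w* is allowed to match nothing
m_star = re.search(r"(\w+)(\W?)(\w*)", text)

print(m_star.group(0))  # abcdef=
print(m_star.groups())  # ('abcdef', '=', '')
```

Here the first group keeps all of abcdef, \W? keeps the "=", and the last group is the empty string.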

To see this in action, you can go to https://regex101.com/r/d3ObCZ/1 and select the "Regex Debugger" on the left to see what the engine is doing step by step.

CodePudding user response:

Python regexes will try to match as much as possible. In your example with + this is abcdef, since \w doesn't match = (and \w+ needs to match at least one character).

In the second example the longest possible match is abcdef=, since \w* is allowed to match zero characters.
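One way to test this reasoning is to vary the input. In the sketch below, the string with a single "=" is a made-up variation, not from the question; with it, a word character follows the "=", so \W? keeps it and the trailing \w+ succeeds without any backtracking.

```python
import re

pattern = r"\w+\W?\w+"

single = "abcdef=ncabcd"   # hypothetical input with one "="
double = "abcdef==ncabcd"  # input from the question

res_single = re.search(pattern, single).group(0)
res_double = re.search(pattern, double).group(0)

print(res_single)  # abcdef=ncabcd
print(res_double)  # abcdef
```

So \W? only ends up matching "=" when whatever follows it in the pattern can still match afterwards.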
