The unexpected behavior of a tuple in this regular expression example

Time:04-17

I tested this regular expression

test_reg = r"(?<=\(|\s)\d+\s?(?=((?:\s*(?:wh|www)[1-9]?\s?){1,3}))\1\b"

on the following three examples:

high energy densities (695 wh www ) at
high energy densities (695 wh www) at
high energy densities (695 wh www at

According to regex101, the ')' in the first example somehow makes the match fail, and deleting it solves the problem. I don't understand why.

CodePudding user response:

In high energy densities (695 wh www ) at, the ((?:\s*(?:wh|www)[1-9]?\s?){1,3}) capturing group #1 also captures the optional whitespace that sits right after www and before ). When that value is consumed with \1, there is no way to re-match this part of the string without the trailing space, since the backreference must match exactly what the group captured and cannot be backtracked into. Thus, the \1 value with the space at the end fails, because \b, the word boundary, does not match at the location between the space and ) (both are non-word chars).
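
The behaviour described above can be checked outside regex101 with a minimal Python sketch. It assumes Python's re module and that the quantifier on \d is +. The no_boundary pattern and the inspect helper are hypothetical names introduced here only to expose what Group 1 holds; they are not part of the original question.

```python
import re

# Pattern from the question (assumption: the quantifier on \d is "+").
original = re.compile(
    r"(?<=\(|\s)\d+\s?(?=((?:\s*(?:wh|www)[1-9]?\s?){1,3}))\1\b"
)
# Same pattern without the trailing \b, used only to reveal Group 1.
no_boundary = re.compile(
    r"(?<=\(|\s)\d+\s?(?=((?:\s*(?:wh|www)[1-9]?\s?){1,3}))\1"
)

samples = [
    "high energy densities (695 wh www ) at",
    "high energy densities (695 wh www) at",
    "high energy densities (695 wh www at",
]


def inspect(text):
    """Return (Group 1 as captured, full match of the original pattern or None)."""
    probe = no_boundary.search(text)
    full = original.search(text)
    return probe.group(1), (full.group(0) if full else None)


for text in samples:
    group1, matched = inspect(text)
    # repr() makes a trailing space in the captured value visible.
    print(repr(group1), "->", repr(matched))
```

For the first sample, Group 1 comes back as 'wh www ' with the trailing space, and the full pattern finds nothing because \b cannot match between that space and ). For the third sample the same trailing space is captured, but it is followed by the letter a, so \b succeeds.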

Hence, the \s? at the end of the Group 1 pattern should either be removed, or be moved (as \s? or \s*) to the end of the expression, after \b:

test_reg = r"(?<=[(\s])\d+\s?(?=((?:\s*(?:wh|www)[1-9]?){1,3}))\1\b"
test_reg = r"(?<=[(\s])\d+\s?(?=((?:\s*(?:wh|www)[1-9]?){1,3}))\1\b\s?"
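
As a quick sanity check of the two variants, here is a small Python sketch run against the three sample strings. It assumes Python's re module and that the quantifier on \d is +; the names fix_removed, fix_moved and first_match are hypothetical and exist only for this illustration.

```python
import re

# Variant 1: the trailing \s? is dropped from Group 1.
fix_removed = re.compile(
    r"(?<=[(\s])\d+\s?(?=((?:\s*(?:wh|www)[1-9]?){1,3}))\1\b"
)
# Variant 2: the optional whitespace is moved after the word boundary.
fix_moved = re.compile(
    r"(?<=[(\s])\d+\s?(?=((?:\s*(?:wh|www)[1-9]?){1,3}))\1\b\s?"
)

samples = [
    "high energy densities (695 wh www ) at",
    "high energy densities (695 wh www) at",
    "high energy densities (695 wh www at",
]


def first_match(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


for text in samples:
    # repr() shows whether the match ends with a space.
    print(repr(first_match(fix_removed, text)),
          repr(first_match(fix_moved, text)))
```

Both variants match all three samples. The first always stops at the end of www, while the second additionally consumes one trailing whitespace character when there is one.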

See the regex demo.
