Regex conditional lookahead issue

Time:06-09

I have a regex with a conditional lookahead that tests whether a string ends with a numeric substring. If it does, the regex should match those digits; if not, it should match a different substring.

The string in question: "H2K 101"

If just the lookahead is used, i.e. (?=\d{1,8}$)(\d{1,8}$), the lookahead succeeds, and "101" is found in capture group 1

When the lookahead is placed into a conditional, i.e. (?(?=\d{1,8}\z)(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+)), the lookahead now fails, and the second pattern is used, matching "H2K", and a "2" is found in capture group 2.

If the test string has the "2" swapped for a letter, i.e. "HKK 101" then the lookahead conditional works as expected, and the number "101" is once again found in capture group 1.

I've tested this in Regex101 and other PCRE engines, and all work the same, so clearly I'm missing something obvious about conditionals or the condition regex I'm using. Any insight greatly appreciated.

Thanks.

CodePudding user response:

The look ahead is tested at the current position only. At the start of "H2K 101" the next character is "H", not a digit, so the look ahead fails there and the alternative is used, which does find a match ("H2K") at that position.

If you want the look ahead to succeed when still at the initial position, you need to allow for the intermediate characters to occur. Also, when the alternative kicks in, realise that there can follow a second match that still uses the look ahead, but now at a position where the look ahead is successful.
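The behaviour described above can be reproduced with a small Python sketch. Python's `re` module does not support lookahead conditions, so this sketch assumes the standard equivalence of (?(?=X)A|B) with (?=X)A|(?!X)B, and uses Python's \Z in place of PCRE's \z. The helper names `lookahead_positions` and `all_matches` are made up for illustration.

```python
import re

# Assumption: Python's `re` has no lookahead conditionals, so the PCRE
# conditional (?(?=X)A|B) is emulated as (?=X)A|(?!X)B.
# \Z is Python's "very end of string" anchor (PCRE's \z).
LOOKAHEAD = re.compile(r"(?=\d{1,8}\Z)")
EMULATED = re.compile(
    r"(?=\d{1,8}\Z)(\d{1,8}\Z)"
    r"|(?!\d{1,8}\Z)([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+)"
)

def lookahead_positions(text):
    """Return every index at which the bare lookahead succeeds."""
    return [i for i in range(len(text) + 1) if LOOKAHEAD.match(text, i)]

def all_matches(text):
    """Return the text of each successive match of the emulated conditional."""
    return [m.group(0) for m in EMULATED.finditer(text)]

print(lookahead_positions("H2K 101"))  # [4, 5, 6]: never at position 0
print(all_matches("H2K 101"))          # ['H2K', '101']: two consecutive matches
print(all_matches("HKK 101"))          # ['101']: the alternative cannot match "HKK"
```

With "H2K 101" the lookahead only succeeds from index 4 onward, so at index 0 the alternative branch wins; with "HKK 101" the alternative fails everywhere, and the engine advances until the lookahead finally succeeds.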

From what I understand, you are interested in one match only, not two consecutive matches (or more). So that means you should attempt to match the whole string, and capture the part of interest in a capture group. Also, the look ahead should be made to succeed when still at the initial position. This all means you need to inject several .*. There is no need for a conditional.

(?=.*\d{1,8}\z).*?(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+).*

Note also that (?=.*\d{1,8}\z) succeeds if and only if (?=.*\d\z) succeeds, because the leading .* can absorb any earlier digits, so you can simplify that:

(?=.*\d\z).*?(\d{1,8}\z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+).*

There are two capture groups. If there is a match, exactly one of the capture groups will have non-empty content, which is the content you need.
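As a rough check of the simplified pattern, here is a minimal Python sketch. It is an adaptation, not the original PCRE: Python's `re` spells the end-of-string anchor \Z instead of \z. The function name `extract` is hypothetical.

```python
import re

# Adapted for Python's `re`: \Z replaces PCRE's \z; the rest is unchanged.
PATTERN = re.compile(
    r"(?=.*\d\Z).*?(\d{1,8}\Z)|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+).*"
)

def extract(text):
    """Return whichever capture group took part in the match, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Exactly one of the two groups participates in any successful match.
    return m.group(1) if m.group(1) is not None else m.group(2)

print(extract("H2K 101"))  # 101
print(extract("HKK 101"))  # 101
print(extract("H2K"))      # H2K
print(extract("abc"))      # None
```

Because the first alternative is tried first at the initial position, trailing digits take priority whenever the string ends with a digit; the second alternative is only reached otherwise.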

CodePudding user response:

You want to match a number of specific length at the end of the string, and if there is none, match something else.

There is no need for a conditional here. Conditional patterns are necessary to examine what to match next at the given position inside the string based either on a specific group match or a lookaround test. They are not useful when you want to give priority to a specific pattern.
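For contrast, a case where a conditional is the natural tool might look like the following Python sketch, which uses the group-based form (?(1)...) that Python's `re` does support. The pattern and the name `is_valid` are made-up illustrations, not part of the question: a closing parenthesis is required only when an opening one was captured.

```python
import re

# Hypothetical example: group 1 optionally captures "(".
# The conditional (?(1)\)) demands ")" only if group 1 took part in the match.
PAREN = re.compile(r"\A(\()?\d{3}(?(1)\))\Z")

def is_valid(text):
    """True for '123' or '(123)', False for unbalanced forms."""
    return PAREN.match(text) is not None

print(is_valid("(123)"))  # True
print(is_valid("123"))    # True
print(is_valid("(123"))   # False
print(is_valid("123)"))   # False
```

Here the decision about what to match next genuinely depends on what was matched earlier, which is different from simply preferring one pattern over another.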

Here, you can use a PCRE pattern based on the \K operator like

.*?\K\d{1,8}\z|[a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+

Or, using capturing groups

(?|.*?(\d{1,8})\z|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+))

See the regex demo #1 and regex demo #2.

Details:

  • .*?\K\d{1,8}\z - any zero or more chars other than line break chars, as few as possible, then the match reset operator that discards the text matched so far, then one to eight digits at the end of string
  • | - or
  • [a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+ - one or more ASCII letters, then one to eight digits, underscores or hyphens, and then one or more ASCII letters.

And

  • (?| - start of the branch reset group:
    • .*? - any zero or more chars other than line break chars, as few as possible
    • (\d{1,8}) - Group 1: one to eight digits
    • \z - end of string
  • | - or
    • ( - Group 1 start:
    • [a-zA-Z]+ - one or more ASCII letters
    • [\d_-]{1,8} - one to eight digits, underscores, hyphens
    • [a-zA-Z]+ - one or more ASCII letters
    • ) - Group 1 end
  • ) - end of the group.
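Neither \K nor the branch reset group (?|...) is available in Python's standard `re` module, so the following is only an approximation of the second pattern, assuming two ordinary capture groups and `Match.lastindex` to pick whichever one took part. \Z stands in for PCRE's \z, and `first_capture` is a hypothetical helper name.

```python
import re

# Approximation of (?|.*?(\d{1,8})\z|(...)) without branch reset:
# the two alternatives get groups 1 and 2 instead of sharing group 1.
PATTERN = re.compile(r".*?(\d{1,8})\Z|([a-zA-Z]+[\d_-]{1,8}[a-zA-Z]+)")

def first_capture(text):
    """Return the text of the only group that participated, or None."""
    m = PATTERN.search(text)
    # lastindex is the number of the last group that matched (1 or 2 here).
    return m.group(m.lastindex) if m else None

print(first_capture("H2K 101"))    # 101
print(first_capture("H2K"))        # H2K
print(first_capture("ab-cd"))      # ab-cd
print(first_capture("no digits"))  # None
```

In a PCRE engine the branch reset group makes this bookkeeping unnecessary, since both alternatives write to Group 1.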