Regex having optional groups with non-capturing groups

Time:12-16

I have a regex with multiple optional non-capturing groups. Any of these groups can occur, but none of them has to. The regex uses non-capturing groups so that the whole matched string is returned.

When I also make the last group optional, re.findall returns several results, some of them empty. When I make the first group non-optional, the regex matches as expected. Why is that?

The input will be something like input_text = "xyz T1 VX N1 ", and the expected output is T1 VX N1.

import re

# Sample input; written without a trailing space, which reproduces the results below
input_text = "xyz T1 VX N1"

regexs = {
    "allOptional": r'p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?',
    "lastNotOptional": r'p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])',
    "firstNotOptional": r'p?(?:T[X0-4]?)\s?(?:V[X0-2])?\s?(?:N[X0-3])?',
}

for key, regex in regexs.items():
    matches = re.findall(regex, input_text)
    print(key, matches)

# Results
# allOptional      -> ['', '', '', ' ', 'T1 VX N1', '']
# lastNotOptional  -> ['T1 VX N1']
# firstNotOptional -> ['T1 VX N1']

Thanks in advance!

CodePudding user response:

I suggest

\b(?=\w)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?\b(?<=\w)

See the regex demo.
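To see why the all-optional pattern returns several results, it can help to print the span of each match. The following is a minimal sketch, assuming the sample input "xyz T1 VX N1" without a trailing space; the variable names all_optional, bounded, and text are made up for illustration. Since every part of the pattern is optional, it can match the empty string wherever nothing else fits, and a lone space wherever only \s? fits.

```python
import re

# Pattern from the question in which every part is optional.
all_optional = r'p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?'
# The same pattern wrapped in the word-boundary checks suggested above.
bounded = r'\b(?=\w)' + all_optional + r'\b(?<=\w)'

text = "xyz T1 VX N1"  # assumed sample input, without a trailing space

# List each match with its span: the empty matches have start == end.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(all_optional, text)]
for start, end, value in spans:
    print(start, end, repr(value))

# With the boundary checks, empty and whitespace-only matches are rejected.
bounded_matches = re.findall(bounded, text)
print(bounded_matches)
```

Making the first or the last group mandatory has a similar effect, because the pattern can then no longer succeed without consuming at least one T or N token.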

An alternative is a combination of lookarounds that makes sure the match is immediately preceded by a whitespace char or the start of the string and that the first char of the match is a non-whitespace char, plus another lookaround combination (at the end of the pattern) that makes sure the last char of the match is a non-whitespace char and that a whitespace char or the end of the string follows:

(?<!\S)(?=\S)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?(?!\S)(?<=\S)

See this regex demo.

The main point here is the use of two specific kinds of word/whitespace boundaries:

  • \b(?=\w) at the start matches a word boundary position that is immediately followed by a word char
  • \b(?<=\w) at the end asserts a word boundary position with a word char immediately on the left
  • (?<!\S)(?=\S) - a position that is at the start of the string or immediately after a whitespace char, and that is immediately followed by a non-whitespace char
  • (?!\S)(?<=\S) - a position that is at the end of the string or immediately before a whitespace char, and that is immediately preceded by a non-whitespace char.
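The difference between the two boundary styles only shows up when the tokens touch punctuation. Here is a small sketch comparing them; the second sample string "(T1 N0)" is a hypothetical input that is not from the question, and the names core, word_bounded, space_bounded, and results are illustrative.

```python
import re

# Shared all-optional core from the question.
core = r'p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?'
# Word-boundary variant: the match must start and end on a word char.
word_bounded = r'\b(?=\w)' + core + r'\b(?<=\w)'
# Whitespace-boundary variant: the match must be delimited by
# whitespace or by the start/end of the string.
space_bounded = r'(?<!\S)(?=\S)' + core + r'(?!\S)(?<=\S)'

# The second sample is hypothetical: tokens wrapped in parentheses.
samples = ["xyz T1 VX N1 G1", "(T1 N0)"]

results = {}
for sample in samples:
    results[sample] = (
        re.findall(word_bounded, sample),
        re.findall(space_bounded, sample),
    )
    print(repr(sample), results[sample])
```

On the first sample both variants find 'T1 VX N1'. On the parenthesized sample only the word-boundary variant finds 'T1 N0', since the parentheses are non-whitespace chars next to the tokens, so pick the variant according to whether adjacent punctuation should be allowed.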

See a Python demo:

import re
input_text = "xyz T1 VX N1 G1"
pattern = r'\b(?=\w)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?\b(?<=\w)'
print(re.findall(pattern, input_text))
# => ['T1 VX N1']