I have a regex with multiple optional, non-capturing groups. Any of these groups can occur, but none of them has to. The regex uses non-capturing groups so that the whole match is returned as one string.
When I also make the last group optional, the regex returns several results. When I make the first group non-optional, the regex returns the single expected match. Why is that?
The input will be something like input_text = "xyz T1 VX N1", expected output: T1 VX N1.
import re
input_text = "xyz T1 VX N1"
regexs = {
    "allOptional": 'p?(?:T[X0-4]?)?\\s?(?:V[X0-2])?\\s?(?:N[X0-3])?',
    "lastNotOptional": 'p?(?:T[X0-4]?)?\\s?(?:V[X0-2])?\\s?(?:N[X0-3])',
    "firstNotOptional": 'p?(?:T[X0-4]?)\\s?(?:V[X0-2])?\\s?(?:N[X0-3])?',
}
for key, regex in regexs.items():
    # findall returns the whole match because there are no capturing groups
    matches = re.findall(regex, input_text)
    print(key, "=", matches)
# Results
allOptional = ['', '', '', ' ', 'T1 VX N1', '']
lastNotOptional = ['T1 VX N1']
firstNotOptional = ['T1 VX N1']
Thanks in advance!
CodePudding user response:
I suggest
\b(?=\w)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?\b(?<=\w)
See the regex demo.
An alternative is a combination of lookarounds that makes sure the match is immediately preceded by a whitespace char or the start of the string and that the first char of the match is a non-whitespace char, plus another lookaround combination (at the end of the pattern) that makes sure the last char of the match is a non-whitespace char and is followed by a whitespace char or the end of the string:
(?<!\S)(?=\S)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?(?!\S)(?<=\S)
See this regex demo.
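The difference between the two variants shows up when a non-word, non-whitespace char such as a hyphen sits next to the match. A minimal sketch, assuming Python's re module; the names WORD_PATTERN, WS_PATTERN and the sample string "xyz-T1" are only illustrative:

```python
import re

# Word-boundary variant: a hyphen counts as a boundary.
WORD_PATTERN = re.compile(
    r'\b(?=\w)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?\b(?<=\w)'
)
# Whitespace-boundary variant: only whitespace or string start/end counts.
WS_PATTERN = re.compile(
    r'(?<!\S)(?=\S)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?(?!\S)(?<=\S)'
)

# Both variants agree on whitespace-separated input.
print(WORD_PATTERN.findall("xyz T1 VX N1 G1"))  # ['T1 VX N1']
print(WS_PATTERN.findall("xyz T1 VX N1 G1"))    # ['T1 VX N1']

# They differ when the token is glued to other text with a hyphen.
print(WORD_PATTERN.findall("xyz-T1"))  # ['T1']
print(WS_PATTERN.findall("xyz-T1"))    # []
```

Which variant fits depends on whether tokens attached to punctuation should count as matches.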
The main point here is the use of two specific word/whitespace boundaries:
- \b(?=\w) at the start makes sure the word boundary position is matched only if it is immediately followed by a word char
- \b(?<=\w) at the end asserts the position at a word boundary, with a word char immediately on the left
- (?<!\S)(?=\S) - a position that is at the start of the string, or immediately after a whitespace, and that is immediately followed by a non-whitespace char
- (?!\S)(?<=\S) - a position that is at the end of the string, or immediately before a whitespace, and that is immediately preceded by a non-whitespace char.
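The four boundary constructs can be inspected on their own, since each is a zero-width assertion that matches positions rather than text. A small sketch, assuming Python's re module; the helper name positions and the sample strings are made up for illustration:

```python
import re

def positions(pattern, text):
    """Return the indices at which a zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "xyz T1"
print(positions(r'\b(?=\w)', text))       # [0, 4]  starts of words
print(positions(r'\b(?<=\w)', text))      # [3, 6]  ends of words
print(positions(r'(?<!\S)(?=\S)', text))  # [0, 4]  starts of whitespace-separated chunks
print(positions(r'(?!\S)(?<=\S)', text))  # [3, 6]  ends of whitespace-separated chunks

# With a hyphen, the word boundary fires inside the chunk,
# while the whitespace boundary does not.
print(positions(r'\b(?=\w)', "a-b"))       # [0, 2]
print(positions(r'(?<!\S)(?=\S)', "a-b"))  # [0]
```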
See a Python demo:
import re
input_text = "xyz T1 VX N1 G1"
pattern = r'\b(?=\w)p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?\b(?<=\w)'
print(re.findall(pattern, input_text))
# => ['T1 VX N1']
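As for why the all-optional version returns several results: every element in it is optional, so the pattern also matches the empty string (or a single whitespace) at positions where no T/V/N token is present, and re.findall reports each of those matches. The sketch below illustrates this and shows a different, simpler workaround than the boundaries above, namely filtering out blank matches afterwards; the helper name non_empty_matches is hypothetical, and the list of results assumes Python 3.7+ behaviour for empty matches:

```python
import re

ALL_OPTIONAL = r'p?(?:T[X0-4]?)?\s?(?:V[X0-2])?\s?(?:N[X0-3])?'

# Every element is optional, so the pattern matches the empty string.
print(re.fullmatch(ALL_OPTIONAL, '') is not None)  # True

def non_empty_matches(pattern, text):
    """Collect matches, skipping empty and whitespace-only ones."""
    return [m.group() for m in re.finditer(pattern, text) if m.group().strip()]

print(non_empty_matches(ALL_OPTIONAL, "xyz T1 VX N1"))  # ['T1 VX N1']
```

Unlike the boundary-based patterns, this filter does not stop the pattern from matching a token in the middle of a longer word, so it is only a rough alternative.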