When constructing a regular expression to match a list of candidate strings, how can I ensure that every string is matched? For example, this regular expression
(?:O|SO|S|OH|OS)(?:\s?[- *°.1-9]){0,4}
matches all of the examples below:
O 4 2 -
O 2 -
SO 4 * - 2
S 2-
However, if I swap S and SO, the resulting regular expression
(?:O|S|SO|OH|OS)(?:\s?[- *°.1-9]){0,4}
fails to match SO 4 * - 2 as a whole. Instead, the input is split into two matches: S and O 4 * - 2.
So my question is how to order the candidate strings in the alternation so that each of them is matched safely and unambiguously. Since the actual list of candidate strings in my project is more complicated than this example, is there a sorting algorithm that can achieve this?
CodePudding user response:
You could require the character class group to repeat at least once, so that a lone uppercase letter from the alternation is not matched on its own, and reorder the alternatives so that the longer ones come first:
\b(?:SO|OS|O[HS]|[SO])(?:\s?[- *°.1-9]){1,4}
The pattern matches:
\b
A word boundary, to prevent a partial word match
(?:
Non-capturing group for the alternatives
SO|OS|O[HS]|[SO]
Match either SO, OS, OH, S or O
)
Close the non-capturing group
(?:\s?[- *°.1-9]){1,4}
Repeat 1 to 4 times an optional whitespace character followed by one of the listed characters
See a regex101 demo.
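For a longer candidate list, the reordering can be automated. The following is a minimal sketch in Python, assuming that sorting the candidates longest-first is enough to stop a short alternative from shadowing a longer one that starts with it. The helper name build_alternation is hypothetical, and the tail is the {0,4} group from the question:

```python
import re

def build_alternation(candidates):
    """Join candidates into a non-capturing alternation, longest first.

    Hypothetical helper: duplicates are removed, every candidate is
    escaped, and ties in length are broken alphabetically.
    """
    ordered = sorted(set(candidates), key=lambda s: (-len(s), s))
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"

# Tail taken from the question: up to 4 symbols, each optionally preceded by whitespace.
TAIL = r"(?:\s?[- *°.1-9]){0,4}"

prefix = build_alternation(["O", "S", "SO", "OH", "OS"])
print(prefix)  # (?:OH|OS|SO|O|S)

pattern = re.compile(prefix + TAIL)
examples = ["O 4 2 -", "O 2 -", "SO 4 * - 2", "S 2-"]
for s in examples:
    # fullmatch succeeds only if the whole example is consumed by one match
    print(repr(s), bool(pattern.fullmatch(s)))
```

With the longer alternatives first, each of the four examples from the question is consumed as a single match.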
CodePudding user response:
The regular expression engine tries the alternatives in the order in which they are specified.
So when the alternation starts with O|S|SO, it matches S immediately and moves on to the rest of the pattern. The remaining input is O 4 * - 2, and O cannot be matched by the character class, so the match ends at S and a new match starts at O.
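The effect of the ordering can be reproduced with Python's re module, which also tries alternatives from left to right. This is only a small sketch; the variable names are illustrative, and the two patterns are the ones from the question:

```python
import re

# Shared tail from the question.
TAIL = r"(?:\s?[- *°.1-9]){0,4}"

# SO is listed before S, so the longer alternative gets tried first.
so_before_s = re.compile(r"(?:O|SO|S|OH|OS)" + TAIL)
# S is listed before SO, so S wins and SO is never reached at that position.
s_before_so = re.compile(r"(?:O|S|SO|OH|OS)" + TAIL)

text = "SO 4 * - 2"
print(so_before_s.findall(text))  # ['SO 4 * - 2']
print(s_before_so.findall(text))  # ['S', 'O 4 * - 2']
```

In the second pattern the engine never backtracks into the alternation, because the tail may repeat zero times and so the overall match already succeeds after S.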
So I think the trick for matching all of the given strings is to match the first letter and then any further letters:
(?:O|S)(?:O|H|S)*(?:\s?[- *°.1-9]){0,4}
Demo: https://regex101.com/r/3AwQP7/1
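As a quick check of this pattern, the sketch below (Python, with illustrative variable names) runs it against the examples from the question. It also shows one trade-off that may or may not matter for the real candidate list: the pattern accepts letter combinations that were not among the original candidates.

```python
import re

loose = re.compile(r"(?:O|S)(?:O|H|S)*(?:\s?[- *°.1-9]){0,4}")

examples = ["O 4 2 -", "O 2 -", "SO 4 * - 2", "S 2-"]
print(all(loose.fullmatch(s) for s in examples))  # True

# SH and OOO are not in the original list O, S, SO, OH, OS,
# but the letter-by-letter pattern accepts them as well.
print(bool(loose.fullmatch("SH 2-")))  # True
print(bool(loose.fullmatch("OOO")))    # True
```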
CodePudding user response:
You could add \b word boundary assertions to ensure that O and S match only as whole words:
\b(?:O|S|SO|OH|OS)\b(?:\s?[- *°.1-9]){0,4}
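A small Python sketch of how this behaves on the examples from the question (variable names are illustrative). The original alternative order is kept: when S is followed directly by O there is no word boundary, so the engine backtracks and tries SO instead.

```python
import re

bounded = re.compile(r"\b(?:O|S|SO|OH|OS)\b(?:\s?[- *°.1-9]){0,4}")

examples = ["O 4 2 -", "O 2 -", "SO 4 * - 2", "S 2-"]
for s in examples:
    print(repr(s), "->", bounded.findall(s))

# Caveat: digits are word characters, so there is no boundary between
# a letter and a digit that follows it without a space.
print(bounded.search("SO4"))  # None
```

Each of the four examples comes back as a single match. The caveat at the end only matters if the real input can contain a digit directly after the letters.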