I want to recognize a valid string literal as part of a compiler-design task. The string must either start and end with a double quote ("hello world") or start and end with a single quote ('hello world').
I used (['"]).*\1 to achieve this. The \1 is a backreference to the first capturing group, namely the opening single or double quote. As regex101 explains it:
\1 matches the same text as most recently matched by the 1st capturing group
So far this works well.
Then I got a new requirement: an inner single quote inside outer single quotes must be treated as an invalid case, and likewise for double quotes. This means that both 'hello ' world' and "hello " world" are invalid.
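To make the gap concrete, here is a minimal Java sketch of the original pattern. The class name BackreferenceDemo and the helper isQuoted are made up for illustration. It shows that the backreference already rejects mismatched quotes but still accepts a string with an extra inner quote.

```java
import java.util.regex.Pattern;

final class BackreferenceDemo {
    // The pattern from the question: an opening quote, anything, then the same quote again.
    static final Pattern QUOTED = Pattern.compile("(['\"]).*\\1");

    // Hypothetical helper: true if the whole input matches the pattern.
    static boolean isQuoted(String s) {
        return QUOTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isQuoted("'hello world'"));   // true
        System.out.println(isQuoted("\"hello world'"));  // false: quotes differ
        System.out.println(isQuoted("'hello ' world'")); // true, but should be rejected
    }
}
```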
I think the solution would not be hard if we could express "not the text captured by group 1", something like (['"])(?:NOT\1)*\1.
The (?:) here is a non-capturing group, so that \1 always refers to the opening quote. The key question is what to put in place of NOT. This is unlike the exclusion I have used before, such as [^abcd] to exclude a, b, c, and d: here I need to exclude whatever the earlier capture group matched, and the ^ symbol doesn't work that way.
CodePudding user response:
Probably the most efficient method for the task, a simple alternation, was mentioned by @LorenzHetterich in his first comment.
^(?:"[^"]*"|'[^']*')$
It simply alternates between the two quote types: a pair of double quotes with no double quote inside, or a pair of single quotes with no single quote inside.
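A minimal Java sketch of the alternation, assuming the check runs on one candidate string at a time. The class name QuoteAlternation and the helper isValid are hypothetical.

```java
import java.util.regex.Pattern;

final class QuoteAlternation {
    // Either "..." with no double quote inside, or '...' with no single quote inside.
    static final Pattern STRING_LITERAL =
            Pattern.compile("^(?:\"[^\"]*\"|'[^']*')$");

    // Hypothetical helper: true if the whole input is a valid string literal.
    static boolean isValid(String s) {
        return STRING_LITERAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("\"hello world\""));    // true
        System.out.println(isValid("'hello world'"));      // true
        System.out.println(isValid("\"it's\""));           // true: other quote type inside is fine
        System.out.println(isValid("'hello ' world'"));    // false
        System.out.println(isValid("\"hello \" world\"")); // false
    }
}
```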
The technique for excluding a captured value between two parts is known as a tempered greedy token.
It is best used only when no other option is available.
^(['"])(?:(?!\1).)*\1$
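Here the greedy dot is tempered by what was captured in the first group: the negative lookahead (?!\1) is checked before each character, so the dot cannot skip over another occurrence of the opening quote.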
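The same check can be sketched in Java with the tempered greedy token. The class name TemperedQuote and the helper isValid are made up for illustration.

```java
import java.util.regex.Pattern;

final class TemperedQuote {
    // (['"])      capture the opening quote
    // (?:(?!\1).)* consume characters one at a time, but never the captured quote
    // \1          the closing quote must be the same character as the opening one
    static final Pattern STRING_LITERAL =
            Pattern.compile("^(['\"])(?:(?!\\1).)*\\1$");

    // Hypothetical helper: true if the whole input is a valid string literal.
    static boolean isValid(String s) {
        return STRING_LITERAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("'hello world'"));      // true
        System.out.println(isValid("\"it's\""));           // true
        System.out.println(isValid("''"));                 // true: empty literal
        System.out.println(isValid("'hello ' world'"));    // false
        System.out.println(isValid("\"hello \" world\"")); // false
    }
}
```

Unlike the alternation, this form does not need to be extended by hand if more delimiter characters are added to the character class.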
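To prevent matching more than two occurrences of the captured quote, another option is a negative lookahead right after the opening quote, which checks that there are not two more of the same quote ahead:
^(['"])(?!(?:.*?\1){2}).*
As written, this only rejects a third occurrence of the quote; to also require the closing quote at the end, append \1$.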
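A small Java sketch comparing the pattern as given with a variant that appends \1$. The variant is an addition for illustration, not part of the original answer, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class QuoteCountLookahead {
    // The pattern exactly as given above.
    static final Pattern AS_GIVEN =
            Pattern.compile("^(['\"])(?!(?:.*?\\1){2}).*");

    // Assumed variant: the same pattern with \1$ appended to require the closing quote.
    static final Pattern WITH_CLOSING =
            Pattern.compile("^(['\"])(?!(?:.*?\\1){2}).*\\1$");

    static boolean asGiven(String s) {
        return AS_GIVEN.matcher(s).matches();
    }

    static boolean withClosing(String s) {
        return WITH_CLOSING.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(asGiven("'hello ' world'"));     // false: two more quotes ahead
        System.out.println(asGiven("'hello"));              // true: no closing quote required
        System.out.println(withClosing("'hello"));          // false
        System.out.println(withClosing("'hello world'"));   // true
        System.out.println(withClosing("'hello ' world'")); // false
    }
}
```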
FYI: If the patterns are used with Java's matches(), the ^ start and $ end anchors are not needed.
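A short Java sketch of that difference, using the tempered pattern without anchors. Matcher.matches() requires the entire input to match, while Matcher.find() accepts a match anywhere in the input. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class MatchesVsFind {
    // The tempered greedy token pattern without ^ and $.
    static final Pattern UNANCHORED =
            Pattern.compile("(['\"])(?:(?!\\1).)*\\1");

    // matches(): the pattern must cover the whole input.
    static boolean wholeInput(String s) {
        return UNANCHORED.matcher(s).matches();
    }

    // find(): the pattern may match any substring.
    static boolean anywhere(String s) {
        return UNANCHORED.matcher(s).find();
    }

    public static void main(String[] args) {
        String invalid = "'hello ' world'";
        System.out.println(wholeInput(invalid)); // false: behaves as if anchored
        System.out.println(anywhere(invalid));   // true: finds the substring 'hello '
    }
}
```

With find(), the anchors are therefore still needed to reject the invalid cases.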