What I'm looking to do is to modify a regex (JS flavor) to not match if the pattern is both preceded and followed by the same string.
By way of a simple analogy, say I want to match all instances of n that are not both preceded and followed by e. So, for example, the regex should not match the n in alkene, but it should still match the n in pen or nest, which only have the e directly adjacent to n on one side, not both.
Most older threads I've seen trying to find an answer basically say "just use negative lookarounds", but the problem is that (?<!e)n(?!e) doesn't match any of those inputs, because the lookbehind and lookahead are processed by the regex engine separately, so it considers either condition to be sufficient to exclude the match.
(The real regex is (?<!¸ª)()(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)(?!¸ª) and it's failing to match the ɣʷ in t͡ʃe:h₁dɣʷo¸ªh₂¸ª, but that makes the problem look a lot harder to explain than it needs to be)
How do you modify a regex to only exclude patterns when they're nested?
CodePudding user response:
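The (?<!b)a(?!b) pattern here must be replaced with (?<!b(?=ab))a or a(?!(?<=ba)b). The point is to nest the opposite kind of lookaround inside the negative one: a positive lookahead inside the negative lookbehind, or a positive lookbehind inside the negative lookahead. That way both conditions are checked together, and the match is rejected only when both hold.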
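To make the difference concrete, here is a minimal sketch that applies the same idea to the question's n/e analogy. The word list comes from the question; the `matching` helper and the variable names are only illustrative.

```javascript
// Illustrative helper: returns the words in which `re` finds a match.
const words = ["alkene", "pen", "nest"];
const matching = (re) => words.filter((w) => re.test(w));

// Two independent lookarounds: an e on EITHER side rejects the n.
const naive = /(?<!e)n(?!e)/;
// Nested positive lookahead: rejects only if e precedes AND "ne" follows.
const viaLookbehind = /(?<!e(?=ne))n/;
// Nested positive lookbehind: rejects only if "en" precedes AND e follows.
const viaLookahead = /n(?!(?<=en)e)/;

console.log(matching(naive));         // []
console.log(matching(viaLookbehind)); // [ 'pen', 'nest' ]
console.log(matching(viaLookahead));  // [ 'pen', 'nest' ]
```

Both nested forms should behave the same here; which one to prefer is mostly a matter of which side is easier to describe.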
See your pattern fix (without any optimizations), where I took the lookahead, pasted it inside the lookbehind after ¸ª, reversed it (i.e. made it positive), and added the whole consuming pattern before ¸ª in the lookahead so that it can reach the right-hand ¸ª:
(?<!¸ª(?=(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)¸ª))()(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)
Or, if you put the lookbehind into lookahead:
()(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)(?!(?<=¸ª(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a))¸ª)
See the regex demo (and regex demo #2).
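As a rough check, the sketch below runs the original regex and both nested variants against the sample string from the question. The nested lookahead is written as a positive one, as the description says. The inputs `nested` and `oneSide` are made-up edge cases, not taken from the question, and the `core` and `firstMatch` names are only illustrative.

```javascript
// Consuming part of the pattern, shared by all variants (copied from the question).
const core = "(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)";

const original = new RegExp(`(?<!¸ª)()${core}(?!¸ª)`);
// Negative lookbehind with a positive lookahead nested inside it.
const viaLookbehind = new RegExp(`(?<!¸ª(?=${core}¸ª))()${core}`);
// Negative lookahead with a positive lookbehind nested inside it.
const viaLookahead = new RegExp(`()${core}(?!(?<=¸ª${core})¸ª)`);

// Illustrative helper: text of the first match, or null.
const firstMatch = (re, s) => {
  const m = s.match(re);
  return m ? m[0] : null;
};

const sample = "t͡ʃe:h₁dɣʷo¸ªh₂¸ª"; // from the question
const nested = "¸ªɣʷo¸ª";           // assumed input: delimiter on both sides
const oneSide = "¸ªɣʷo";            // assumed input: delimiter on the left only

console.log(firstMatch(original, sample));      // null
console.log(firstMatch(viaLookbehind, sample)); // ɣʷo
console.log(firstMatch(viaLookahead, sample));  // ɣʷo
console.log(firstMatch(viaLookbehind, nested)); // null
console.log(firstMatch(viaLookahead, nested));  // null
console.log(firstMatch(viaLookbehind, oneSide)); // ɣʷo
```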
Whenever your pattern is simple, it is best not to repeat the pattern in the lookarounds; you can usually just use . or .{x}, where x stands for the number of chars your consuming pattern part can match. Here, the consuming part appears to span two or three chars (a one- or two-char consonant plus one vowel), so you may probably use (?<!¸ª(?=.{2,3}¸ª))()(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a), but I do not have any edge cases to test against.
Enhancing this further may yield (?<!¸ª(?=.{2,3}¸ª))()(ɣʷ|[hr]₂|r₃|w|j)([eoøɑiɚyua])
(demo).
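The short form can be sanity-checked with a sketch like the one below. It assumes a positive nested lookahead and a width of .{2,3}, on the assumption that the consuming part is always two or three UTF-16 code units long (without the u flag, . matches one code unit). Every input except `sample` is a made-up edge case.

```javascript
// Short form: the nested lookahead only counts characters instead of repeating the pattern.
const shortForm = /(?<!¸ª(?=.{2,3}¸ª))()(ɣʷ|[hr]₂|r₃|w|j)([eoøɑiɚyua])/;

// Illustrative helper: text of the first match, or null.
const firstMatch = (re, s) => {
  const m = s.match(re);
  return m ? m[0] : null;
};

// Width check: the consuming part is 2 or 3 code units for these examples.
console.log("we".length, "ɣʷo".length); // 2 3

const sample = "t͡ʃe:h₁dɣʷo¸ªh₂¸ª"; // from the question
console.log(firstMatch(shortForm, sample));     // ɣʷo
console.log(firstMatch(shortForm, "¸ªɣʷo¸ª")); // null (both sides, 3 units)
console.log(firstMatch(shortForm, "¸ªwe¸ª"));  // null (both sides, 2 units)
console.log(firstMatch(shortForm, "¸ªwe"));    // we   (left side only)
```

Because the lookahead no longer checks what the counted characters are, the short form could reject slightly more than the explicit version on unusual inputs; that trade-off is worth testing against real data.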