Trying to exclude match when surrounded on both sides by a certain string

Time:10-17

What I'm looking to do is to modify a regex (JS flavor) to not match if the pattern is both preceded and followed by the same string.

By way of a simple analogy, say I want to match all instances of n that are not both preceded and followed by e. So, for example, the regex should not match the n in alkene, but it should still match the n in pen or nest, which only have the e directly adjacent to n on one side, not both.

Most older threads I've seen trying to find an answer basically say "just use negative lookarounds", but the problem is that (?<!e)n(?!e) doesn't match any of those inputs - because the lookbehind and lookahead are processed by the regex engine separately, so it considers either condition to be sufficient to exclude the match.
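To make the failure concrete, here is a minimal sketch of the naive attempt run against the three example words. The variable names are illustrative only.

```javascript
// Naive attempt: each lookaround can veto the match on its own,
// so an "e" on EITHER side is enough to reject the "n".
const naive = /(?<!e)n(?!e)/;

for (const word of ["alkene", "pen", "nest"]) {
  console.log(word, naive.test(word));
}
// alkene false   (e on both sides: rejection wanted)
// pen false      (e only before: rejection NOT wanted)
// nest false     (e only after: rejection NOT wanted)
```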

(The real regex is (?<!¸ª)()(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)(?!¸ª) and it's failing to match the ɣʷ in t͡ʃe:h₁dɣʷo¸ªh₂¸ª, but that makes the problem look a lot harder to explain than it needs to be)

How do you modify a regex to only exclude patterns when they're nested?

CodePudding user response:

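In the simple case, replace (?<!b)a(?!b) with (?<!b(?=ab))a or a(?!(?<=ba)b). The point is to nest one lookaround inside the other: put a lookahead inside the lookbehind, or a lookbehind inside the lookahead, so that the match is rejected only when both conditions hold at once.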
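A quick way to check both variants is to run them against the four possible surroundings of a. This is only a sketch; the names viaLookbehind, viaLookahead and samples are made up for the illustration.

```javascript
// Variant 1: the lookbehind sees "b", then looks forward for "ab".
const viaLookbehind = /(?<!b(?=ab))a/;
// Variant 2: the lookahead first looks back for "ba", then needs "b".
const viaLookahead = /a(?!(?<=ba)b)/;

const samples = ["bab", "ba", "ab", "a"];
for (const s of samples) {
  console.log(s, viaLookbehind.test(s), viaLookahead.test(s));
}
// bab false false   (b on both sides: rejected)
// ba true true      (b only before: still matches)
// ab true true      (b only after: still matches)
// a true true
```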

Here is your pattern fixed (without any optimizations). I took the lookahead, pasted it inside the lookbehind after ¸ª, reversed it (i.e. made it positive), and added the whole consuming pattern before ¸ª in that lookahead so that it can reach the right-hand ¸ª:

(?<!¸ª(?=(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)¸ª))()(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)

Or, if you put the lookbehind into lookahead:

()(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)(?!(?<=¸ª(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a))¸ª)
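As a rough check, both forms can be built from shared pieces and run against the string from the question plus a few short test strings. This sketch assumes the inner lookahead of the first form is positive, as described above; core, mark, fixBehind, fixAhead and the short test strings are hypothetical names and inputs, not part of the question.

```javascript
// Shared building blocks (names are illustrative).
const core = "(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a)";
const mark = "¸ª";

// Lookahead nested in the lookbehind, and the mirror-image form.
const fixBehind = new RegExp(`(?<!${mark}(?=${core}${mark}))()${core}`, "g");
const fixAhead = new RegExp(`()${core}(?!(?<=${mark}${core})${mark})`, "g");

const inputs = ["t͡ʃe:h₁dɣʷo¸ªh₂¸ª", "¸ªwa¸ª", "¸ªwa", "wa¸ª"];
for (const s of inputs) {
  // match() with the g flag returns all full matches, or null.
  console.log(s, s.match(fixBehind), s.match(fixAhead));
}
// First input: both find "ɣʷo" (marker only on the right).
// "¸ªwa¸ª": both give null (marker on both sides).
// "¸ªwa" and "wa¸ª": both find "wa".
```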

See the regex demo (and regex demo #2).

Whenever your pattern is simple, it is best not to repeat it in the lookarounds; you can usually just use . or .{x}, where x stands for the number of chars the consuming part of your pattern can match. Here, it is less clear how many chars the pattern can actually match. Counting a one- or two-char consonant plus one vowel gives two or three, so you may probably use (?<!¸ª(?=.{2,3}¸ª))()(ɣʷ|h₂|r₂|r₃|w|j)(?:e|o|ø|ɑ|i|ɚ|y|u|a), but I do not have any edge cases to test against.

Enhancing this further may yield (?<!¸ª(?=.{2,3}¸ª))()(ɣʷ|[hr]₂|r₃|w|j)([eoøɑiɚyua]) (demo).
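The trade-off of the wildcard shortcut can be seen by comparing it with the fully spelled-out lookahead. The sketch below assumes a positive inner lookahead and a consuming part that is two or three UTF-16 code units wide; the names spelledOut and wildcard and the test strings are invented for the illustration, and the empty group and capture groups are dropped to keep it short.

```javascript
// Inner lookahead repeats the consuming pattern exactly.
const spelledOut =
  /(?<!¸ª(?=(?:ɣʷ|[hr]₂|r₃|w|j)[eoøɑiɚyua]¸ª))(?:ɣʷ|[hr]₂|r₃|w|j)[eoøɑiɚyua]/g;
// Inner lookahead only counts characters (assumed width: 2 to 3).
const wildcard =
  /(?<!¸ª(?=.{2,3}¸ª))(?:ɣʷ|[hr]₂|r₃|w|j)[eoøɑiɚyua]/g;

const cases = ["¸ªwa¸ª", "¸ªh₂a¸ª", "¸ªwa", "¸ªwax¸ª"];
for (const s of cases) {
  console.log(s, s.match(spelledOut), s.match(wildcard));
}
// "¸ªwa¸ª" and "¸ªh₂a¸ª": null for both (surrounded on both sides).
// "¸ªwa": both find "wa".
// "¸ªwax¸ª": spelledOut finds "wa", wildcard gives null, because
// .{3} happens to span "wax" and reach the right-hand marker.
```

The last case is the kind of edge case where the shorter form is looser than the spelled-out one, so it is worth testing against real data before relying on it.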
