Use lookbehind to ignore a string between two matches, but capture it anyway

Time:10-09

I want to match instances of 'the blah blah noun' throughout a text, where the blah blah can include many words, but NOT 'the'.

This regex:

/(\b([Tt]he|[Aa]n|[Aa]|[Aa]nother)\b)?((?!\b(the|an?|another)\b).)*\b(cat|apple)\b/g

Produces these matches: The small cat sat on a huge black cat and ate an apple.

I tried some regex here: https://regexr.com/674l6

The matching is working, but I can't capture complete matches. When you hover over the matches in regexr, not all capture groups are available. I'm only able to capture the beginning and end of the matched string.

There's a basic demo here.

I've been tweaking the regex for hours but to no avail.

CodePudding user response:

For the characters in the middle, the capture group holds only the last single character, because the capture group itself is being repeated: each repetition overwrites the value captured by the previous one.
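To see this in isolation, here is a minimal JavaScript sketch that runs the pattern from the question against the sample sentence and prints what is left in the repeated group. The variable names (`original`, `sentence`, `repeated`) are illustrative only.

```javascript
// Pattern from the question, unchanged. Group 3 is the repeated group.
const original = /(\b([Tt]he|[Aa]n|[Aa]|[Aa]nother)\b)?((?!\b(the|an?|another)\b).)*\b(cat|apple)\b/g;
const sentence = "The small cat sat on a huge black cat and ate an apple.";

// Collect the full match and the value left in group 3 for each match.
const repeated = [...sentence.matchAll(original)].map((m) => ({
  match: m[0],
  group3: m[3],
}));

for (const { match, group3 } of repeated) {
  console.log(JSON.stringify(match), "-> group 3:", JSON.stringify(group3));
}
```

The full matches are found, but group 3 holds only the single character from the final repetition, which here is the space just before the noun.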

You can add a capture group around the whole part that you are matching, and turn the repeating capture group into a non-capturing group.

Since you want to match 'the blah blah noun', the first capture group should not be optional.

The pattern now has 3 capture groups instead of 5.

\b([Tt]he|[Aa]n(?:other)?|[Aa])\b((?:(?!\b(?:the|an?|another)\b).)*)\b(cat|apple)\b

The updated pattern in parts:

  • \b A word boundary to prevent a partial match
  • ( Capture group 1
    • [Tt]he|[Aa]n(?:other)?|[Aa] Match one of the alternatives
  • ) Close group 1
  • \b A word boundary
  • ( Capture group 2
    • (?:(?!\b(?:the|an?|another)\b).)* Repeat matching any character, asserting before each one that none of the alternatives starts at that position (tempered greedy token)
  • ) Close group 2
  • \b(cat|apple)\b Capture group 3, match either cat or apple between word boundaries
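As a quick check of the breakdown above, the following JavaScript sketch applies the updated pattern to the sample sentence from the question and collects the three groups per match. The names `updated`, `text`, and `parts`, and the labels `determiner`, `middle`, and `noun`, are made up for this illustration.

```javascript
// Updated pattern from the answer, with the global flag for matchAll.
const updated = /\b([Tt]he|[Aa]n(?:other)?|[Aa])\b((?:(?!\b(?:the|an?|another)\b).)*)\b(cat|apple)\b/g;
const text = "The small cat sat on a huge black cat and ate an apple.";

// Destructure each match: skip the full match, keep groups 1 to 3.
const parts = [...text.matchAll(updated)].map(
  ([, determiner, middle, noun]) => ({ determiner, middle, noun })
);

console.log(parts);
```

Group 2 now contains the whole middle part of each match, including the spaces on either side of it (for example `" huge black "`).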


A non-greedy version of the pattern can avoid capturing the surrounding spaces in group 2.
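The linked non-greedy pattern is not reproduced here, so the following is only one possible sketch of the idea, not necessarily the pattern the answer pointed to. It assumes the spaces can be consumed by `\s*` outside group 2 and that the quantifier inside group 2 is made lazy with `*?`. The names `lazy`, `sample`, and `middles` are hypothetical.

```javascript
// Assumed variant: \s* outside group 2 absorbs the surrounding whitespace,
// and the lazy *? stops group 2 before the whitespace that precedes the noun.
const lazy = /\b([Tt]he|[Aa]n(?:other)?|[Aa])\b\s*((?:(?!\b(?:the|an?|another)\b).)*?)\s*\b(cat|apple)\b/g;
const sample = "The small cat sat on a huge black cat and ate an apple.";

// Keep only group 2 of each match to show that it is now trimmed.
const middles = [...sample.matchAll(lazy)].map((m) => m[2]);

console.log(middles);
```

With this variant, group 2 is an empty string when nothing sits between the determiner and the noun, as in "an apple".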
