Regex pattern to match a clause except when it is enclosed by a tilde (~) on both sides

I want to extract every occurrence of the clause match-this in a string, except the occurrences that are enclosed by a tilde (~) on both sides.

For example, in this string:

match-this~match-this~ match-this ~match-this#match-this~match-this~match-this

The string above should yield 5 matches. They are explained below:

Either match-this~ or match-this is correct for the 1st match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.

I can use the pattern ~match-this~ to catch the occurrences enclosed by tildes, but when I tried to negate it with (?!(~match-this)), it matched nothing but empty strings.

When I tried the pattern [^~]match-this[^~], it caught only one match (the 2nd match from above). When I added an asterisk quantifier to one of the negated tilde classes, either [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk on both, it caught every match-this, including those enclosed by tildes. Is it possible to achieve this with a single regex, or does it need more than one?
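The behavior of these four attempts can be reproduced with a short sketch. It assumes Python's re module (the question does not name an engine), and the variable names are only illustrative.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# The four character-class attempts described in the question.
attempts = [
    r"[^~]match-this[^~]",
    r"[^~]match-this[^~]*",
    r"[^~]*match-this[^~]",
    r"[^~]*match-this[^~]*",
]

counts = {}
for pattern in attempts:
    found = re.findall(pattern, s)
    counts[pattern] = len(found)
    # Print the pattern, how many matches it produced, and the matched text.
    print(pattern, len(found), found)
```

Because [^~] consumes a character, the first three attempts can only succeed where a non-tilde character is actually present on the required side, and none of the four expresses "not a tilde on both sides".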

CodePudding user response：

If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #

You could match what you don't want, and capture in a group what you want to keep.

~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)

Explanation

~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
- (?:(?!match-this).)* Match any char, as long as match-this does not start at that position
- match-this Match literally
- (?:(?!match-this)[^#~])* Match any char except ~ or #, as long as match-this does not start at that position
) Close group 1

See a regex demo and a Python demo.

Example

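import re

# Alternative 1 skips tilde-enclosed text; alternative 2 captures what to keep in group 1.
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
# findall returns group 1 for every match; it is empty when alternative 1 matched, so drop those.
res = [m for m in re.findall(pattern, s) if m]

print(res)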

Output

['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']

CodePudding user response：

For this you need both kinds of lookarounds. The pattern below matches the 5 spots you want. There is a reason why it only works this way and not another, and why the prefix and/or suffix can't be included:

(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)

Explaining lookarounds:

(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
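To try the lookaround pattern on the question's string, here is a minimal sketch. It assumes Python's re module, which supports the fixed-width lookbehinds used here; the variable names are illustrative.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# Three alternatives: tilde only before, tilde only after, tilde on neither side.
pattern = re.compile(
    r"(?<=~)match-this(?!~)"
    r"|(?<!~)match-this(?=~)"
    r"|(?<!~)match-this(?!~)"
)

# Collect the start offset of every occurrence that is not enclosed by ~ on both sides.
starts = [m.start() for m in pattern.finditer(s)]
print(starts)
```

Each match is exactly the text match-this; the neighboring characters are only inspected, never included.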

Why other ways won't work:

[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts here or there must be a character other than a tilde. We could do the same for the end of the text, but this is a dead end anyway, because the class still consumes the character.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
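The "advanced too far" problem can be seen by comparing a consuming pattern with the lookaround pattern. The consuming pattern below is a hypothetical construction for illustration only (it is not part of the answer), and the sketch assumes Python's re module.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# Hypothetical consuming variant: a non-tilde neighbor becomes part of the match.
consuming = r"(?:^|[^~])match-this|match-this(?:[^~]|$)"
# Lookaround variant: neighbors are inspected but not consumed.
lookaround = r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"

consumed = re.findall(consuming, s)
looked = re.findall(lookaround, s)

print(len(consumed), consumed)
print(len(looked), looked)
```

In the consuming variant the # is swallowed as the trailing character of one match, so it is no longer available as the leading character of the next occurrence, and that occurrence is lost.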

The difference is between telling the engine to "not match" a character and telling it to "look out" for something without consuming characters or advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use the pattern. For me it works fine in TextPad 8, and it should also work in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.

What puzzles me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing the matches, since each match also comes with a starting position and a length: a first match at position 0 for 10 characters means you look at input text positions -1 and 10.
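As a rough sketch of that idea in Python (an assumption, since the answer itself was tested in TextPad and on regex101), the neighbors can be read from the input using each match's offsets. The helper name neighbors is hypothetical.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = re.compile(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"
)

def neighbors(text, m):
    """Return the characters just before and just after a match (None at the text edges)."""
    before = text[m.start() - 1] if m.start() > 0 else None
    after = text[m.end()] if m.end() < len(text) else None
    return before, after

context = [neighbors(s, m) for m in pattern.finditer(s)]
for before, after in context:
    print(repr(before), repr(after))
```

The explicit edge checks matter in Python, where index -1 would silently read the last character of the string instead of signaling "nothing before the match".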

CodePudding user response：

If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression

~match-this~|(\bmatch-this\b)

and keep only the matches that are captured (in capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~", its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.

Demo

The regular expression can be broken down as follows.

~match-this~   # match literal
|              # or
(              # begin capture group 1
  \b           # match a word boundary
  match-this   # match literal
  \b           # match a word boundary
)              # end capture group 1

Being so simple, this regular expression would be supported by most regex engines.
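A minimal sketch of the discard-or-capture idea, assuming Python's re module (the answer does not name an engine); the variable names are illustrative.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = re.compile(r"~match-this~|(\bmatch-this\b)")

# Keep a match only if capture group 1 took part in it;
# matches of the first alternative leave group 1 unset (None).
kept = [m.group(1) for m in pattern.finditer(s) if m.group(1) is not None]
print(kept)
```

Using finditer and testing group(1) against None makes the "captured or not" decision explicit, rather than relying on the empty strings that findall returns for unset groups.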