I want to extract every occurrence of match-this that is enclosed by anything other than a pair of tildes (~) in a string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches in the string above. The acceptable matches are listed below (enclosed by []):
- Either [match-this~] or [match-this] is correct for the 1st match.
- [match-this] is correct for the 2nd match.
- Either [~match-this#] or [~match-this] is correct for the 3rd match.
- Either [#match-this~] or [#match-this] or [match-this~] is correct for the 4th match.
- Either [~match-this] or [match-this] is correct for the 5th match.
I can use the pattern ~match-this~ to catch the occurrences enclosed by tildes, but when I tried to negate it with (?!(~match-this)), it returned nothing but empty matches.
When I tried the pattern [^~]match-this[^~], it caught only one match (the 2nd match from above). When I added an asterisk quantifier to one of the negated tilde classes, either [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk on both, it caught every match-this, including those enclosed by tildes.
Is it possible to achieve this with a single regex, or does it need more than one?
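For reference, the attempts described above can be reproduced with a short sketch. Python's re module is an assumption here, since the question does not name a regex engine, and the variable names are purely illustrative.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# Negative lookahead on its own: it succeeds at almost every position
# but consumes nothing, so every result is an empty string.
empty = re.findall(r"(?!~match-this)", s)

# A negated class has to consume one real character on each side.
both = re.findall(r"[^~]match-this[^~]", s)
trailing_star = re.findall(r"[^~]match-this[^~]*", s)
leading_star = re.findall(r"[^~]*match-this[^~]", s)

print(set(empty))     # {''}
print(both)           # [' match-this ']
print(trailing_star)  # [' match-this ', '#match-this']
print(leading_star)   # [' match-this ', 'match-this#']
```

The first and last occurrences are never found by these patterns because a negated class needs an actual character where the string starts or ends.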
CodePudding user response:
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #.
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~                    Match any chars except ~ or # between two ~
|                           Or
(                           Capture group 1
(?:(?!match-this).)*        Match any char, as long as match-this does not start at that position
match-this                  Match literally
(?:(?!match-this)[^#~])*    Match any char except ~ or #, as long as match-this does not start at that position
)                           Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
# findall returns group 1 for every match; drop the empty strings left by the ~...~ alternative
res = [m for m in re.findall(pattern, s) if m]
print(res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
CodePudding user response:
For this you need both kinds of lookarounds. The pattern below matches the 5 spots you want. There is a reason why it only works this way and not another, and why the prefix and/or suffix can't be included in the match:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
- (?=...) is a positive lookahead: what comes next must match
- (?!...) is a negative lookahead: what comes next must not match
- (?<=...) is a positive lookbehind: what comes before must match
- (?<!...) is a negative lookbehind: what comes before must not match
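As a quick check, the pattern can be run in Python, whose re module supports all four lookarounds as long as each lookbehind has a fixed width. Python is an assumption here, not an engine this answer was tested with, and the variable names are illustrative.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"

# Start offsets of every occurrence that is not enclosed by ~ on both sides.
starts = [m.start() for m in re.finditer(pattern, s)]
print(starts)  # [0, 23, 35, 46, 68]
```

The occurrences starting at offsets 11 and 57 are skipped because each has a tilde on both sides.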
Why other ways won't work:
- [^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem at the start of the text. The latter is a problem because the engine has then advanced too far, so a "don't match" character has already been consumed.
- (^|[^~]) would solve the first problem: either the text starts here or there must be a character that is not a tilde. We could do the same for the end of the text, but this is a dead end anyway.
- Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
- By the nature of lookarounds, the character in front or behind cannot be captured. Additionally, if you also want to match a leading or a trailing character, this collides with recognizing the next potential match.
It's the difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8, and it should also work fine in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What irritates me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing the matches, since they also come with starting positions and lengths: a first match at position 0 for 10 characters means you look at input text positions -1 and 10.
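A minimal sketch of that idea in Python (an assumed choice of language): the helper name neighbours is hypothetical, and an empty string stands in for "no character" at either end of the input.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"

def neighbours(text, m):
    """Return the characters directly before and after match m ('' at the edges)."""
    before = text[m.start() - 1] if m.start() > 0 else ""
    after = text[m.end()] if m.end() < len(text) else ""
    return before, after

context = [neighbours(s, m) for m in re.finditer(pattern, s)]
print(context)
# [('', '~'), (' ', ' '), ('~', '#'), ('#', '~'), ('~', '')]
```

The bounds checks matter: in Python, index -1 would silently wrap around to the last character of the string instead of signalling "before the start".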
CodePudding user response:
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only the matches that are captured (in capture group 1), discarding the rest. When the regex engine matches "~match-this~", its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
The regular expression can be broken down as follows.
~match-this~    # match literal
|               # or
(               # begin capture group 1
  \b            # match a word boundary
  match-this    # match literal
  \b            # match a word boundary
)               # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
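The "discard what is not captured" step could look like this in Python (an assumed choice of language; the variable names are illustrative):

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = r"~match-this~|(\bmatch-this\b)"

# Group 1 is None whenever the first alternative (~match-this~) matched,
# so those matches are dropped and only the captured ones are kept.
kept = [m.start(1) for m in re.finditer(pattern, s) if m.group(1) is not None]
print(kept)  # [0, 23, 35, 46, 68]
```

Because the alternation is tried left to right at each position, a tilde-enclosed occurrence is consumed as a whole before the capturing alternative gets a chance to see it.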