Longest substring that doesn't match a REGEX pattern-CodePudding

I'm preparing for an exam, and here's the task: Given a string which consists of uppercase letters of English alphabet, find the longest substring of it, which doesn't contain QW or WQ.

I know I could do re.split or something like that, but I made it a challenge for me to do it with a "regex matching" expression like len(max(re.findall(...), key=len)) without using split or other methods. Is it even possible?

To find all matching subtrings, I tried this:

list(map(lambda x: x[0], re.findall(r'(((?<!QW|WQ).) (?!QW|WQ))', text))

But this does match a substring which ends with WQ, for example. How do I fix this?

Let me also clarify something. Let's say the string is WQABCDEFGHQW. The answer in this case is QABCDEFGHQ, as this is exactly the longest substring that doesn't contain WQ or QW.

There can arise a related question. What if the task would be the same, except those Q-s in the string must not be matched, because they are still parts of WQ or QW. (so the answer would be just ABCDEFGH). In this case @Wiktor Stribiżew's answer is completely correct. And this is a difficult regular expression, and there also are some really helpful links in the comments, so I consider this answer helpful and very interesting as well. Please do not downvote it.

CodePudding user response：

You could use

(?:(?!Q(?!W)|(?<!W)Q).)

Python Example

import re

s = "HAKSDUWQHUPHPSAHFPUAHPUSNFJHJWQHPJQWPHJWQWASDIAS"
print(max(re.findall(r"(?:(?!Q(?!W)|(?<!W)Q).) ", s), key=len))  # 'HUPHPSAHFPUAHPUSNFJHJW'

See the regex demo

CodePudding user response：

.?(?:.(?<!QW|WQ))*

Any single character is ok. Any further character is ok unless it creates QW or WQ.

Demo

Python demo along with two split-solutions:

input:   'AQWBQWC':
find:   ['AQ', 'WBQ', 'WC', '']
split1: ['AQ', 'WBQ', 'WC']
split2: ['AQ', 'WBQ', 'WC']

input:   'AQWBWQC':
find:   ['AQ', 'WBW', 'QC', '']
split1: ['AQ', 'WBW', 'QC']
split2: ['AQ', 'WBW', 'QC']

input:   '':
find:   ['']
split1: ['']
split2: ['']

input:   'A':
find:   ['A', '']
split1: ['A']
split2: ['A']

input:   'Q':
find:   ['Q', '']
split1: ['Q']
split2: ['Q']

input:   'QW':
find:   ['Q', 'W', '']
split1: ['Q', 'W']
split2: ['Q', 'W']

(The extra empty matches don't matter, just like other non-longest matches don't matter. Will get discarded by your max(..., key=len).)

Code (Try it online!):

import re

find  = r'.?(?:.(?<!QW|WQ))*'
split1 = r'(?<=Q)(?=W)|(?<=W)(?=Q)'
split2 = r'(?=.(?<=QW|WQ))'

for s in 'AQWBQWC', 'AQWBWQC', '', 'A', 'Q', 'QW':
    print('input:  ', repr(s)   ':')
    print('find:  ', re.findall(find, s))
    print('split1:', re.split(split1, s))
    print('split2:', re.split(split2, s))
    print()

CodePudding user response：

You can use

(?:[A-Z](?<!QW|WQ)(?<!Q(?=W)|W(?=Q)))

See the regex demo.

Details:

(?: - start of a non-capturing group:
- [A-Z] - an uppercase ASCII letter
- (?<!QW|WQ) - a negative lookbehind that fails the match if, immediately on the left, there is a QW or WQ substring (i.e. if the [A-Z] matched W (preceded with Q) or Q (preceded with W))
- (?<!Q(?=W)|W(?=Q)) - a negative lookbehind that fails the match if, immediately on the left, there is a Q immediately followed with W, or W that is immediately followed with Q (i.e. if [A-Z] matched Q and the next char is W, or if [A-Z] matches W and the next char is Q)
) - end of the group, one or more occurrences.

Another approach:

(?:(?!QW|WQ)[A-Z](?<!QW|WQ))

See this regex demo. Details:

(?: - start of a non-capturing group:
- (?!QW|WQ) - a negative lookahead that fails the match if, immediately on the right, there is a QW or WQ substring
- [A-Z] - an uppercase ASCII letter
- (?<!QW|WQ) - a negative lookbehind that fails the match if, immediately on the left, there is a QW or WQ substring
) - end of the group, one or more occurrences.

CodePudding user response：

You could use that: (?:(?!WQ|QW).(?<!WQ|QW))

At each position (of the dot) it tests forward and backward if the character isn't a part of WQ or QW.

The key is to test forward before the dot and backward after the dot position.

demo

Other possible pattern: (?:.(?!.?(?<=WQ|QW)))

This time a lookbehind is inside a negative lookahead but since it is preceded by an optional character .?, it checks the two possible positions of the WQ/QW sequence.

demo

Or to reduce the number of steps an unrolled pattern:
[^WQ] (?:[QW](?!.?(?<=QW|WQ))[^WQ]*)*|(?:[QW](?!.?(?<=QW|WQ))[^WQ]*)

But it's a bit long (it's unrolled).

demo