I'm preparing for an exam, and here's the task:
Given a string which consists of uppercase letters of English alphabet, find the longest substring of it, which doesn't contain QW
or WQ
.
I know I could do re.split
or something like that, but I made it a challenge for me to do it with a "regex matching" expression like len(max(re.findall(...), key=len))
without using split
or other methods. Is it even possible?
To find all matching subtrings, I tried this:
list(map(lambda x: x[0], re.findall(r'(((?<!QW|WQ).) (?!QW|WQ))', text))
But this does match a substring which ends with WQ
, for example. How do I fix this?
Let me also clarify something. Let's say the string is WQABCDEFGHQW
. The answer in this case is QABCDEFGHQ
, as this is exactly the longest substring that doesn't contain WQ
or QW
.
There can arise a related question. What if the task would be the same, except those Q
-s in the string must not be matched, because they are still parts of WQ
or QW
. (so the answer would be just ABCDEFGH
). In this case @Wiktor Stribiżew's answer is completely correct. And this is a difficult regular expression, and there also are some really helpful links in the comments, so I consider this answer helpful and very interesting as well. Please do not downvote it.
CodePudding user response:
You could use
(?:(?!Q(?!W)|(?<!W)Q).)
Python Example
import re
s = "HAKSDUWQHUPHPSAHFPUAHPUSNFJHJWQHPJQWPHJWQWASDIAS"
print(max(re.findall(r"(?:(?!Q(?!W)|(?<!W)Q).) ", s), key=len)) # 'HUPHPSAHFPUAHPUSNFJHJW'
See the regex demo
CodePudding user response:
.?(?:.(?<!QW|WQ))*
Any single character is ok. Any further character is ok unless it creates QW or WQ.
Python demo along with two split-solutions:
input: 'AQWBQWC':
find: ['AQ', 'WBQ', 'WC', '']
split1: ['AQ', 'WBQ', 'WC']
split2: ['AQ', 'WBQ', 'WC']
input: 'AQWBWQC':
find: ['AQ', 'WBW', 'QC', '']
split1: ['AQ', 'WBW', 'QC']
split2: ['AQ', 'WBW', 'QC']
input: '':
find: ['']
split1: ['']
split2: ['']
input: 'A':
find: ['A', '']
split1: ['A']
split2: ['A']
input: 'Q':
find: ['Q', '']
split1: ['Q']
split2: ['Q']
input: 'QW':
find: ['Q', 'W', '']
split1: ['Q', 'W']
split2: ['Q', 'W']
(The extra empty matches don't matter, just like other non-longest matches don't matter. Will get discarded by your max(..., key=len)
.)
Code (Try it online!):
import re
find = r'.?(?:.(?<!QW|WQ))*'
split1 = r'(?<=Q)(?=W)|(?<=W)(?=Q)'
split2 = r'(?=.(?<=QW|WQ))'
for s in 'AQWBQWC', 'AQWBWQC', '', 'A', 'Q', 'QW':
print('input: ', repr(s) ':')
print('find: ', re.findall(find, s))
print('split1:', re.split(split1, s))
print('split2:', re.split(split2, s))
print()
CodePudding user response:
You can use
(?:[A-Z](?<!QW|WQ)(?<!Q(?=W)|W(?=Q)))
See the regex demo.
Details:
(?:
- start of a non-capturing group:[A-Z]
- an uppercase ASCII letter(?<!QW|WQ)
- a negative lookbehind that fails the match if, immediately on the left, there is aQW
orWQ
substring (i.e. if the[A-Z]
matchedW
(preceded withQ
) orQ
(preceded withW
))(?<!Q(?=W)|W(?=Q))
- a negative lookbehind that fails the match if, immediately on the left, there is aQ
immediately followed withW
, orW
that is immediately followed withQ
(i.e. if[A-Z]
matchedQ
and the next char isW
, or if[A-Z]
matchesW
and the next char isQ
)
)
- end of the group, one or more occurrences.
Another approach:
(?:(?!QW|WQ)[A-Z](?<!QW|WQ))
See this regex demo. Details:
(?:
- start of a non-capturing group:(?!QW|WQ)
- a negative lookahead that fails the match if, immediately on the right, there is aQW
orWQ
substring[A-Z]
- an uppercase ASCII letter(?<!QW|WQ)
- a negative lookbehind that fails the match if, immediately on the left, there is aQW
orWQ
substring
)
- end of the group, one or more occurrences.
CodePudding user response:
You could use that: (?:(?!WQ|QW).(?<!WQ|QW))
At each position (of the dot) it tests forward and backward if the character isn't a part of WQ or QW.
The key is to test forward before the dot and backward after the dot position.
Other possible pattern: (?:.(?!.?(?<=WQ|QW)))
This time a lookbehind is inside a negative lookahead but since it is preceded by an optional character .?
, it checks the two possible positions of the WQ/QW sequence.
Or to reduce the number of steps an unrolled pattern:
[^WQ] (?:[QW](?!.?(?<=QW|WQ))[^WQ]*)*|(?:[QW](?!.?(?<=QW|WQ))[^WQ]*)
But it's a bit long (it's unrolled).