I want to use a regex that matches an entire word if and only if the word meets both of the following conditions:
There are no repeating letters in the first N characters
There are no repeating letters in the last M characters
Let's say, for example, N = 6 and M = 4. Given the string "ABCDCFABCD ABCDEFABCB ABCDEFABCD", the behavior should be:
"ABCDCFABCD"
> not match because it has two C's in the first 6 characters. (ABCDCF)
"ABCDEFABCB"
> not match because it has two B's in the last 4 characters. (ABCB)
"ABCDEFABCD"
> match because it has no repeating letters in the first 6 and last 4 characters.
CodePudding user response:
One can test if the first 6 or last 4 characters of a string contain a repeating character by attempting to match the regular expression
RGX = /^.*(.).*\1.*(?<=^.{6})|(?=.{4}$).*(.).*\2/
Words with no repeating characters in either the first 6 or the last 4 characters are therefore exactly the words that do not match this regular expression. In Ruby, for example, we could write:
arr = ["ABCDCFABCD ABCDEFABCB", "ABCDCFABCB", "ABCDCFXYZABCD",
"ABCDEFXYZABCB", "ABCDCFXYZABCB", "ABCDEFABCD", "ABCDEFXYZABCD"]
The words among the elements of arr that have no repeating characters among the first 6 or last 4 are obtained by rejecting those words that match the regular expression RGX:
arr.reject { |word| word.match?(RGX) }
#=> ["ABCDEFABCD", "ABCDEFXYZABCD"]
The elements of the regular expression are as follows.
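^            # match beginning of string
.*           # match zero or more chars
(.)          # match a char and save to capture group 1
.*           # match zero or more chars
\1           # match content of capture group 1
.*           # match zero or more chars
(?<=^.{6})   # string location is preceded by 6 chars at start of string
|            # or
(?=.{4}$)    # string location is followed by 4 chars at end of string
.*           # match zero or more chars
(.)          # match a char and save to capture group 2
.*           # match zero or more chars
\2           # match content of capture group 2
(?<=^.{6}) is a positive lookbehind; (?=.{4}$) is a positive lookahead. The lookbehind forces the first alternative to end exactly 6 characters into the string, and the lookahead forces the second alternative to start exactly 4 characters before the end, so each repeat is found only inside the window of interest.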
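For readers working in Python rather than Ruby, the same rejection approach might be sketched as below. This assumes Python's re module accepts the pattern as written (the lookbehind is fixed-width, which re requires). The names build_repeat_regex and words_without_repeats and the parameters n and m are illustrative only and are not part of the answer above.

```python
import re

def build_repeat_regex(n, m):
    # Same pattern as RGX, with the hard-coded 6 and 4 replaced by n and m.
    # Doubled braces are literal braces inside an f-string.
    return re.compile(rf"^.*(.).*\1.*(?<=^.{{{n}}})|(?=.{{{m}}}$).*(.).*\2")

def words_without_repeats(words, n, m):
    rgx = build_repeat_regex(n, m)
    # Keep only the words that the "has a repeat" pattern fails to match.
    return [word for word in words if not rgx.search(word)]

words = "ABCDCFABCD ABCDEFABCB ABCDEFABCD".split()
print(words_without_repeats(words, 6, 4))  # ['ABCDEFABCD']
```

Note that a word shorter than n or m characters can never satisfy the corresponding lookaround, so this pattern never flags such a word, whatever letters it contains.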
CodePudding user response:
I'm pretty sure this is way more complicated with regex than without.
def get_matches(document, N, M):
    matches = []
    for word in document.split():
        # skip the word if either substring has fewer unique characters than characters
        if len(set(word[:N])) < len(word[:N]) or len(set(word[-M:])) < len(word[-M:]):
            continue
        matches.append(word)
    return matches
Or alternatively:
def get_matches(document, N, M):
    return [word for word in document.split() if len(set(word[:N])) == len(word[:N]) and len(set(word[-M:])) == len(word[-M:])]
Called with N = 6 and M = 4, this returns:
print(get_matches("ABCDCFABCD ABCDEFABCB ABCDEFABCD", 6, 4))
> ['ABCDEFABCD']
If it has to be a regex, you could probably try something like backreferences to single-character capture groups, but I'm not sure how to make that work for this case.
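One way the backreference idea could be made to work is sketched below. It is a hybrid: the first N and last M characters are still sliced off in Python, and a small regex only detects a repeated character inside each slice. The names HAS_REPEAT and get_matches_backref are hypothetical, and the sketch assumes n and m are at least 1.

```python
import re

# (.) captures one character, .* skips ahead, and \1 requires
# that same character to appear again later in the string.
HAS_REPEAT = re.compile(r"(.).*\1")

def get_matches_backref(document, n, m):
    # Assumes n >= 1 and m >= 1; word[-0:] would be the whole word.
    return [
        word
        for word in document.split()
        if not HAS_REPEAT.search(word[:n]) and not HAS_REPEAT.search(word[-m:])
    ]

print(get_matches_backref("ABCDCFABCD ABCDEFABCB ABCDEFABCD", 6, 4))  # ['ABCDEFABCD']
```

Slicing first keeps the pattern short, because the regex no longer has to encode where the first N or last M characters begin and end.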