Exclude words with repeating letters in the first N characters and in the last M characters

I want to use a regex that matches an entire word if and only if it meets both of the following conditions:

There are no repeating letters in the first N characters
There are no repeating letters in the last M characters

Let's say, for example, N = 6 and M = 4. Given the string "ABCDCFABCD ABCDEFABCB ABCDEFABCD", the behavior should be:

"ABCDCFABCD" > no match because it has two C's in the first 6 characters. (ABCDCF)

"ABCDEFABCB" > no match because it has two B's in the last 4 characters. (ABCB)

"ABCDEFABCD" > match because it has no repeating letters in the first 6 and last 4 characters.

CodePudding user response：

One can test if the first 6 or last 4 characters of a string contain a repeating character by attempting to match the regular expression

RGX = /^.*(.).*\1.*(?<=^.{6})|(?=.{4}$).*(.).*\2/


Words that have no repeating characters among the first 6 and the last 4 characters are therefore the words that do not match this regular expression. In Ruby, for example, we could write:

arr = ["ABCDCFABCD ABCDEFABCB", "ABCDCFABCB", "ABCDCFXYZABCD",
       "ABCDEFXYZABCB", "ABCDCFXYZABCB", "ABCDEFABCD", "ABCDEFXYZABCD"]

The elements of arr that have no repeating characters among the first 6 and the last 4 characters are obtained by rejecting those that match the regular expression RGX:

arr.reject { |word| word.match?(RGX) }
  #=> ["ABCDEFABCD", "ABCDEFXYZABCD"]

The elements of the regular expression are as follows.

^           # match beginning of string
.*          # match zero or more chars
(.)         # match a char and save to capture group 1
.*          # match >= 0 chars
\1          # match content of capture group 1
.*          # match >= 0 chars
(?<=^.{6})  # string location is preceded by 6 chars at start of string
|           # or
(?=.{4}$)   # string location is followed by 4 chars at end of string
.*          # match >= 0 chars
(.)         # match a char and save to capture group 2
.*          # match >= 0 chars
\2          # match content of capture group 2

(?<=^.{6}) is a positive lookbehind; (?=.{4}$) is a positive lookahead.
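
For readers who want N and M as parameters, here is a minimal Python sketch of the same pattern. The helper names repeat_regex and words_without_repeats are made up for illustration, and the sketch assumes Python's re module accepts the fixed-width lookbehind (?<=^.{n}) the same way Ruby does here.

```python
import re

def repeat_regex(n, m):
    """Pattern that matches when the first n or the last m characters repeat a character."""
    return re.compile(
        rf"^.*(.).*\1.*(?<=^.{{{n}}})"   # a repeat inside the first n characters
        rf"|(?=.{{{m}}}$).*(.).*\2"      # or a repeat inside the last m characters
    )

def words_without_repeats(words, n, m):
    """Keep only the strings that the 'has a repeat' pattern does not match."""
    rgx = repeat_regex(n, m)
    return [w for w in words if not rgx.search(w)]

arr = ["ABCDCFABCD ABCDEFABCB", "ABCDCFABCB", "ABCDCFXYZABCD",
       "ABCDEFXYZABCB", "ABCDCFXYZABCB", "ABCDEFABCD", "ABCDEFXYZABCD"]
print(words_without_repeats(arr, 6, 4))
# prints ['ABCDEFABCD', 'ABCDEFXYZABCD']
```

One caveat with this pattern: a string shorter than n (or m) characters can never satisfy the corresponding lookaround, so that half of the check is skipped for it. For example, "AA" is kept with n = 6 and m = 4.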

CodePudding user response：

I'm pretty sure this is way more complicated with regex than without.

def get_matches(document, N, M):
    matches = []        
    for word in document.split():
        # check if number of unique characters is smaller than number of characters for both substrings
        if len(set(word[:N])) < len(word[:N]) or len(set(word[-M:])) < len(word[-M:]):
            continue
        matches.append(word)
    return matches

Or alternatively

def get_matches(document, N, M):
    return [word for word in document.split() if len(set(word[:N])) == len(word[:N]) and len(set(word[-M:])) == len(word[-M:])]

Which returns

print(get_matches("ABCDCFABCD ABCDEFABCB ABCDEFABCD", 6, 4))
> ['ABCDEFABCD']

If it has to be regex, you could probably try something like backreferences to single-character capture groups, but I'm not sure how to make that work for this case.
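
One way the backreference idea might be made to work directly on the whole string is with two negative lookaheads at the start of each word. This is only a sketch: word_pattern and get_matches_regex are hypothetical names, and it assumes words consist of \w characters (so digits and underscores count as "letters") and are delimited by word boundaries rather than by whitespace.

```python
import re

def word_pattern(n, m):
    """Pattern matching whole words with no repeat in the first n or last m characters."""
    return re.compile(
        r"\b"
        # Reject if some character repeats before position n of the word.
        rf"(?!\w*(\w)\w*\1\w*(?<=\b\w{{{n}}}))"
        # Reject if some character repeats within the final m characters.
        rf"(?!\w*(?=\w{{{m}}}\b)\w*(\w)\w*\2)"
        r"\w+\b"
    )

def get_matches_regex(document, n, m):
    """Return every word in document that passes both checks."""
    return [match.group() for match in word_pattern(n, m).finditer(document)]

print(get_matches_regex("ABCDCFABCD ABCDEFABCB ABCDEFABCD", 6, 4))
# prints ['ABCDEFABCD']
```

Unlike the set-based version, this sketch lets through words shorter than n or m characters without checking them, because the lookarounds need exactly n and m word characters to anchor on.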