Exclude words with repeating letters in the first N characters and in the last M characters


I want to use a regex that matches an entire word if and only if it meets both of the following conditions:

  • There are no repeating letters in the first N characters

  • There are no repeating letters in the last M characters

Let's say, for example, N = 6 and M = 4. Given the string "ABCDCFABCD ABCDEFABCB ABCDEFABCD", the behavior should be:

"ABCDCFABCD" > no match because it has two C's in the first 6 characters. (ABCDCF)

"ABCDEFABCB" > no match because it has two B's in the last 4 characters. (ABCB)

"ABCDEFABCD" > match because it has no repeating letters in the first 6 and last 4 characters.

CodePudding user response:

One can test if the first 6 or last 4 characters of a string contain a repeating character by attempting to match the regular expression

RGX = /^.*(.).*\1.*(?<=^.{6})|(?=.{4}$).*(.).*\2/


Words that have no repeating characters in either the first 6 or the last 4 characters are therefore the words that do not match this regular expression. In Ruby, for example, given an array arr of words, they are obtained by rejecting the words that match RGX:

arr.reject { |word| word.match?(RGX) }

The elements of the regular expression are as follows.

^           # match beginning of string
.*          # match zero or more chars
(.)         # match a char and save to capture group 1
.*          # match >= 0 chars
\1          # match content of capture group 1
.*          # match >= 0 chars
(?<=^.{6})  # string location is preceded by 6 chars at start of string
|           # or
(?=.{4}$)   # string location is followed by 4 chars at end of string
.*          # match >= 0 chars
(.)         # match a char and save to capture group 2
.*          # match >= 0 chars
\2          # match content of capture group 2

(?<=^.{6}) is a positive lookbehind; (?=.{4}$) is a positive lookahead.
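The same pattern also seems to carry over to Python's re module, since the lookbehind has a fixed width of 6. Here is a minimal sketch under that assumption; the function name words_without_repeats and the sample list arr are made up for illustration and are not part of the answer above.

```python
import re

# Same pattern as RGX above, written as a Python raw string.
RGX = re.compile(r'^.*(.).*\1.*(?<=^.{6})|(?=.{4}$).*(.).*\2')


def words_without_repeats(words):
    """Keep only the words in which RGX finds no match (hypothetical helper)."""
    return [word for word in words if not RGX.search(word)]


arr = "ABCDCFABCD ABCDEFABCB ABCDEFABCD".split()
print(words_without_repeats(arr))  # ['ABCDEFABCD']
```

One thing to keep in mind: a word shorter than 6 characters can never satisfy the lookbehind, and a word shorter than 4 can never satisfy the lookahead, so that half of the check is silently skipped for short words.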

CodePudding user response:

I'm pretty sure this is way more complicated with regex than without.

def get_matches(document, N, M):
    matches = []
    for word in document.split():
        # skip the word if either substring has fewer unique characters than characters
        if len(set(word[:N])) < len(word[:N]) or len(set(word[-M:])) < len(word[-M:]):
            continue
        matches.append(word)
    return matches

Or alternatively

def get_matches(document, N, M):
    return [word for word in document.split() if len(set(word[:N])) == len(word[:N]) and len(set(word[-M:])) == len(word[-M:])]

Which, for the example string from the question with N = 6 and M = 4, returns

['ABCDEFABCD']

If it has to be regex, you could probably try something like backreferences to single-character capture groups, but I'm not sure how to make that work for this case.
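For what it's worth, the backreference idea can probably be made to work by wrapping each backreference in a negative lookahead anchored at the start of the word. The following is only a sketch: build_pattern is a hypothetical helper name, and it assumes that a "word" is a run of \w characters (which also covers digits and underscores, not just letters).

```python
import re


def build_pattern(n, m):
    """Build a regex matching whole words that have no repeated character
    in their first n and last m characters (hypothetical helper)."""
    return re.compile(
        r'\b'
        # Reject the word if a character repeats inside its first n characters:
        # the lookbehind forces the scanned prefix to be exactly n characters long.
        rf'(?!\w*(\w)\w*\1\w*(?<=\b\w{{{n}}}))'
        # Reject the word if a character repeats inside its last m characters:
        # the inner lookahead jumps to the point where exactly m characters remain.
        rf'(?!\w*(?=\w{{{m}}}\b)\w*(\w)\w*\2)'
        r'\w+\b'
    )


text = "ABCDCFABCD ABCDEFABCB ABCDEFABCD"
pattern = build_pattern(6, 4)
print([match.group(0) for match in pattern.finditer(text)])  # ['ABCDEFABCD']
```

Because the pattern contains capture groups, finditer with match.group(0) is used here rather than findall, which would return the group contents instead of the whole word. Words shorter than n or m are not checked on that side, so the set-based version is likely still the easier one to reason about.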
