Home > Software engineering >  Regular expression to find words containing a subset of specific letters and frequencies of these le
Regular expression to find words containing a subset of specific letters and frequencies of these le

Time:12-21

Given specific letters, such as "CDEIORSVY", I would like to use a regular expression that filters all possible words in the English dictionary (which is provided, and all capitalized) to leave k-letter words that contain only these letters. In this example, a 9-letter solution is "DISCOVERY".

This problem is akin to finding suitable words for Scrabble, or finding solutions to the Letters round or Conundrum in the game show Countdown.

Given specific letters with repeat letters, such as "DENOOPRSU", a 9-letter solution is "PONDEROUS" but not "SPONSORED". A 7-letter solution is "ONEROUS" but not "USURPER".

My question is what would be the regular expression that takes into account the constraints of specific letters, frequencies of letters and k-letter solutions?

My regular expression so far is: "^[DENOOPRSU]{9,9}$" and "^[DENOOPRSU]{7,7}$" for the example above. However, this does not take into regard the constraints on the frequencies of the letters, and produces the incorrect words as in the examples above. My workaround is to filter the results from this regular expression by using Counter from the Collections library on Python, but this is very slow. Therefore, I would like a regular expression that incorporates the constraints of letters and frequencies.

CodePudding user response:

Not sure if it's possible to do this with a single static regex, but a dynamic implementation can be done. See sample regex below:

^(?!.*([DENPRSU]).*\1)(?!.*([O]).*\2.*\2)[DENOOPRSU]

The first group of letters is constrained to only appear once in the solution, and the second group may only appear twice. More groups can be appended with the same pattern - for example constraining a letter to appear three times or less would just be (?!.*([chosen letters]).*\3.*\3.*\3).

CodePudding user response:

Perhaps you could use an approach without a regex, getting the number of occurrences for the letters in a dictionary.

Then check if there is a rest after intersecting the 2 input strings, and if there is then there is a character present that should not be used.

def charsWithCounter(s):
    res = {}
    for c in s:
        res[c] = res.get(c, 0)   1
    return res

def isValid(str_source, str_to_compare):
    if list(set(str_to_compare) - set(str_source)):
        return False

    dct_source = charsWithCounter(str_source)
    dct_to_compare = charsWithCounter(str_to_compare)

    for key in dct_to_compare.keys():
        if key in dct_source and dct_to_compare[key] > dct_source[key]:
            return False
    return True


dct = {
    "DENOOPRSU": ["PONDEROUS", "PONDEROUSZ", "PONDEROU S", "SPONSORED", "ONEROUS", "USURPER", "ABC", " ", "OOO", "O"],
    "CDEIORSVY": ["DISCOVERY"]
}

for k, v in dct.items():
    for s in v:
        print("'{}' --> '{}' : {}".format(k, s, str(isValid(k, s))))

Output

'DENOOPRSU' --> 'PONDEROUS' : True
'DENOOPRSU' --> 'PONDEROUSZ' : False
'DENOOPRSU' --> 'PONDEROU S' : False
'DENOOPRSU' --> 'SPONSORED' : False
'DENOOPRSU' --> 'ONEROUS' : True
'DENOOPRSU' --> 'USURPER' : False
'DENOOPRSU' --> 'ABC' : False
'DENOOPRSU' --> ' ' : False
'DENOOPRSU' --> 'OOO' : False
'DENOOPRSU' --> 'O' : True
'CDEIORSVY' --> 'DISCOVERY' : True

See a Python demo

  • Related