regex, string contains several specific characters, exactly k times

I have a list of words, and I want to filter them based on specific characters and the number of times each character has to appear, in no particular order. All other characters can appear any number of times. For example,

Filter all the words that contain the letter "a" exactly 1 time, and the letter "b" exactly 2 times.

"bbad" or "bxab" should match, "bbaad" should not.

So far I have arrived at this regex, which only requires each character to appear at least once and does not constrain how many times it appears:

\b(?=[^\Wa]*a)(?=[^\Wb]*b)\w+\b

I tried:

\b(?=[^\Wa]{1})(?=[^\Wb]{2})\w+\b

but that doesn't work. I also want the regex to be somewhat modular, because the desired characters are determined at run time.

Thank you for your time and help!

CodePudding user response：

Dunno if you're set on using regex, but I prefer to use normal logic as it's easier to read.

The code below does what you want. Pass it a word and a list of lists or tuples, each holding the letter to search for and the number of times it needs to occur.

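def filter_words(text: str, filters: list) -> bool:
    # True only if every (letter, count) pair matches the word exactly.
    # Unpacking into letter/count also avoids shadowing the built-in filter().
    return all(text.count(letter) == count for letter, count in filters)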


wordlist = ["bbad", "bbaad"]


filters = [
    ("a", 1),
    ("b", 2)
]


for word in wordlist:
    print(f"{word} -> {filter_words(word, filters)}")

Output:

bbad -> True
bbaad -> False
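
To actually filter the list rather than print a verdict per word, the same count check can sit inside a list comprehension. This is only a minimal sketch: the names has_exact_counts, required, and selected are made up for illustration, and the dict stands in for whatever is built at run time.

```python
def has_exact_counts(word: str, counts: dict[str, int]) -> bool:
    # Every listed character must occur exactly the requested number of times;
    # characters that are not listed may occur any number of times.
    return all(word.count(char) == n for char, n in counts.items())


words = ["bbad", "bxab", "bbaad"]
required = {"a": 1, "b": 2}  # hypothetical: would be assembled at run time

selected = [w for w in words if has_exact_counts(w, required)]
print(selected)  # ['bbad', 'bxab']
```

A dict keeps one count per character, so a repeated character cannot end up with two conflicting requirements.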

CodePudding user response：

If the word must contain 'ab', with exactly two 'a's and exactly one 'b', one could use the following regular expression (which could be constructed programmatically):

(?<!\w)(?=[^a]*(?:a[^a]*){2}\b)(?=[^b]*b[^b]*\b)\w*ab\w*

If, for example, three 'a's rather than two must be present, change {2} to {3}.

(?<!\w)       # preceding char cannot be a word char
(?=           # begin a positive lookahead
  [^a]*       # match >= 0 chars other than 'a'
  (?:a[^a]*)  # match 'a' followed by >= 0 chars other than 'a' in
              # a non-capture group
  {2}         # execute the non-capture group twice
  \b          # match a word boundary
)             # end positive lookahead
(?=           # begin a positive lookahead
  [^b]*       # match >= 0 chars other than 'b'
  b           # match 'b'
  [^b]*       # match >= 0 chars other than 'b'
  \b          # match a word boundary
)             # end positive lookahead
\w*           # match >= 0 word chars
ab            # match 'ab'
\w*           # match >= 0 word chars
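
The breakdown above maps closely onto Python's re.VERBOSE mode, so the expression can be tried as-is. This is a rough sketch that assumes each word is tested on its own with fullmatch; the sample words are invented for illustration.

```python
import re

# Same expression as above, written in verbose mode so comments are allowed.
pattern = re.compile(r"""
    (?<!\w)                   # preceding char cannot be a word char
    (?=[^a]*(?:a[^a]*){2}\b)  # exactly two 'a' chars up to a word boundary
    (?=[^b]*b[^b]*\b)         # exactly one 'b' char up to a word boundary
    \w*ab\w*                  # the word itself, which must contain 'ab'
""", re.VERBOSE)

# Hypothetical sample words, each checked separately.
words = ["aab", "aba", "xaab", "ab", "aaab", "abab", "baa"]
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)  # ['aab', 'aba', 'xaab']
```

Note that [^a] and [^b] also match spaces, so in running text a lookahead can reach into the next word; testing one word at a time sidesteps that.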

(?<!\w) is a negative lookbehind.
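
Since the desired characters are only known at run time, one way to construct such an expression programmatically is to emit one lookahead per character. The sketch below is an assumption-laden variant, not the exact expression above: build_pattern is a made-up helper, it drops the 'ab' requirement in favour of a plain \w+ body (matching the original question), and it borrows the question's [^\Wa] class so that a lookahead cannot run past the current word.

```python
import re


def build_pattern(counts: dict[str, int]) -> re.Pattern:
    """Compile a regex for whole words with exactly counts[c] copies of each c."""
    lookaheads = []
    for char, k in counts.items():
        if k < 1:
            # With k == 0 the trailing \b could succeed at the start of the word.
            raise ValueError("this sketch assumes every count is at least 1")
        c = re.escape(char)
        other = rf"[^\W{c}]"  # any word char except `char`
        lookaheads.append(rf"(?={other}*(?:{c}{other}*){{{k}}}\b)")
    return re.compile(r"(?<!\w)" + "".join(lookaheads) + r"\w+")


pattern = build_pattern({"a": 1, "b": 2})
matches = pattern.findall("bbad bxab bbaad abb ab")
print(matches)  # ['bbad', 'bxab', 'abb']
```

To require a literal 'ab' as in the expression above, the \w+ body could be swapped for \w*ab\w*.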