regex for combining length, inclusion and exclusion?

Time:05-04

A search on SO with just [regex] gave me 249'446 hits and a search with [regex] inclusion exclusion gave me 47 hits but I guess none of the latter (maybe some of the former?) fit my case.

I am also aware of regex references such as https://www.regular-expressions.info/refquick.html, but I suspect there is a regex concept I am not yet familiar with, and I would be grateful for hints.

Here is a minimal example of what I am trying to do with a given list of strings.

Find all items which:

  • have a fixed defined number of characters, i.e. length
  • must include all characters from a certain list (doesn't matter at what position and if multiple times)
  • must NOT include any characters from a certain list

Constructs like: [ei^no]{4}, ((?![no])[ei]){4} and a lot of other more complex trials didn't give the desired results.

Hence, I currently implemented this as a 3 step process with checking the length, doing a search and a match. This looks pretty cumbersome and inefficient to me.

Is there a more efficient way to do this?

Script:

import re

items = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve']

count          = 4
mustContain    = 'ei'   # all of these characters at least once
mustNotContain = 'no'   # none of those chars

hits1 = []
for item in items:
    if len(item)==count:
        hits1.append(item)
print("Hits1:",hits1)

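hits2 = []
for hit in hits1:
    # keep the item only if every required character occurs in it
    # (a single '[ei]' class would accept items containing just one of them)
    if all(re.search(re.escape(c), hit) for c in mustContain):
        hits2.append(hit)
print("Hits2:", hits2)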

hits3 = []
for hit in hits2:
    # match only if the whole item consists of characters outside the forbidden set
    regex = '[^{}]*$'.format(mustNotContain)
    if re.match(regex,hit):
        hits3.append(hit)
print("Hits3:", hits3)


Result:

Hits1: ['four', 'five', 'nine']
Hits2: ['five', 'nine']
Hits3: ['five']

CodePudding user response:

If you are interested in a regex approach, you can create a single dynamic pattern that looks like:

^(?=.{4}$)(?![^no\n]*[no])(?=[^e\n]*e)[^i\n]*i.*$

Explanation

  • ^ Start of string
  • (?=.{4}$) Assert that exactly 4 characters remain up to the end of the string
  • (?![^no\n]*[no]) Assert that no n or o occurs to the right, using a leading negated character class
  • (?=[^e\n]*e) Assert that an e occurs to the right
  • [^i\n]*i Match zero or more characters other than i (or a newline), then match i
  • .* Match the rest of the line
  • $ end of string
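
To see which part of the pattern rejects which word, the building blocks listed above can be tried one at a time. This is only an illustrative sketch: the `checks` dictionary and the `report` helper are hypothetical names, not part of the original answer.

```python
import re

# The building blocks of the pattern, compiled separately for illustration.
checks = {
    "length 4": re.compile(r"(?=.{4}$)"),
    "no n/o":   re.compile(r"(?![^no\n]*[no])"),
    "has e":    re.compile(r"(?=[^e\n]*e)"),
    "has i":    re.compile(r"[^i\n]*i"),
}

def report(word):
    """Return which building blocks succeed at the start of `word`."""
    return {name: rx.match(word) is not None for name, rx in checks.items()}

for word in ["five", "nine", "four", "tree", "eight"]:
    print(word, report(word))
```

Only a word for which all four checks succeed, such as five, is accepted by the combined pattern; nine fails the n/o check, tree lacks an i, and eight is too long.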


Example

import re

items = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 'tree']
hits = [item for item in items if re.match(r"(?=.{4}$)(?![^no\n]*[no])(?=[^e\n]*e)[^i\n]*i.*$", item)]

print(hits)

Output

['five']
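
Since the pattern is meant to be dynamic, here is one possible way to assemble it from the question's three inputs. This is a sketch under assumptions: `build_pattern` is a hypothetical helper name, each required or forbidden entry is assumed to be a single character, and one lookahead is emitted per required character instead of consuming the last one as in the hand-written pattern.

```python
import re

def build_pattern(count, must_contain, must_not_contain):
    """Assemble one regex from a length, required chars and forbidden chars."""
    # Exactly `count` characters up to the end of the string.
    parts = ["(?=.{" + str(count) + "}$)"]
    if must_not_contain:
        forbidden = "".join(re.escape(c) for c in must_not_contain)
        # Fail if any forbidden character occurs anywhere to the right.
        parts.append("(?![^" + forbidden + r"\n]*[" + forbidden + "])")
    for c in must_contain:
        # One positive lookahead per required character.
        esc = re.escape(c)
        parts.append("(?=[^" + esc + r"\n]*" + esc + ")")
    # The lookaheads consume nothing, so match the actual characters here.
    parts.append(".*")
    return re.compile("".join(parts))

items = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
         'nine', 'ten', 'eleven', 'twelve', 'tree']

pattern = build_pattern(4, "ei", "no")
hits = [item for item in items if pattern.fullmatch(item)]
print(pattern.pattern)
print(hits)
```

`re.escape` is used so that characters such as `]` or `^` in either list do not break the character classes.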

Alternatively, without a regex, using all() and a list comprehension:

items = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 'tree']

count = 4
mustContain = ["e", "i"]  # all of these characters at least once
mustNotContain = ["n", "o"]  # none of those chars

hits = [
    item for item in items if
    len(item) == count and
    all(c in item for c in mustContain) and
    all(c not in item for c in mustNotContain)
]
print(hits)

Output

['five']
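
The same three conditions can also be written with set operations, which some readers may find easier to scan. This is an alternative sketch, not part of the original answer; the `matches` function name is hypothetical.

```python
def matches(item, count, must_contain, must_not_contain):
    """True if item has the given length, all required and no forbidden chars."""
    chars = set(item)
    return (
        len(item) == count
        and set(must_contain) <= chars          # every required char is present
        and chars.isdisjoint(must_not_contain)  # no forbidden char is present
    )

items = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
         'nine', 'ten', 'eleven', 'twelve', 'tree']

hits = [item for item in items if matches(item, 4, "ei", "no")]
print(hits)
```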

