regular expression: How to match a list of words (allow combination)?

Time:04-15

I'm trying to construct a regular expression to capture units and the corresponding values.

For example,

import re 

candis = ['mmol','mm']
test_reg = '|'.join([ut + r"\-?[1-4]?" for ut in candis])
test_reg = r"\b(?:" + test_reg + r")\b"
test_reg = r"\d (?:" + test_reg + r"\s?){1,3}"

test_str = '3 mmol mm'
re.findall(test_reg,test_str)

Here, test_reg is constructed to capture the units mmol mm together with the corresponding value, 3.

However, as you can readily observe in the example, test_reg does not work for a string like 3 mmol2mm because of the \b.
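The failure can be reproduced with a short check. This is a minimal sketch that assumes the pattern is assembled by string concatenation as above; the names units, pattern, spaced and glued are introduced here only for illustration.

```python
import re

candis = ['mmol', 'mm']
# Each unit may carry an optional '-' and an optional exponent digit 1-4.
units = '|'.join(ut + r"\-?[1-4]?" for ut in candis)
# Every unit is wrapped in \b...\b, so it must start and end at a word boundary.
pattern = r"\d (?:\b(?:" + units + r")\b\s?){1,3}"

spaced = re.findall(pattern, '3 mmol mm')
glued = re.findall(pattern, '3 mmol2mm')
print(spaced)  # ['3 mmol mm']
print(glued)   # []
```

In 3 mmol2mm there is no word boundary between 2 and the following m, nor between l and 2, so no alternative can satisfy the trailing \b.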

How can I construct a regular expression that also matches 3 mmol2mm and 3 mmolmm, while only accepting word combinations drawn strictly from candis? (3 mmol mmb should not match.)

CodePudding user response:

You can use

\d+(?=((?:\s*(?:mmol|mm)-?[1-4]?){1,3}))\1\b

See the regex demo. Details:

  • \d+ - one or more digits
  • (?=((?:\s*(?:mmol|mm)-?[1-4]?){1,3})) - a positive lookahead with a capturing group inside, used to imitate an atomic group, that matches a location that is immediately followed by
    • (?:\s*(?:mmol|mm)-?[1-4]?){1,3} - one, two or three occurrences of
      • \s* - zero or more whitespaces
      • (?:mmol|mm) - a candis value
      • -? - an optional - char
      • [1-4]? - an optional digit from 1 to 4
  • \1 - the Group 1 value (a backreference consumes the captured text as is, so the engine cannot backtrack into it)
  • \b - word boundary.
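To see why the atomic-group imitation matters, the following sketch compares the pattern against a plain version without the lookahead and backreference. The names unit, plain, atomic, plain_hits and atomic_hits are assumptions made for this illustration only.

```python
import re

# Shared sub-pattern: one to three units, each with optional '-' and exponent.
unit = r"(?:\s*(?:mmol|mm)-?[1-4]?){1,3}"

# Plain version: the engine may backtrack and give up the last unit.
plain = re.compile(r"\d+" + unit + r"\b")
# Lookahead + backreference: the captured unit run cannot be shortened afterwards.
atomic = re.compile(r"\d+(?=(" + unit + r"))\1\b")

text = '3 mmol mmb'
plain_hits = [m.group() for m in plain.finditer(text)]
atomic_hits = [m.group() for m in atomic.finditer(text)]
print(plain_hits)   # ['3 mmol']
print(atomic_hits)  # []
```

The plain pattern falls back to the shorter match 3 mmol once \b fails after mm, whereas the lookahead version rejects the whole string.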

See the Python demo:

import re 

candis = ['mmol','mm']
test_reg = r"\d+(?=((?:\s*(?:{})-?[1-4]?){{1,3}}))\1\b".format('|'.join(candis))
test_str = '3 mmol mm 3 mmol2mm and 3 mmolmm AND NOT 3 mmol mmb'
print( [x.group() for x in re.finditer(test_reg,test_str)] )

Output:

['3 mmol mm', '3 mmol2mm', '3 mmolmm']
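If the unit list comes from elsewhere, the same idea can be wrapped in a small helper. This is only a sketch: build_unit_pattern and its max_units parameter are hypothetical names, not part of the answer above. It escapes each unit with re.escape and sorts longer units first, since inside the imitated atomic group a shorter prefix such as mm tried before mmol could end the unit run early.

```python
import re

def build_unit_pattern(units, max_units=3):
    # Hypothetical helper for illustration.
    # Longest-first ordering so 'mmol' is tried before its prefix 'mm';
    # re.escape guards against regex metacharacters in unit names.
    ordered = sorted(units, key=len, reverse=True)
    alternation = "|".join(re.escape(u) for u in ordered)
    unit = r"\s*(?:" + alternation + r")-?[1-4]?"
    body = "(?:" + unit + "){1," + str(max_units) + "}"
    # Lookahead + backreference imitates an atomic group, as in the answer.
    return re.compile(r"\d+(?=(" + body + r"))\1\b")

pattern = build_unit_pattern(['mm', 'mmol'])
test_str = '3 mmol mm 3 mmol2mm and 3 mmolmm AND NOT 3 mmol mmb'
matches = [m.group() for m in pattern.finditer(test_str)]
print(matches)  # ['3 mmol mm', '3 mmol2mm', '3 mmolmm']
```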