I'm trying to construct a regular expression to capture units and the corresponding values.
For example,
import re
candis = ['mmol','mm']
# each unit may carry an optional exponent such as -2 or 3
test_reg = '|'.join([ut + r"\-?[1-4]?" for ut in candis])
test_reg = r"\b(?:" + test_reg + r")\b"
test_reg = r"\d+ (?:" + test_reg + r"\s?){1,3}"
test_str = '3 mmol mm'
re.findall(test_reg,test_str)
Here test_reg is constructed to capture the unit mmol mm and the corresponding value of 3.
However, as the example shows, test_reg does not work for a string like 3 mmol2mm because of the \b.
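A minimal sketch reproducing the problem. The pattern is built with explicit + concatenation, and the name question_reg is only illustrative. The \b after the optional exponent cannot succeed between 2 and m, because both are word characters.

```python
import re

candis = ['mmol', 'mm']
# unit followed by an optional sign and an optional exponent digit 1-4
units = '|'.join(ut + r"\-?[1-4]?" for ut in candis)
# every unit is wrapped in word boundaries, as in the attempt above
question_reg = r"\d+ (?:\b(?:" + units + r")\b\s?){1,3}"

spaced = re.findall(question_reg, '3 mmol mm')   # units separated by a space
glued = re.findall(question_reg, '3 mmol2mm')    # units glued together
print(spaced)
print(glued)
```

The first call finds the whole string, while the second finds nothing.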
How can I construct a regular expression that also matches 3 mmol2mm and 3 mmolmm, that is, strings whose unit part consists strictly of items from candis? (3 mmol mmb should not match.)
CodePudding user response:
You can use
\d+(?=((?:\s*(?:mmol|mm)-?[1-4]?){1,3}))\1\b
See the regex demo. Details:
\d+ - one or more digits
(?=((?:\s*(?:mmol|mm)-?[1-4]?){1,3})) - a positive lookahead with a capturing group inside, used to imitate an atomic group; it matches a location that is immediately followed by
(?:\s*(?:mmol|mm)-?[1-4]?){1,3} - one, two or three occurrences of
    \s* - zero or more whitespace characters
    (?:mmol|mm) - a candis value
    -? - an optional - char
    [1-4]? - an optional digit from 1 to 4
\1 - the Group 1 value (backreferences do not allow backtracking)
\b - a word boundary.
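To see why the lookahead plus backreference matters, here is a small comparison sketch. The names plain_reg and atomic_reg are only illustrative. Without the atomic behaviour, the engine can give up the last unit and still report a partial match on the string that should be rejected.

```python
import re

unit = r"(?:\s*(?:mmol|mm)-?[1-4]?)"

# plain repetition: the engine may backtrack and drop the trailing " mm"
plain_reg = r"\d+" + unit + r"{1,3}\b"
# lookahead + backreference: the captured run of units is fixed once matched
atomic_reg = r"\d+(?=(" + unit + r"{1,3}))\1\b"

bad = '3 mmol mmb'
plain_hits = [m.group() for m in re.finditer(plain_reg, bad)]
atomic_hits = [m.group() for m in re.finditer(atomic_reg, bad)]
print(plain_hits)
print(atomic_hits)
```

The plain version returns the partial match 3 mmol, while the atomic version returns no match at all.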
See the Python demo:
import re
candis = ['mmol','mm']
test_reg = r"\d+(?=((?:\s*(?:{})-?[1-4]?){{1,3}}))\1\b".format('|'.join(candis))
test_str = '3 mmol mm 3 mmol2mm and 3 mmolmm AND NOT 3 mmol mmb'
print( [x.group() for x in re.finditer(test_reg,test_str)] )
Output:
['3 mmol mm', '3 mmol2mm', '3 mmolmm']
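One caveat, sketched under the assumption that candis may arrive in any order or contain regex metacharacters: since the lookahead does not backtrack, a shorter unit listed before a longer one that starts with it (mm before mmol) would be chosen first and the match could fail. The helper name build_unit_regex below is hypothetical. It sorts the units longest first and escapes them with re.escape.

```python
import re

def build_unit_regex(units):
    # longest first, so 'mmol' is tried before its prefix 'mm'
    ordered = sorted(units, key=len, reverse=True)
    # escape each unit in case it contains regex metacharacters
    alternation = '|'.join(re.escape(u) for u in ordered)
    return re.compile(
        r"\d+(?=((?:\s*(?:" + alternation + r")-?[1-4]?){1,3}))\1\b"
    )

# deliberately pass the units in the "wrong" order
pattern = build_unit_regex(['mm', 'mmol'])
text = '3 mmol mm 3 mmol2mm and 3 mmolmm AND NOT 3 mmol mmb'
matches = [m.group() for m in pattern.finditer(text)]
print(matches)
```

With this ordering the result is the same as the output above, regardless of how the input list is ordered.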