Python regex matching only if the groups are between a special character

I'm working with a dataframe of medicines and I want to extract the dosages from a full sentence taken from the product description.

Example of what I want:

Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg
#['5mg/10mg','30mg/60mg']

Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack
#['120mg/20g/12mg']

I can extract the individual dosages using \d+(?:[.,]\d+)*\s*(g|mg|), which gets me:

Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg
#['5mg','10mg','30mg','60mg']

Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack
#['120mg','20g','12mg','10mg']

This would be easier if the / only occurred once per group, but it can occur multiple times.

CodePudding user response：

You could get those matches using a single pattern, and then post-process each match to remove the spaces and hyphens:

-?\b\d+(?:[.,]\d+)*\s*m?g(?:\s*/\s*-?\d+(?:[.,]\d+)*\s*m?g)+\b

Explanation

-? Match an optional hyphen
\b A word boundary to prevent a partial word match
\d+(?:[.,]\d+)* Match 1+ digits with an optional decimal part
\s*m?g Match optional whitespace chars, an optional m, and g
(?: Non-capturing group to repeat as a whole
- \s*/\s* Match / between optional whitespace chars
- -?\d+(?:[.,]\d+)*\s*m?g Match an optional hyphen followed by the same digits and unit pattern as before
)+ Close the non-capturing group and repeat it 1+ times to match at least one part with a forward slash
\b A word boundary
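
As a sketch of how the parts described above fit together, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. The names NUMBER, UNIT and DOSAGE_CHAIN are introduced here for illustration only and are not part of the original answer.

```python
import re

# Building blocks (hypothetical helper names, chosen for this example).
NUMBER = r"\d+(?:[.,]\d+)*"  # 1+ digits with an optional decimal part
UNIT = r"m?g"                # "g" or "mg"

# re.VERBOSE ignores whitespace and "#" comments inside the pattern.
DOSAGE_CHAIN = re.compile(
    rf"""
    -?\b                     # optional hyphen, then a word boundary
    {NUMBER}\s*{UNIT}        # first dosage, such as 5 mg
    (?:                      # non-capturing group, repeated as a whole
        \s*/\s*              # a slash between optional whitespace
        -?{NUMBER}\s*{UNIT}  # another dosage, optionally after a hyphen
    )+                       # at least one slash part is required
    \b                       # closing word boundary
    """,
    re.VERBOSE,
)

text = "Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg"
print(DOSAGE_CHAIN.findall(text))
```

Because the group containing the slash must match at least once, a standalone dosage such as "10mg pack" is skipped.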

See a regex demo and a Python demo.

Example

import re

# Match two or more dosages joined by "/", allowing spaces and hyphens in between
pattern = r"-?\b\d+(?:[.,]\d+)*\s*m?g(?:\s*/\s*-?\d+(?:[.,]\d+)*\s*m?g)+\b"

strings = [
    "Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg",
    "Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack"
]

for s in strings:
    # Strip whitespace and hyphens from every match
    print([re.sub(r"[\s-]+", "", m) for m in re.findall(pattern, s)])

Output

['5mg/10mg', '30mg/60mg']
['120mg/20g/12mg']
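
Since the question mentions a dataframe, one possible way to reuse this is to wrap the two steps in a small function. This is only a sketch: the function name extract_dosages and the compiled pattern names are made up for this example.

```python
import re

# Precompiled patterns (names are illustrative, not from the original answer).
DOSAGE_CHAIN = re.compile(
    r"-?\b\d+(?:[.,]\d+)*\s*m?g(?:\s*/\s*-?\d+(?:[.,]\d+)*\s*m?g)+\b"
)
SPACES_AND_HYPHENS = re.compile(r"[\s-]+")


def extract_dosages(text):
    """Return the slash-joined dosages in text, without spaces or hyphens."""
    return [SPACES_AND_HYPHENS.sub("", m) for m in DOSAGE_CHAIN.findall(text)]


print(extract_dosages("Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack"))
```

With pandas, something like df["description"].apply(extract_dosages) should then give one list per row, assuming the text lives in a column named description (the column name is a guess, since the question does not show it).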