I'm working with a dataframe of medicines and I want to extract the dosages from a full sentence taken from the product description.
Example of what I want:
Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg
#['5mg/10mg','30mg/60mg']
Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack
#['120mg/20g/12mg']
I can extract the individual dosages using \d+(?:[.,]\d+)*\s*(g|mg|), which gets me:
Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg
#['5mg','10mg','30mg','60mg']
Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack
#['120mg','20g','12mg','10mg']
It would be easier to do this if / only happened once, but it can happen multiple times.
CodePudding user response:
You could get those matches using a pattern, and then post-process each match to remove the spaces and the hyphens:
-?\b\d+(?:[.,]\d+)*\s*m?g(?:\s*/\s*-?\d+(?:[.,]\d+)*\s*m?g)+\b
Explanation
-?
Match an optional hyphen
\b
A word boundary to prevent a partial word match
\d+(?:[.,]\d+)*
Match 1+ digits with an optional decimal part
\s*m?g
Match optional whitespace chars, an optional m, and g
(?:
Non-capture group to repeat as a whole
\s*/\s*
Match / between optional whitespace chars
-?\d+(?:[.,]\d+)*\s*m?g
Match an optional hyphen and the same digits pattern as before
)+
Close the non-capture group and repeat it 1+ times to match at least one part with a forward slash
\b
A word boundary
See a regex demo and a Python demo.
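As a rough sketch, the same pattern can also be written with re.VERBOSE so that each piece sits next to its explanation. The name DOSAGE_RE is only illustrative, and this relies on the pattern containing no literal spaces or # characters, which holds here.

```python
import re

# Illustrative name; same pattern as above, spread out with re.VERBOSE.
DOSAGE_RE = re.compile(r"""
    -?                       # optional hyphen
    \b                       # word boundary, prevents a partial word match
    \d+(?:[.,]\d+)*          # 1+ digits with an optional decimal part
    \s*m?g                   # optional whitespace, optional m, then g
    (?:                      # non-capture group, repeated as a whole
        \s*/\s*              # a / between optional whitespace
        -?\d+(?:[.,]\d+)*    # optional hyphen and the same digits pattern
        \s*m?g               # optional whitespace and the unit
    )+                       # 1+ times: at least one part after a slash
    \b                       # word boundary
""", re.VERBOSE)

s = "Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg"
print(DOSAGE_RE.findall(s))  # ['5 mg/10 mg', '30 mg/60 mg']
```

The raw matches still contain the spaces (and hyphens, if any), so the clean-up step is still needed afterwards.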
Example
import re

# Needs at least one "/" part, so a lone dosage such as "10mg" is skipped
pattern = r"-?\b\d+(?:[.,]\d+)*\s*m?g(?:\s*/\s*-?\d+(?:[.,]\d+)*\s*m?g)+\b"
strings = [
    "Dexamethasonacetat 5 mg/10 mg, Lidocain-HCl 1H2O 30 mg/60 mg",
    "Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack"
]
for s in strings:
    # Remove whitespace and hyphens from every match
    print([re.sub(r"[\s-]+", "", m) for m in re.findall(pattern, s)])
Output
['5mg/10mg', '30mg/60mg']
['120mg/20g/12mg']
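Since the question mentions a dataframe, one possible way to reuse this is to wrap both steps in a small helper. This is only a sketch: the names DOSAGE_RE and extract_dosages are made up for illustration and are not part of any library.

```python
import re

# Compile once so the pattern can be reused for every row.
DOSAGE_RE = re.compile(
    r"-?\b\d+(?:[.,]\d+)*\s*m?g(?:\s*/\s*-?\d+(?:[.,]\d+)*\s*m?g)+\b"
)

def extract_dosages(text):
    """Return the slash-separated dosages in text, without spaces or hyphens."""
    return [re.sub(r"[\s-]+", "", m) for m in DOSAGE_RE.findall(text)]

print(extract_dosages(
    "Anakinra 120 mg /-20 g /-12mg gentechnologisch hergestellt aus E. coli. 10mg pack"
))  # ['120mg/20g/12mg']
```

With pandas, something like df["dosages"] = df["description"].apply(extract_dosages) should then work, where both column names are hypothetical. Rows that are not strings (for example NaN) would need to be handled first.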