I'm working with a dataframe of medicines and I want to extract the dosage from a full sentence taken from the product description. There's a dosage for every active substance (DCI), and the substances are fed in as a list. The dosage for each DCI generally comes right after its name in the description.
I'm using:
teste = []
for x in listofdci:
    teste2 = [f"{x}{y}" for x, y in re.findall(rf"(?:{x})\s*(\d+(?:[.,]\d+)*)\s*(g|mg|)", description)]
    teste.extend(teste2)
It works well except for cases where the variable contains () or +, for example:
listofdci = [' Acid. L(+)-lacticum D4']
description = ' Acid. L(+)-lacticum D4 250 mg'
#error: nothing to repeat
#
listofdci = ['Zinkoxid', '(+/–)-α-Bisabolol', 'Lebertran (Typ A)', 'Retinol (Vitamin A)', 'Colecalciferol (Vitamin D3)']
description = 'Zinkoxid 13 g, (+/–)-α-Bisabolol 0,026 g (eingesetzt als Dragosantol-Zubereitung), Lebertran (Typ A) 5,2 g, Retinol (Vitamin A) 24,5 mg (entspr. 41 600 I.E. Retinolpalmitat [enth. Butylhydroxyanisol, Butylhydroxytoluol]), Colecalciferol (Vitamin D3) 10,4 mg (entspr. 10 400 I.E. mittelkettige Triglyceride [enth. all-rac-α-Tocopherol])'
#error: nothing to repeat
#Here it collects the first dosage -> ['13g'] and then raises the error
#
listofdci = [' Efeublätter-Trockenextrakt']
description = ' Efeublätter-Trockenextrakt (5-7,5:1) 65 mg - Auszugsmittel: Ethanol 30% (m/m)'
#[]
#here it outputs an empty list
Ideally I want to have:
listofdci = [' Acid. L(+)-lacticum D4']
description = ' Acid. L(+)-lacticum D4 250 mg'
#['250mg']
#
listofdci = ['Zinkoxid', '(+/–)-α-Bisabolol', 'Lebertran (Typ A)', 'Retinol (Vitamin A)', 'Colecalciferol (Vitamin D3)']
description = 'Zinkoxid 13 g, (+/–)-α-Bisabolol 0,026 g (eingesetzt als Dragosantol-Zubereitung), Lebertran (Typ A) 5,2 g, Retinol (Vitamin A) 24,5 mg (entspr. 41 600 I.E. Retinolpalmitat [enth. Butylhydroxyanisol, Butylhydroxytoluol]), Colecalciferol (Vitamin D3) 10,4 mg (entspr. 10 400 I.E. mittelkettige Triglyceride [enth. all-rac-α-Tocopherol])'
#['13g','0,026','5,2g','24,5','10,4']
#
listofdci = [' Efeublätter-Trockenextrakt']
description = ' Efeublätter-Trockenextrakt (5-7,5:1) 65 mg - Auszugsmittel: Ethanol 30% (m/m)'
#['65mg']
I don't know how to work around this specific problem, besides maybe removing every () or + from the dataset. Also, because those characters can appear anywhere in the string, I don't think I can match them using character sets: '[]'
CodePudding user response:
If there can be an optional substring inside parentheses between the keyword and number, you can use
import re

teste = []
for x in listofdci:
    # re.escape() makes (, ) and + in the substance name match literally
    test2 = [f"{x}{y}" for x, y in re.findall(rf"{re.escape(x)}(?:\s*\([^()]*\))?\s*(\d+(?:[.,]\d+)*)\s*(m?g\b|)", description)]
    if test2:
        teste.extend(test2)
See the Python demo.
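To see where "nothing to repeat" comes from, here is a minimal sketch using one substance name from the question. The variable names `name`, `raw_error` and `match` are only illustrative. Interpolated as-is, the `+` right after `(` is read as a quantifier with nothing in front of it; after `re.escape()` the same characters are matched literally.

```python
import re

name = "Acid. L(+)-lacticum D4"
text = "Acid. L(+)-lacticum D4 250 mg"

# Unescaped: "(+" is a group that starts with a quantifier, so compiling fails.
try:
    re.compile(rf"{name}\s*(\d+)")
    raw_error = None
except re.error as exc:
    raw_error = str(exc)

# Escaped: the parentheses and the plus sign are treated as plain characters.
match = re.search(rf"{re.escape(name)}\s*(\d+)", text)

print(raw_error)       # message starts with "nothing to repeat"
print(match.group(1))  # 250
```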
Details:
{re.escape(x)} - the escaped keyword
(?:\s*\([^()]*\))? - an optional sequence of zero or more whitespaces, (, zero or more chars other than ( and ), and then a )
\s* - zero or more whitespaces
(\d+(?:[.,]\d+)*) - one or more digits, and then zero or more sequences of . or , and one or more digits
\s* - zero or more whitespaces
(m?g\b|) - g or mg as whole words, or an empty string.
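The pieces above can be put together as a small helper and checked against the examples from the question. This is only a sketch: the function name `extract_dosages` and its parameter names are made up for illustration, and it assumes at most one parenthesised note sits between the substance name and the number.

```python
import re

def extract_dosages(substances, description):
    """Return 'number+unit' strings for each substance found in description."""
    dosages = []
    for name in substances:
        pattern = (
            rf"{re.escape(name)}"    # substance name, matched literally
            r"(?:\s*\([^()]*\))?"    # optional parenthesised note, e.g. "(5-7,5:1)"
            r"\s*(\d+(?:[.,]\d+)*)"  # number with optional . or , separated parts
            r"\s*(m?g\b|)"           # unit: g or mg as a whole word, or nothing
        )
        dosages.extend(f"{num}{unit}" for num, unit in re.findall(pattern, description))
    return dosages

print(extract_dosages(
    [" Efeublätter-Trockenextrakt"],
    " Efeublätter-Trockenextrakt (5-7,5:1) 65 mg - Auszugsmittel: Ethanol 30% (m/m)",
))  # ['65mg']
```

Note that the trailing `(entspr. 41 600 I.E. ...)` notes in the longer example are not picked up, because the optional parenthesised group is only allowed directly after the substance name, before the number.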