python regex with a variable not working when it contains specific characters

Time:05-20

I'm working with a dataframe of medicines and I want to extract the dosage from a full sentence taken from the product description. There is a dosage for every active substance (DCI), and the DCI names are supplied as a list. The dosage for each DCI generally comes right after its name in the description.

I'm using:

teste=[]
for x in listofdci:
   teste2 = [f"{x}{y}" for x,y in re.findall(rf"(?:{x})\s*(\d+(?:[.,]\d+)*)\s*(g|mg|)",strength)]
   teste.extend(teste2)

It works well except for cases where the variable contains () or +, for example:

listofdci = [' Acid. L(+)-lacticum D4']
description = ' Acid. L(+)-lacticum D4 250 mg'
#error: nothing to repeat

#

listofdci = ['Zinkoxid', '(+/–)-α-Bisabolol', 'Lebertran (Typ A)', 'Retinol (Vitamin A)', 'Colecalciferol (Vitamin D3)']
description = 'Zinkoxid 13 g, (+/–)-α-Bisabolol 0,026 g (eingesetzt als Dragosantol-Zubereitung), Lebertran (Typ A) 5,2 g, Retinol (Vitamin A) 24,5 mg (entspr. 41 600 I.E. Retinolpalmitat [enth. Butylhydroxyanisol, Butylhydroxytoluol]), Colecalciferol (Vitamin D3) 10,4 mg (entspr. 10 400 I.E. mittelkettige Triglyceride [enth. all-rac-α-Tocopherol])'
#error: nothing to repeat
#Here it collects the first dosage -> ['13g'] and then raises the error

#

listofdci = [' Efeublätter-Trockenextrakt']
description = ' Efeublätter-Trockenextrakt (5-7,5:1) 65 mg - Auszugsmittel: Ethanol 30% (m/m)'
#[]
#here it outputs an empty list

Ideally I want to have:

listofdci = [' Acid. L(+)-lacticum D4']
description = ' Acid. L(+)-lacticum D4 250 mg'
#['250mg']

#

listofdci = ['Zinkoxid', '(+/–)-α-Bisabolol', 'Lebertran (Typ A)', 'Retinol (Vitamin A)', 'Colecalciferol (Vitamin D3)']
description = 'Zinkoxid 13 g, (+/–)-α-Bisabolol 0,026 g (eingesetzt als Dragosantol-Zubereitung), Lebertran (Typ A) 5,2 g, Retinol (Vitamin A) 24,5 mg (entspr. 41 600 I.E. Retinolpalmitat [enth. Butylhydroxyanisol, Butylhydroxytoluol]), Colecalciferol (Vitamin D3) 10,4 mg (entspr. 10 400 I.E. mittelkettige Triglyceride [enth. all-rac-α-Tocopherol])'
#['13g','0,026','5,2g','24,5','10,4']

#

listofdci = [' Efeublätter-Trockenextrakt']
description = ' Efeublätter-Trockenextrakt (5-7,5:1) 65 mg - Auszugsmittel: Ethanol 30% (m/m)'
#['65mg']

I don't know how to get around this specific problem, other than maybe removing every () or + from the dataset. Also, because those characters can appear anywhere in the string, I don't think I can handle them with character sets ('[]').

Answer:

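Escape the keyword with re.escape so that characters such as (, ) and + are matched literally instead of being read as regex syntax. If there can also be an optional substring inside parentheses between the keyword and the number, you can use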

import re

teste=[]
for x in listofdci:
    test2 = [f"{x}{y}" for x,y in re.findall(rf"{re.escape(x)}(?:\s*\([^()]*\))?\s*(\d+(?:[.,]\d+)*)\s*(m?g\b|)", description)]
    if test2:
        teste.extend(test2)

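To see why the unescaped name fails, here is a minimal sketch using one name from the question. The simplified `\s*(\d+)` tail and the variable names `name`, `text`, `raw_error` and `escaped_hits` are only illustrative. Interpolated as-is, `(+)` is read as a group that starts with a quantifier, which is what triggers "nothing to repeat"; after `re.escape` the same characters are matched literally.

```python
import re

name = "Acid. L(+)-lacticum D4"        # sample name from the question
text = "Acid. L(+)-lacticum D4 250 mg"

# Unescaped: "(+)" is parsed as a group whose first token is a quantifier.
try:
    re.findall(rf"(?:{name})\s*(\d+)", text)
    raw_error = None
except re.error as exc:
    raw_error = str(exc)

# Escaped: every regex metacharacter in the name is matched literally.
escaped_hits = re.findall(rf"{re.escape(name)}\s*(\d+)", text)

print(raw_error)      # message starts with "nothing to repeat"
print(escaped_hits)   # ['250']
```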

Details:

  • {re.escape(x)} - the escaped keyword
  • (?:\s*\([^()]*\))? - an optional sequence of zero or more whitespaces, (, zero or more chars other than ( and ) and then a )
  • \s* - zero or more whitespaces
  • (\d+(?:[.,]\d+)*) - one or more digits, and then zero or more sequences of . or , followed by one or more digits
  • \s* - zero or more whitespaces
  • (m?g\b|) - g or mg followed by a word boundary, or an empty string.
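Putting the pieces together, the pattern can be wrapped in a small helper and run against the sample data. This is only a sketch: the function name `extract_dosages` is hypothetical, and the multi-ingredient description is shortened from the one in the question.

```python
import re

def extract_dosages(names, description):
    """Return strings such as '250mg' for each name found in description."""
    found = []
    for name in names:
        pattern = (
            rf"{re.escape(name)}"      # literal substance name
            r"(?:\s*\([^()]*\))?"      # optional note in parentheses, e.g. "(5-7,5:1)"
            r"\s*(\d+(?:[.,]\d+)*)"    # number with optional . or , separated parts
            r"\s*(m?g\b|)"             # unit g or mg, or nothing
        )
        found.extend(f"{num}{unit}" for num, unit in re.findall(pattern, description))
    return found

single = extract_dosages([" Acid. L(+)-lacticum D4"],
                         " Acid. L(+)-lacticum D4 250 mg")
with_note = extract_dosages(
    [" Efeublätter-Trockenextrakt"],
    " Efeublätter-Trockenextrakt (5-7,5:1) 65 mg - Auszugsmittel: Ethanol 30% (m/m)")
several = extract_dosages(
    ["Zinkoxid", "(+/–)-α-Bisabolol", "Lebertran (Typ A)"],
    "Zinkoxid 13 g, (+/–)-α-Bisabolol 0,026 g, Lebertran (Typ A) 5,2 g")

print(single)     # ['250mg']
print(with_note)  # ['65mg']
print(several)    # ['13g', '0,026g', '5,2g']
```

Note that the unit is appended whenever one follows the number, so this sketch returns '0,026g' rather than a bare '0,026'.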