python regex findall returns two groups instead of just one

Time:05-19

I'm working with a dataframe of medicines and I want to extract the dosage from a full sentence taken from each product description.

Some examples:

'Anakinra 100 g, gentechnologisch hergestellt aus E. coli.'
'Anakinra 100 mg, gentechnologisch hergestellt aus E. coli.'
'Anakinra 10.5 g, gentechnologisch hergestellt aus E. coli.'
'Anakinra 10, gentechnologisch hergestellt aus E. coli.'

I want to have:

'100g'
'100mg'
'10.5g'
'10'

Because I want to do this for every product, I decided to use a regex with the product's name as a variable, so I can later loop over the full list of products.

I tried:

import re
a_string = "Anakinra 100 mg, gentechnologisch hergestellt aus E. coli."
pattern = 'Anakinra'
re.findall(rf"({pattern}\s*\d+(?:[.,]\d+)*\s*\b(g|mg|))", a_string)

#[('Anakinra 100 mg', 'mg')]

As you can see it's returning two groups instead of just one. This might not be the right procedure either because in the end I only want the dosage part of the string. What would be your solution?

CodePudding user response:

You can capture the necessary details and then join the two groups:

import re
a_string = "Anakinra 100 mg, gentechnologisch hergestellt aus E. coli."
pattern = 'Anakinra'
print([f"{x}{y}" for x, y in re.findall(rf"(?:{pattern})\s*(\d+(?:[.,]\d+)*)\s*(g|mg|)", a_string)])
# => ['100mg']

Details:

  • (?:Anakinra) - a keyword (I kept the group in case there are several keywords like Anakinra|Anakirna)
  • \s* - zero or more whitespaces
  • (\d+(?:[.,]\d+)*) - Group 1: one or more digits, and then zero or more repetitions of . or , and one or more digits
  • \s* - zero or more whitespaces
  • (g|mg|) - Group 2: g, mg, or nothing (you can use (m?g|), too)
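
As background for the question title: `re.findall` returns a list of tuples whenever the pattern contains more than one capturing group, which is why the groups have to be joined. Below is a minimal sketch that applies the same idea to the four sample strings from the question. The helper name `extract_dosage` is hypothetical, and the `re.escape` call assumes the product name should be matched literally.

```python
import re

def extract_dosage(product, text):
    """Return the dosage that follows `product` in `text`, or None if absent."""
    # re.escape treats the product name as literal text (an assumption; drop it
    # if the name is meant to be a regex such as 'Anakinra|Anakirna').
    m = re.search(rf"(?:{re.escape(product)})\s*(\d+(?:[.,]\d+)*)\s*(m?g|)", text)
    # Join the number (group 1) and the optional unit (group 2).
    return f"{m.group(1)}{m.group(2)}" if m else None

samples = [
    "Anakinra 100 g, gentechnologisch hergestellt aus E. coli.",
    "Anakinra 100 mg, gentechnologisch hergestellt aus E. coli.",
    "Anakinra 10.5 g, gentechnologisch hergestellt aus E. coli.",
    "Anakinra 10, gentechnologisch hergestellt aus E. coli.",
]
results = [extract_dosage("Anakinra", s) for s in samples]
print(results)
# ['100g', '100mg', '10.5g', '10']
```

To cover a full list of products, the same helper could be called once per product name. If a number can be followed directly by a word that starts with g, adding `\b` after the unit, as in `(m?g\b|)`, may be worth considering.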

CodePudding user response:

You can try with the following regex:

(?![^\d]+)[^,]+

Explanation:

  • (?![^\d]+): negative lookahead that fails wherever one or more non-digit characters follow, so a match can only start at a digit
  • [^,]+: one or more characters other than a comma
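
A short sketch of how this could be used from Python, assuming the pattern is `(?![^\d]+)[^,]+` and that the dosage is always followed by a comma, as in the samples. Removing the space afterwards is an extra step added here to get the format asked for in the question.

```python
import re

a_string = "Anakinra 100 mg, gentechnologisch hergestellt aus E. coli."

# The pattern has no capturing groups, so group(0) is the whole match:
# everything from the first digit up to (not including) the next comma.
m = re.search(r"(?![^\d]+)[^,]+", a_string)
dosage = m.group(0) if m else None
print(dosage)   # 100 mg

# Drop the inner space to get the compact form from the question.
compact = dosage.replace(" ", "") if dosage else None
print(compact)  # 100mg
```

Because the match stops at the first comma, a decimal comma such as `10,5 g` would be cut off after `10`.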

EDIT: In case you need a stricter version:

(?!^'[^\d]+)\d+(\.\d+)?( m?g)?

Explanation:

  • (?!^'[^\d]+): negative lookahead that rejects a position where the following would match:
    • ^': the start of the string followed by a quote
    • [^\d]+: one or more characters other than a digit
  • \d+: one or more digits
  • (\.\d+)?: an optional dot followed by one or more digits
  • ( m?g)?: an optional space followed by g or mg
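
One thing to watch with the stricter pattern, assuming it reads `(?!^'[^\d]+)\d+(\.\d+)?( m?g)?`: it contains two capturing groups, so `re.findall` returns tuples of those groups rather than the full dosage. A minimal sketch that uses `re.search` and `group(0)` instead:

```python
import re

strict = r"(?!^'[^\d]+)\d+(\.\d+)?( m?g)?"
a_string = "Anakinra 10.5 g, gentechnologisch hergestellt aus E. coli."

# findall returns only the captured groups, not the whole match.
groups = re.findall(strict, a_string)
print(groups)   # [('.5', ' g')]

# group(0) of a match object is the full matched text.
m = re.search(strict, a_string)
dosage = m.group(0).replace(" ", "") if m else None
print(dosage)   # 10.5g
```

Turning the groups into non-capturing ones, `(?:\.\d+)?` and `(?: m?g)?`, would be another way to make `findall` return the whole match.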
