extract medicine name and dose with regex pattern

Time:04-16

I would like to extract medical information in text that has the following format STRING (CHAR - STRING - STRING - [STRING - STRING - STRING]) for example:

OLANZAPINE 10 MG - ORODISPERSIBLE TABLET  (S - n/a - Drug withdrawn - [n/a - n/a - n/a])

I would like to extract

  • OLANZAPINE 10 MG - ORODISPERSIBLE TABLET
  • S
  • n/a
  • Drug withdrawn
  • n/a
  • n/a
  • n/a

Currently, I use this, but it extracts all the words separately:

import re

s = 'OLANZAPINE 10 MG - ORODISPERSIBLE TABLET  (S - n/a - Drug withdrawn - [n/a - n/a - n/a])'
re.findall(r"[a-zA-Z.0-9/]+", s)

['OLANZAPINE',
 '10',
 'MG',
 'ORODISPERSIBLE',
 'TABLET',
 'S',
 'n/a',
 'Drug',
 'withdrawn',
 'n/a',
 'n/a',
 'n/a']

CodePudding user response:

You can try catching your needed information by exploiting the dash character - and parentheses ( and ).

Here's my attempt:

(^|\n)(.*) \(([^\-]+)- ([^\-]+)- ([^-]+)- \[([^\-]+)- ([^\-]+)- ([^\-]+)\]\)

You can retrieve your matches using groups from Group 2 to Group 8.

Is this what you're looking for?
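As a rough sketch of how this pattern could be applied in Python, assuming the quantifiers are + and reusing the sample string from the question: Group 1 only holds the start-of-line anchor, so the fields are taken from Groups 2 to 8. The strip() call is an addition here, needed because each [^\-]+ group also captures the space that precedes the next dash.

```python
import re

# Pattern from the answer above, assuming "+" quantifiers.
pattern = re.compile(
    r"(^|\n)(.*) \(([^\-]+)- ([^\-]+)- ([^-]+)- \[([^\-]+)- ([^\-]+)- ([^\-]+)\]\)"
)

s = "OLANZAPINE 10 MG - ORODISPERSIBLE TABLET  (S - n/a - Drug withdrawn - [n/a - n/a - n/a])"

match = pattern.search(s)

# Skip Group 1 (the anchor) and trim the whitespace captured around each field.
fields = [g.strip() for g in match.groups()[1:]] if match else []
print(fields)
```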

CodePudding user response:

If the entries always have the same format it could be something along the lines of:

^(?P<capture_1>[\w\s\-]+)\((?P<capture_2>\w)[\s\-]+(?P<capture_3>[\S\w]+)[\s\-]+(?P<capture_4>[\s\w]+)[\s\-]+\[(?P<capture_5>[\S\w]+)[\s\-]+(?P<capture_6>[\S\w]+)[\s\-]+(?P<capture_7>[\S\w]+)\]\)

See: https://regex101.com/r/zCCWh2/1
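For illustration, the named groups can be read back as a dictionary with groupdict(). This is a minimal sketch that assumes the quantifiers are + and uses the sample string from the question; the strip() step is an addition, since capture_1 and capture_4 also pick up trailing whitespace.

```python
import re

# Named-group pattern from the answer above, split over lines for readability.
pattern = re.compile(
    r"^(?P<capture_1>[\w\s\-]+)\((?P<capture_2>\w)[\s\-]+(?P<capture_3>[\S\w]+)"
    r"[\s\-]+(?P<capture_4>[\s\w]+)[\s\-]+\[(?P<capture_5>[\S\w]+)"
    r"[\s\-]+(?P<capture_6>[\S\w]+)[\s\-]+(?P<capture_7>[\S\w]+)\]\)"
)

s = "OLANZAPINE 10 MG - ORODISPERSIBLE TABLET  (S - n/a - Drug withdrawn - [n/a - n/a - n/a])"

m = pattern.match(s)

# Map each group name to its trimmed value; empty dict if the line does not match.
named = {name: value.strip() for name, value in m.groupdict().items()} if m else {}
print(named)
```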

CodePudding user response:

The pattern that you used, [a-zA-Z.0-9/]+, is a character class that matches any of the listed characters 1 or more times.

It does not take any context into account, such as matching parentheses or distinguishing a single character from multiple characters. Because the space and the dash are not in the class, every run of listed characters becomes a separate match.

You might use a pattern like:

(.+?)\s+\(([A-Za-z])\s+-\s+([^-]+)\s+-\s+([^-]+?)\s+-\s+\[([^][\s]+)\s+-\s+([^][\s]+)\s+-\s+([^][\s]+)]\)

The separate parts match:

  • (.+?) Capture 1 or more characters, as few as possible
  • \s+\( Match 1 or more whitespace chars and (
  • ([A-Za-z]) Capture a single letter
  • \s+-\s+ Match - with 1 or more whitespace chars on each side
  • ([^-]+) Capture 1 or more chars other than -
  • \s+-\s+ Match - with 1 or more whitespace chars on each side
  • ([^-]+?) Capture 1 or more chars other than -, as few as possible
  • \s+-\s+ Match - with 1 or more whitespace chars on each side
  • \[ Match [
  • ([^][\s]+) Capture 1 or more chars other than a whitespace char or square brackets
  • \s+-\s+ Match - with 1 or more whitespace chars on each side
  • ([^][\s]+) Capture 1 or more chars other than a whitespace char or square brackets
  • \s+-\s+ Match - with 1 or more whitespace chars on each side
  • ([^][\s]+) Capture 1 or more chars other than a whitespace char or square brackets
  • ]\) Match ])

See a regex demo.

Note that \s can also match a newline.
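A short sketch of this pattern in Python, assuming the quantifiers are + and using the sample string from the question. Because the whitespace around each dash is matched outside the capture groups, the captured values should need no trimming.

```python
import re

# Pattern from the answer above, split over two lines for readability.
pattern = re.compile(
    r"(.+?)\s+\(([A-Za-z])\s+-\s+([^-]+)\s+-\s+([^-]+?)\s+-\s+"
    r"\[([^][\s]+)\s+-\s+([^][\s]+)\s+-\s+([^][\s]+)]\)"
)

s = "OLANZAPINE 10 MG - ORODISPERSIBLE TABLET  (S - n/a - Drug withdrawn - [n/a - n/a - n/a])"

m = pattern.search(s)

# All seven capture groups, in order; empty list if nothing matched.
parts = list(m.groups()) if m else []
print(parts)
```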
