I would like to extract medical information from text that has the following format:
STRING (CHAR - STRING - STRING - [STRING - STRING - STRING])
for example:
OLANZAPINE 10 MG - ORODISPERSIBLE TABLET (S - n/a - Drug withdrawn - [n/a - n/a - n/a])
I would like to extract:
- OLANZAPINE 10 MG - ORODISPERSIBLE TABLET
- S
- n/a
- Drug withdrawn
- n/a
- n/a
- n/a
Currently I use this, but it extracts every word separately:
import re
s = 'OLANZAPINE 10 MG - ORODISPERSIBLE TABLET (S - n/a - Drug withdrawn - [n/a - n/a - n/a])'
re.findall(r"[a-zA-Z.0-9/]+", s)
['OLANZAPINE',
'10',
'MG',
'ORODISPERSIBLE',
'TABLET',
'S',
'n/a',
'Drug',
'withdrawn',
'n/a',
'n/a',
'n/a']
CodePudding user response:
You can try capturing the information you need by exploiting the dash character - and the parentheses ( and ).
Here's my attempt:
(^|\n)(.*) \(([^\-]+)- ([^\-]+)- ([^-]+)- \[([^\-]+)- ([^\-]+)- ([^\-]+)\]\)
You can retrieve your matches from Group 2 to Group 8.
Is this what you're looking for?
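A minimal sketch of how this pattern might be used from Python, assuming each negated class is followed by a `+` quantifier. The variable names are illustrative only. Because `[^\-]+` also consumes the space in front of each dash, the captured values are stripped afterwards.

```python
import re

s = 'OLANZAPINE 10 MG - ORODISPERSIBLE TABLET (S - n/a - Drug withdrawn - [n/a - n/a - n/a])'

# Group 1 is the line anchor (^|\n); groups 2 to 8 hold the seven fields.
pattern = re.compile(
    r"(^|\n)(.*) \(([^\-]+)- ([^\-]+)- ([^-]+)- \[([^\-]+)- ([^\-]+)- ([^\-]+)\]\)"
)

match = pattern.search(s)
if match:
    # Skip group 1 and strip the trailing space the negated classes pick up.
    fields = [value.strip() for value in match.groups()[1:]]
else:
    fields = []

print(fields)
```

Note that `(.*)` is greedy, so on a line containing more than one ` (` it would extend to the last one.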
CodePudding user response:
If the entries always have the same format, it could be something along the lines of:
^(?P<capture_1>[\w\s\-]+)\((?P<capture_2>\w)[\s\-]+(?P<capture_3>[\S\w]+)[\s\-]+(?P<capture_4>[\s\w]+)[\s\-]+\[(?P<capture_5>[\S\w]+)[\s\-]+(?P<capture_6>[\S\w]+)[\s\-]+(?P<capture_7>[\S\w]+)\]\)
See: https://regex101.com/r/zCCWh2/1
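As a rough sketch, the named groups can be read back in Python with `groupdict()`. This assumes the character classes carry `+` quantifiers. The pattern is split over several string literals only for readability, and `named` is an illustrative variable name.

```python
import re

s = 'OLANZAPINE 10 MG - ORODISPERSIBLE TABLET (S - n/a - Drug withdrawn - [n/a - n/a - n/a])'

# Adjacent raw string literals are concatenated into one pattern.
pattern = re.compile(
    r"^(?P<capture_1>[\w\s\-]+)\((?P<capture_2>\w)[\s\-]+"
    r"(?P<capture_3>[\S\w]+)[\s\-]+(?P<capture_4>[\s\w]+)[\s\-]+"
    r"\[(?P<capture_5>[\S\w]+)[\s\-]+(?P<capture_6>[\S\w]+)[\s\-]+"
    r"(?P<capture_7>[\S\w]+)\]\)"
)

match = pattern.match(s)
# capture_1 and capture_4 include trailing whitespace, so strip every value.
named = {key: value.strip() for key, value in match.groupdict().items()} if match else {}

print(named)
```

One caveat: `[\w\s\-]` does not include characters such as `.` or `/`, so a name containing them (for example a dose written as `0.5 MG`) would not match without widening that class.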
CodePudding user response:
The pattern that you used, [a-zA-Z.0-9/]+, is a character class that matches any of the listed characters one or more times.
It does not take any context into account, such as matching parentheses or distinguishing between a single character and multiple characters.
You might use a pattern like:
(.+?)\s+\(([A-Za-z])\s+-\s+([^-]+)\s+-\s+([^-]+?)\s+-\s+\[([^][\s]+)\s+-\s+([^][\s]+)\s+-\s+([^][\s]+)]\)
The separate parts match:
- (.+?)         Capture 1+ characters, as few as possible
- \s+\(         Match 1+ whitespace chars and (
- ([A-Za-z])    Capture a single char
- \s+-\s+       Match - between 1+ whitespace chars
- ([^-]+)       Capture 1+ chars other than -
- \s+-\s+       Match - between 1+ whitespace chars
- ([^-]+?)      Capture 1+ chars other than -, as few as possible
- \s+-\s+       Match - between 1+ whitespace chars
- \[            Match [
- ([^][\s]+)    Capture 1+ chars other than a whitespace char or square brackets
- \s+-\s+       Match - between 1+ whitespace chars
- ([^][\s]+)    Capture 1+ chars other than a whitespace char or square brackets
- \s+-\s+       Match - between 1+ whitespace chars
- ([^][\s]+)    Capture 1+ chars other than a whitespace char or square brackets
- ]\)           Match ])
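The breakdown above can be mirrored in Python with `re.VERBOSE`, which ignores whitespace in the pattern and allows inline comments. This is only a sketch, assuming `+` quantifiers throughout. The names `ENTRY` and `parse_entry` are hypothetical helpers, not part of any library.

```python
import re

ENTRY = re.compile(
    r"""
    (.+?)          # 1: description, as few characters as possible
    \s+\(          # whitespace, then the opening parenthesis
    ([A-Za-z])     # 2: a single letter
    \s+-\s+        # a dash between whitespace
    ([^-]+)        # 3: characters other than a dash
    \s+-\s+
    ([^-]+?)       # 4: characters other than a dash, as few as possible
    \s+-\s+
    \[             # opening square bracket
    ([^][\s]+)     # 5: no whitespace and no square brackets
    \s+-\s+
    ([^][\s]+)     # 6: same as group 5
    \s+-\s+
    ([^][\s]+)     # 7: same as group 5
    ]\)            # closing square bracket and parenthesis
    """,
    re.VERBOSE,
)


def parse_entry(text):
    """Return the seven captured fields as a list, or None if nothing matches."""
    match = ENTRY.search(text)
    return list(match.groups()) if match else None


s = 'OLANZAPINE 10 MG - ORODISPERSIBLE TABLET (S - n/a - Drug withdrawn - [n/a - n/a - n/a])'
print(parse_entry(s))
```

Because every separator is matched explicitly with `\s+`, the captured values need no stripping here.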
See a regex demo.
Note that \s can also match a newline.
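To illustrate, the short sketch below runs only the first part of the pattern against an entry that has been wrapped onto two lines. The wrapped input is made up for the example. If matching across lines is not wanted, one possible alternative is `[^\S\n]`, which matches any whitespace character except a newline.

```python
import re

# Hypothetical input: the same entry, broken before the opening parenthesis.
wrapped = "OLANZAPINE 10 MG - ORODISPERSIBLE TABLET\n(S - n/a - Drug withdrawn - [n/a - n/a - n/a])"

# \s+ accepts the line break between TABLET and the parenthesis.
with_s = re.search(r"(.+?)\s+\(([A-Za-z])\s+-", wrapped)

# [^\S\n]+ is whitespace without the newline, so the wrapped entry is rejected.
same_line = re.search(r"(.+?)[^\S\n]+\(([A-Za-z])[^\S\n]+-", wrapped)

print(with_s is not None, same_line is None)
```

Which behaviour is preferable depends on whether entries in the real data can span lines.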