I have an overly complicated regex that, as far as I know, is correct:
route = r"""[\s |\(][iI](\.)?[vV](\.)?(\W|\s|$)?
|\s intravenously|\s intravenous
|[\s|\(][pP](\.)?[oO](\.)?(\W|\s|$)
|\s perorally|\s?(per)?oral(ly)?|\s intraduodenally
|[\s|\(]i(\.)?p(\.)?(\W|\s|$)?
|\s intraperitoneal(ly)?
|[\s|\(]i(\.)?c(\.)?v(\.)?(\W|\s|$)?
|\s intracerebroventricular(ly)?
|[\s|\(][iI](\.)?[gG](\.)?(\W|\s|$)?
|\s intragastric(ly)?
|[\s|\(]s(\.)?c(\.)?(\W|\s|$)?
|subcutaneous(ly)?(\s injection)?
|[\s|\(][iI](\.)?[mM](\.)?(\W|\s|$)?
|\sintramuscular
"""
With re.search, I manage to get only one of the numerous patterns present in a string:
s = 'Pharmacokinetics parameters evaluated after single IV or IM'
m = re.search(re.compile(route, re.X), s)
m.group(0)
' IV '
I read elsewhere that re.findall should be used to find all the occurrences.
In my dreams, this would return:
['IV', 'IM']
Unfortunately instead the result is:
[('', '', ' ', '', '', '', '', '', '', '',
  '', '', '', '', '', '', '', '', '', '',
  '', '', '', '', '', '', '', '', ''),
 ('', '', '', '', '', '', '', '', '', '',
  '', '', '', '', '', '', '', '', '', '',
  '', '', '', '', '', '', '', '', '')]
CodePudding user response:
For the excerpt you show:
\b
(?: i
    (?: ntra
        (?: cerebroventricular (?:ly)?
          | duodenally
          | gastric (?:ly)?
          | muscular
          | peritoneal (?:ly)?
          | venous (?:ly)?
        ) \b
      | \.? (?: [gmpv] | c \.? v ) \b \.?
    )
  |
    (?:per)? oral (?:ly)? \b
  |
    p \.? o \b \.?
  |
    subcutaneous (?:ly)? (?: \s injection )? \b
)
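To try the pattern above against the sample string from the question, something like the following sketch should do. The name ROUTE is only illustrative, and the flags re.X and re.I are assumed, as suggested in the advice below.

```python
import re

# The pattern above, compiled in free-spacing (re.X) and
# case-insensitive (re.I) mode. ROUTE is an illustrative name.
ROUTE = re.compile(r"""
    \b
    (?: i
        (?: ntra
            (?: cerebroventricular (?:ly)?
              | duodenally
              | gastric (?:ly)?
              | muscular
              | peritoneal (?:ly)?
              | venous (?:ly)?
            ) \b
          | \.? (?: [gmpv] | c \.? v ) \b \.?
        )
      |
        (?:per)? oral (?:ly)? \b
      |
        p \.? o \b \.?
      |
        subcutaneous (?:ly)? (?: \s injection )? \b
    )
""", re.X | re.I)

s = 'Pharmacokinetics parameters evaluated after single IV or IM'

# No capturing groups in the pattern, so findall returns whole matches.
routes = ROUTE.findall(s)
print(routes)  # ['IV', 'IM']
```

Since the matches no longer include the surrounding whitespace or parenthesis, there is nothing to strip afterwards.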
Advice:
- You have a very long pattern and you already use the re.X option, which is a good thing. Exploit it to the maximum by formatting the pattern in a rigorous and readable way. Consider using alphabetical order: it may sound silly, but what a time saver! It's also possible to add inline comments starting with #.
- You have many character classes containing the same letter in both cases: use the global re.I flag too and write your pattern in lower case.
- I see you try to delimit substrings with things like \s, or the ugly [\s|\(] (you don't need to escape a parenthesis in a character class, and | doesn't mean OR inside it), and (\W|\s|$)? (which is totally useless since you make it optional). Forget that and use word boundaries \b (read about them to understand exactly in which cases they match).
- Use re.findall instead of re.search, since you expect several matches in a single string.
- Use non-capturing groups (?: ... ) instead of capturing groups ( ... ). When a pattern contains capture groups, re.findall returns only the content of the capture groups and not the whole match.
- Factorize your pattern from the left: the pattern is tested from left to right, so factorizing from the left reduces the number of branches to test. With this in mind, this subpattern
  (?:per)? oral (?:ly)? \b | p \.? o \b \.?
  could be rewritten in this way:
  oral (?:ly)? \b | p (?: eroral (?:ly)? \b | \.? o \b \.? )
- You can also factorize from the right when possible. It's not a great improvement, but it reduces the pattern size.
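Two of the points above are easy to check for yourself. The sketch below uses deliberately reduced patterns (only i.v. and i.m., then only the oral and p.o. branches), not the full one. All variable names are illustrative, and the factorized form keeps \b before the optional trailing dot in both versions.

```python
import re

s = 'Pharmacokinetics parameters evaluated after single IV or IM'

# Capturing groups: findall returns the group contents, one tuple per
# match, with '' for every group that did not take part in the match.
with_captures = re.findall(r'\bi(\.)?[vm](\.)?', s, re.I)
print(with_captures)     # [('', ''), ('', '')]

# Non-capturing groups and word boundaries: findall returns whole matches.
without_captures = re.findall(r'\bi(?:\.)?[vm]\b(?:\.)?', s, re.I)
print(without_captures)  # ['IV', 'IM']

# Left factorization: both forms should find the same substrings.
flat = re.compile(
    r'\b(?: (?:per)? oral (?:ly)? \b | p \.? o \b \.? )', re.X | re.I)
factored = re.compile(
    r'\b(?: oral (?:ly)? \b | p (?: eroral (?:ly)? \b | \.? o \b \.? ) )',
    re.X | re.I)

text = 'given orally, perorally, p.o. or PO'
flat_matches = flat.findall(text)
factored_matches = factored.findall(text)
print(flat_matches == factored_matches)  # True
```

The first result has the same shape as the output in the question, only with 2 groups instead of 29.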