Find all patterns of a regex in a string

I have an overly complicated regex that, as far as I know, is correct:

route = r"""[\s |\(][iI](\.)?[vV](\.)?(\W|\s|$)?
               |\s intravenously|\s intravenous
               |[\s|\(][pP](\.)?[oO](\.)?(\W|\s|$)
               |\s perorally|\s?(per)?oral(ly)?|\s intraduodenally
               |[\s|\(]i(\.)?p(\.)?(\W|\s|$)?  
               |\s intraperitoneal(ly)?
               |[\s|\(]i(\.)?c(\.)?v(\.)?(\W|\s|$)? 
               |\s intracerebroventricular(ly)?
               |[\s|\(][iI](\.)?[gG](\.)?(\W|\s|$)?
               |\s intragastric(ly)?
               |[\s|\(]s(\.)?c(\.)?(\W|\s|$)?
               |subcutaneous(ly)?(\s injection)?
               |[\s|\(][iI](\.)?[mM](\.)?(\W|\s|$)? 
               |\sintramuscular
          """

With re.search I manage to get one of the patterns when it occurs in a string:

import re

s = 'Pharmacokinetics parameters evaluated after single IV or IM'

m = re.search(route, s, re.X)
m.group(0)
' IV '

I read somewhere else to use re.findall to find all the occurrences. In my dreams, this would return:

['IV', 'IM']

Unfortunately instead the result is:

[('',
  '',
  ' ',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''),
 ('',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '')]

CodePudding user response：

For the excerpt you show:

\b
(?: i 
    (?: ntra
        (?: cerebroventricular (?:ly)?
          | duodenally
          | gastric (?:ly)?
          | muscular
          | peritoneal (?:ly)?
          | venous (?:ly)?
        ) \b
      | \.? (?: [gmpv] | c \.? v ) \b \.?
    )
  |
    (?:per)? oral (?:ly)? \b
  |
    p \.? o \b \.?
  |
    subcutaneous (?:ly)? (?: \s  injection )? \b
)
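As a quick check, the pattern above can be compiled and run against the sentence from the question. This is only a sketch: the name ROUTE and the second test sentence are made up for illustration, and the flags re.X and re.I are the ones recommended in the advice below.

```python
import re

# The pattern from the answer, compiled in verbose, case-insensitive mode.
# ROUTE is a hypothetical name chosen for this example.
ROUTE = re.compile(r"""
\b
(?: i
    (?: ntra
        (?: cerebroventricular (?:ly)?
          | duodenally
          | gastric (?:ly)?
          | muscular
          | peritoneal (?:ly)?
          | venous (?:ly)?
        ) \b
      | \.? (?: [gmpv] | c \.? v ) \b \.?
    )
  |
    (?:per)? oral (?:ly)? \b
  |
    p \.? o \b \.?
  |
    subcutaneous (?:ly)? (?: \s  injection )? \b
)
""", re.X | re.I)

s = 'Pharmacokinetics parameters evaluated after single IV or IM'
print(ROUTE.findall(s))                      # ['IV', 'IM']
print(ROUTE.findall('given i.v. and p.o.'))  # ['i.v.', 'p.o.']
```

Because every group is non-capturing, re.findall returns the matched text itself. Note that the excerpt does not cover the s.c. abbreviation from the original pattern; it would need its own branch.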


Advice:

You have a very long pattern. You already use the re.X option, which is a good thing; exploit it to the maximum by formatting the pattern in a rigorous and readable way. Where possible, keep the alternatives in alphabetical order. It may sound silly, but what a time saver! It's also possible to add inline comments starting with #.
You have many character classes containing the same letter in both cases => use the global re.I flag too and write your pattern in lower case.
I see you try to delimit substrings with things like \s or the ugly [\s|\(] (you don't need to escape a parenthesis in a character class, and | doesn't mean OR inside it; it is a literal pipe) and (\W|\s|$)? (which is useless as a delimiter: since you make it optional, the match succeeds whether or not a delimiter follows). Forget that and use word boundaries \b (read about them to understand exactly where they match).
Use re.findall instead of re.search, since you expect several matches in a single string.
Use non-capturing groups (?: ... ) instead of capturing groups ( ... ). When a pattern contains capturing groups, re.findall returns only the contents of those groups (as a tuple per match when there are several groups) and not the whole match.
Factorize your pattern from the left (the pattern is tested from left to right, so factorizing from the left reduces the number of branches to test). With this in mind, the subpattern (?:per)? oral (?:ly)? \b | p \.? o \b \.? could be rewritten this way: oral (?:ly)? \b | p (?: eroral (?:ly)? \b | \.? o \b \.? )
You can also factorize from the right when possible. It's not a great improvement, but it reduces the pattern size.
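Several of the points above can be seen on a much smaller pattern. The following is a minimal sketch; the sample strings and variable names are invented for illustration and are not taken from the question.

```python
import re

text = "dosed i.v. or IM"

# Capturing groups: findall returns one tuple of group contents per match.
capturing = re.compile(r"\b i (\.)? [mv] \b (\.)?", re.X | re.I)
# Non-capturing groups: findall returns the whole matched text.
non_capturing = re.compile(r"\b i (?:\.)? [mv] \b (?:\.)?", re.X | re.I)

group_tuples = capturing.findall(text)
whole_matches = non_capturing.findall(text)
# finditer + group(0) also yields whole matches, even with capturing groups.
via_finditer = [m.group(0) for m in capturing.finditer(text)]

# Inside a character class, | is a literal pipe, not alternation.
pipe_hits = re.findall(r"[\s|(]iv", "x|iv")

# An optional trailing delimiter does not stop a match inside a longer word;
# a word boundary does.
loose = re.search(r"[\s(]iv(?:\W|\s|$)?", " ivory")
strict = re.search(r"\biv\b", " ivory")

print(group_tuples)    # [('.', '.'), ('', '')]
print(whole_matches)   # ['i.v.', 'IM']
print(via_finditer)    # ['i.v.', 'IM']
print(pipe_hits)       # ['|iv']
print(repr(loose.group(0)), strict)  # ' iv' None
```

The tuples of empty strings in the question come from the same mechanism as group_tuples here: most of the 29 capturing groups simply did not take part in each match.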