Find all patterns of a regex in a string

Time:06-29

I have an overly complicated regex that, as far as I know, is correct:

route = r"""[\s |\(][iI](\.)?[vV](\.)?(\W|\s|$)?
               |\s intravenously|\s intravenous
               |[\s|\(][pP](\.)?[oO](\.)?(\W|\s|$)
               |\s perorally|\s?(per)?oral(ly)?|\s intraduodenally
               |[\s|\(]i(\.)?p(\.)?(\W|\s|$)?  
               |\s intraperitoneal(ly)?
               |[\s|\(]i(\.)?c(\.)?v(\.)?(\W|\s|$)? 
               |\s intracerebroventricular(ly)?
               |[\s|\(][iI](\.)?[gG](\.)?(\W|\s|$)?
               |\s intragastric(ly)?
               |[\s|\(]s(\.)?c(\.)?(\W|\s|$)?
               |subcutaneous(ly)?(\s injection)?
               |[\s|\(][iI](\.)?[mM](\.)?(\W|\s|$)? 
               |\sintramuscular
          """

With re.search I manage to get one of the numerous patterns when it occurs in a string:

s = 'Pharmacokinetics parameters evaluated after single IV or IM'

m = re.search(re.compile(route, re.X), s)
m.group(0)
' IV '

I read somewhere else to use re.findall to find all the occurrences. In my dreams, this would return:

['IV', 'IM']

Unfortunately instead the result is:

[('',
  '',
  ' ',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''),
 ('',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '')]

CodePudding user response:

For the excerpt you show:

\b
(?: i 
    (?: ntra
        (?: cerebroventricular (?:ly)?
          | duodenally
          | gastric (?:ly)?
          | muscular
          | peritoneal (?:ly)?
          | venous (?:ly)?
        ) \b
      | \.? (?: [gmpv] | c \.? v ) \b \.?
    )
  |
    (?:per)? oral (?:ly)? \b
  |
    p \.? o \b \.?
  |
    subcutaneous (?:ly)? (?: \s  injection )? \b
)
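As a minimal sketch of how this pattern could be used, assuming it is compiled with both re.X (so the layout and comments are ignored) and re.I (so the lower-case letters also match "IV" and "IM"). The name ROUTE and the inline comments are additions for illustration; the pattern itself is unchanged:

```python
import re

# The pattern from above, unchanged except for the inline comments.
ROUTE = re.compile(r"""
\b
(?: i
    (?: ntra                                  # intra... spelled out
        (?: cerebroventricular (?:ly)?
          | duodenally
          | gastric (?:ly)?
          | muscular
          | peritoneal (?:ly)?
          | venous (?:ly)?
        ) \b
      | \.? (?: [gmpv] | c \.? v ) \b \.?     # i.g. i.m. i.p. i.v. i.c.v.
    )
  |
    (?:per)? oral (?:ly)? \b
  |
    p \.? o \b \.?                            # p.o.
  |
    subcutaneous (?:ly)? (?: \s  injection )? \b
)
""", re.X | re.I)

s = 'Pharmacokinetics parameters evaluated after single IV or IM'

# No capturing groups in the pattern, so findall returns whole matches.
matches = ROUTE.findall(s)
print(matches)  # ['IV', 'IM']
```

Since the pattern starts with \b instead of consuming a leading space, the results come back without surrounding whitespace.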


Advice:

  • You have a very long pattern. You already use the re.X option, which is a good thing; exploit it to the maximum by formatting the pattern in a rigorous and readable way. Where possible, list the alternatives in alphabetical order. It may sound silly, but what a time saver! It's also possible to add inline comments starting with #.
  • You have many character classes that contain the same letter in both cases: use the global re.I flag too and write your pattern in lower case.
  • I see you try to delimit substrings with things like \s or the ugly [\s|\(] (you don't need to escape a parenthesis in a character class, and | doesn't mean OR inside it; it is a literal |) and (\W|\s|$)? (which constrains nothing, since you make it optional). Forget that and use word boundaries \b (read about them to understand exactly in which cases they match).
  • Use re.findall instead of re.search since you expect several matches in a single string.
  • Use non-capturing groups (?: ... ) instead of capturing groups ( ... ). When a pattern contains capture groups, re.findall returns only the content of the capture groups and not the whole match; that is why you get tuples of mostly empty strings.
  • Factorize your pattern from the left (the alternatives are tested from left to right, so factoring out a common prefix reduces the number of branches to test). With this in mind, the subpattern (?:per)? oral (?:ly)? \b | p \.? o \b \.? could be rewritten this way: oral (?:ly)? \b | p (?: eroral (?:ly)? \b | \.? o \b \.? )
  • You can also factorize from the right when possible. It's not a great improvement, but it reduces the pattern size.
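To make the points about capture groups, word boundaries and character classes concrete, here is a small sketch. The sample strings and variable names are made up for illustration and are not taken from the question:

```python
import re

s = "given i.v. or i.m."

# One capturing group: findall returns only that group's content.
with_capture = re.findall(r"\bi(\.)?[vm]\b\.?", s)
print(with_capture)      # ['.', '.']

# Same pattern with a non-capturing group: findall returns whole matches.
without_capture = re.findall(r"\bi(?:\.)?[vm]\b\.?", s)
print(without_capture)   # ['i.v.', 'i.m.']

# finditer yields match objects, so group(0) is the whole match
# even when the pattern does contain capturing groups.
whole = [m.group(0) for m in re.finditer(r"\bi(\.)?[vm]\b\.?", s)]
print(whole)             # ['i.v.', 'i.m.']

# \b matches between a word character and a non-word character (or the
# string edge), so "im" inside "swim" is skipped and nothing extra is consumed.
bounded = re.findall(r"\bim\b", "IM (im) swim", re.I)
print(bounded)           # ['IM', 'im']

# Inside a character class, | is a literal pipe, not an alternation.
pipe = re.findall(r"[\s|\(]im", "a|im")
print(pipe)              # ['|im']
```

If the capture groups are needed for some other reason, re.finditer with m.group(0) is one way to keep them and still collect the whole matches.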