python regex include exceptions in the regex expression

Time:09-10

I am looking for expressions such as Vc or Am in texts, and for that I have:

rex = r"(\(?)(?<!([A-Za-z0-9]))[A-Z][a-z](?!([A-Za-z0-9]))(\)?)"

Explanation:

[A-Z][a-z] -> capital letter followed by a lowercase letter
(?<!([A-Za-z0-9])) -> lookbehind: not preceded by a letter or number
(?!([A-Za-z0-9])) -> lookahead: not followed by a letter or number
(\(?) ... (\)?) -> all of that optionally within parentheses

import re

text = "this is Vc and not Cr nor Pb"
matches = re.finditer(rex, text)  # rex as defined above; currently yields Vc, Cr and Pb

What I want to achieve is to exclude a list of terms such as Cr or Pb.

How should I include exceptions in the expression?

thanks

CodePudding user response:

First, let's shorten your RegEx:

  • (?<!([A-Za-z0-9])) -> lookbehind not being a letter or number
  • (?!([A-Za-z0-9]))(\)?) -> look ahead not being letter or number

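These are so common that there is a RegEx feature for them: word boundaries, \b. Like lookarounds, they have zero width. \b matches at a position where one side is a word character (a letter, digit, or underscore) and the other side is not (or is the start or end of the text). Note that, unlike your lookarounds, \b also counts the underscore as a word character.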

Your RegEx then becomes \b[A-Z][a-z]\b (leaving out the optional parentheses). Looking at this RegEx and your examples, it appears you want to match certain element abbreviations.

Now you can simply add a negative lookbehind after the two letters:

\b[A-Z][a-z](?<!Cr|Pb)\b

to assert that the element is neither chromium (Cr) nor lead (Pb).
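As a quick check, here is a minimal sketch that runs this pattern against the sample text from the question. The variable names are only illustrative. Python's re module requires lookbehinds to have a fixed width, which holds here because both alternatives are two characters long.

```python
import re

# Negative lookbehind placed after the two letters: the match is rejected
# if the two characters just consumed were "Cr" or "Pb".
pattern = re.compile(r"\b[A-Z][a-z](?<!Cr|Pb)\b")

text = "this is Vc and not Cr nor Pb"
found = pattern.findall(text)
print(found)  # ['Vc']
```

To exclude a term of a different length, a lookahead before the letters may be easier, since lookaheads have no fixed-width restriction.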


Just for fun:

Alternatively, if you want a less readable (but more portable) RegEx that makes do with fewer advanced RegEx features (not every engine supports lookaround), you can use character sets as per the following observations:

  • If the first letter is not a C or P, the second letter may be any lowercase letter;
  • If the first letter is a C, the second letter may not be an r;
  • If the first letter is a P, the second letter may not be a b.

Using character sets, this gives us:

  • [ABD-OQ-Z][a-z]
  • C[a-qs-z]
  • P[ac-z]

Operator precedence works as expected here: concatenation (implicit) has higher precedence than alternation (|). This makes the RegEx [ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z]. Wrapping this in word boundaries using a group gives us \b([ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z])\b.
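A small sketch to check the character-set version in Python. The sample text and variable names are made up for illustration. It should keep other abbreviations starting with C or P, such as Ca and Pd, while dropping Cr and Pb.

```python
import re

# Lookaround-free variant: the exclusions are encoded in the character sets.
portable = re.compile(r"\b([ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z])\b")

text = "Vc Cr Pb Ca Pd Am"
matches = [m.group(0) for m in portable.finditer(text)]
print(matches)  # ['Vc', 'Ca', 'Pd', 'Am']
```

The trade-off is that every new exception means editing the character sets by hand, so this is probably only practical for a short, fixed exclusion list.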

CodePudding user response:

You might write the pattern without using the superfluous capture groups, and exclude matching Cr or Pb:

\(?(?<![A-Za-z0-9])(?!Cr\b|Pb\b)[A-Z][a-z](?![A-Za-z0-9])\)?
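A minimal sketch of how this pattern behaves in Python, using a made-up sample text that includes parenthesized terms and a term followed by a digit:

```python
import re

# Same pattern as above: optional parentheses, no capture groups,
# and a negative lookahead that rejects Cr and Pb as whole words.
rex = re.compile(r"\(?(?<![A-Za-z0-9])(?!Cr\b|Pb\b)[A-Z][a-z](?![A-Za-z0-9])\)?")

text = "(Vc) and (Cr) but Am2 or Pb, also Am"
hits = [m.group(0) for m in rex.finditer(text)]
print(hits)  # ['(Vc)', 'Am']
```

Am2 is skipped because a digit follows the two letters, and (Cr) and Pb are skipped by the lookahead.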



If you are not interested in matching the parentheses, and you also want an underscore next to the two letters to prevent a match (just like a letter or number does), you can use a word boundary instead:

\b(?!Cr\b|Pb\b)[A-Z][a-z]\b

Explanation

  • \b A word boundary to prevent a partial word match
  • (?! Negative lookahead
    • Cr\b|Pb\b Match either Cr or Pb, each followed by a word boundary
  • ) Close the lookahead
  • [A-Z][a-z] Match a single uppercase and single lowercase char
  • \b A word boundary
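Since the question asks about a list of terms, the lookahead can also be built from that list. The following is a sketch only: build_pattern is a hypothetical helper name, not part of the re module, and it assumes the excluded terms are plain strings.

```python
import re

def build_pattern(excluded):
    """Hypothetical helper: build the word-boundary pattern with the
    given terms excluded via a negative lookahead."""
    if not excluded:
        # No exclusions: an empty lookahead (?!) would never match anything.
        return re.compile(r"\b[A-Z][a-z]\b")
    # Escape each term and require a word boundary after it, e.g. Cr\b|Pb\b
    alternatives = "|".join(re.escape(term) + r"\b" for term in excluded)
    return re.compile(r"\b(?!" + alternatives + r")[A-Z][a-z]\b")

pattern = build_pattern(["Cr", "Pb"])
text = "this is Vc and not Cr nor Pb"
result = pattern.findall(text)
print(result)  # ['Vc']
```

Adding another exception is then just a matter of appending to the list passed to build_pattern.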
