I am looking for expressions such as Vc or Am in texts, and for that I have:
rex = r"(\(?)(?<!([A-Za-z0-9]))[A-Z][a-z](?!([A-Za-z0-9]))(\)?)"
Explanation:
[A-Z][a-z] -> capital letter followed by a lowercase letter
(?<!([A-Za-z0-9])) -> lookbehind: not preceded by a letter or digit
(?!([A-Za-z0-9])) -> lookahead: not followed by a letter or digit
(\(?) and (\)?) -> all of that optionally within parentheses
import re
text="this is Vc and not Cr nor Pb"
matches = re.finditer(rex,text)
What I want to achieve is to exclude a list of terms such as Cr or Pb.
How should I include exceptions in the expression?
Thanks.
CodePudding user response:
First, let's shorten your RegEx:
(?<!([A-Za-z0-9])) -> lookbehind not being a letter or number
(?!([A-Za-z0-9]))(\)?) -> look ahead not being letter or number
These are so common that there is a RegEx feature for them: word boundaries (\b). Like lookarounds, they have zero width; they only match at a position where a word character (letter, digit or underscore) meets a non-word character or the start or end of the text.
Your RegEx then becomes \b[A-Z][a-z]\b; looking at this RegEx (and your examples), it appears you want to match certain element abbreviations?
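As a quick check, the shortened pattern can be run against the sample sentence from the question. This is only a sketch; the variable names are illustrative. It also shows the one difference from the original lookarounds: \b treats the underscore as a word character.

```python
import re

# Shortened pattern: a capital followed by a lowercase letter,
# delimited by word boundaries on both sides.
symbol_re = re.compile(r"\b[A-Z][a-z]\b")

text = "this is Vc and not Cr nor Pb"
found = symbol_re.findall(text)
print(found)  # ['Vc', 'Cr', 'Pb']

# "_" counts as a word character for \b, so "_Vc" is not matched.
underscore_case = symbol_re.findall("_Vc Vc")
print(underscore_case)  # ['Vc']
```

At this point Cr and Pb are still matched; excluding them is the next step.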
Now you can simply add a negative lookbehind after the two letters to assert that the element is neither chromium (Cr) nor lead (Pb).
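The exact pattern is not shown here, so the following is one possible form of such a lookbehind, written as an assumption rather than a quotation. Python requires lookbehinds to be fixed-width, which holds because Cr and Pb are both two characters long.

```python
import re

# Assumed pattern: negative lookbehind placed right after the two letters.
# Both alternatives have width 2, as Python's fixed-width lookbehind requires.
no_cr_pb_re = re.compile(r"\b[A-Z][a-z](?<!Cr|Pb)\b")

text = "this is Vc and not Cr nor Pb"
kept = no_cr_pb_re.findall(text)
print(kept)  # ['Vc']
```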
Just for fun:
Alternatively, if you want a less readable (but more portable) RegEx that makes do with fewer advanced RegEx features (not every engine supports lookaround), you can use character sets as per the following observations:
- If the first letter is not a C or P, the second letter may be any lowercase letter;
- If the first letter is a C, the second letter may not be an r;
- If the first letter is a P, the second letter may not be a b.
Using character sets, this gives us:
[ABD-OQ-Z][a-z]
C[a-qs-z]
P[ac-z]
Operator precedence works as expected here: concatenation (implicit) has higher precedence than alternation (|). This makes the RegEx [ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z]. Wrapping this in word boundaries using a group gives us \b([ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z])\b.
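To convince yourself that the character sets exclude exactly Cr and Pb and nothing else, you can brute-force all 26 x 26 capital-plus-lowercase pairs. A small sketch (the helper names are made up for this check):

```python
import re
from itertools import product
from string import ascii_lowercase, ascii_uppercase

charset_re = re.compile(r"\b([ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z])\b")

# Every possible capital + lowercase pair: "Aa", "Ab", ..., "Zz".
all_pairs = [upper + lower
             for upper, lower in product(ascii_uppercase, ascii_lowercase)]

# Collect the pairs that the character-set pattern refuses to match.
rejected = [pair for pair in all_pairs if charset_re.fullmatch(pair) is None]
print(rejected)  # ['Cr', 'Pb']
```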
CodePudding user response:
You might write the pattern without using the superfluous capture groups, and exclude matching Cr or Pb:
\(?(?<![A-Za-z0-9])(?!Cr\b|Pb\b)[A-Z][a-z](?![A-Za-z0-9])\)?
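A short Python sketch of how this pattern behaves, using a slightly extended version of the sentence from the question (the extra "(Am)" and "(Cr)" are added here only to exercise the optional parentheses):

```python
import re

paren_re = re.compile(
    r"\(?(?<![A-Za-z0-9])(?!Cr\b|Pb\b)[A-Z][a-z](?![A-Za-z0-9])\)?"
)

# Sample text extended with parenthesised cases for illustration.
text = "this is Vc and not Cr nor Pb, but (Am) and (Cr)"
hits = [m.group() for m in paren_re.finditer(text)]
print(hits)  # ['Vc', '(Am)']
```

The parentheses are part of the match when present, and an excluded term stays excluded even inside parentheses.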
If you are not interested in matching the parentheses, and you also do not want to allow a match next to an underscore (in addition to letters or digits), you can use a word boundary instead:
\b(?!Cr\b|Pb\b)[A-Z][a-z]\b
Explanation
- \b A word boundary to prevent a partial word match
- (?! Negative lookahead
- Cr\b|Pb\b Match either Cr or Pb
- ) Close the lookahead
- [A-Z][a-z] Match a single uppercase and single lowercase char
- \b A word boundary
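Since the question asks about a list of exceptions, the lookahead can be assembled from that list. The function below is a hypothetical helper (its name and signature are not from the answers above), shown as a sketch of one way to do it:

```python
import re

def build_symbol_regex(excluded):
    """Compile a two-letter symbol pattern that skips the given terms.

    Hypothetical helper: `excluded` is any iterable of strings to reject.
    """
    # re.escape guards against regex metacharacters in the exclusions.
    alternatives = "|".join(re.escape(term) for term in excluded)
    if not alternatives:
        # Nothing to exclude: fall back to the plain pattern.
        return re.compile(r"\b[A-Z][a-z]\b")
    return re.compile(rf"\b(?!(?:{alternatives})\b)[A-Z][a-z]\b")

text = "this is Vc and not Cr nor Pb"
symbols = build_symbol_regex(["Cr", "Pb"]).findall(text)
print(symbols)  # ['Vc']
```

Grouping the alternatives with (?:...) lets a single \b apply to all of them, which is equivalent to writing Cr\b|Pb\b by hand.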