I want to extract data relating to a few categories from a text file (except for Categories A and B). The format of the substring would be something like
some text preceding Category C This is some text I'm aware of belonging to the categories
To handle this while excluding Category A and B data, I use a simple negative lookbehind regex:
(?<!Category A )(?<!Category B )This is some text I'm aware of
However, in a limited number of cases, Category A/B in the text is followed by a few characters (at most 5). For instance:
some text Category A 1. This is some text I'm aware of belonging to the categories
So I tried changing the regex to:
(?<!Category A.{5})(?<!Category B.{5})This is some text I'm aware of
It works when exactly 5 characters follow Category A/B, but it does not let me change {5} to {0,5}, and complains:
quantifier not fixed
How can I get this to work?
CodePudding user response:
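The regex module from PyPI to the rescue:
import regex  # third-party module: pip install regex
# Unlike the built-in re module, regex accepts variable-length lookbehinds such as .{0,5}
pattern = r"(?<!Category A.{0,5})(?<!Category B.{0,5})This is some text I'm aware of"
print(regex.findall(pattern, "This is some text I'm aware of"))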
Documentation:
Variable-length lookbehind
A lookbehind can match a variable-length string.
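If installing a third-party module is not an option, a similar effect can be had with the built-in re module by writing one fixed-width lookbehind per allowed gap length (0 through 5). This is a swapped-in technique rather than what the answer above uses; the names TARGET and build_pattern are hypothetical, chosen only for this sketch.

```python
import re

# The known text to find (taken from the question).
TARGET = "This is some text I'm aware of"

def build_pattern(max_gap=5):
    # Hypothetical helper: emits one fixed-width negative lookbehind
    # for every gap length 0..max_gap and every excluded category,
    # e.g. (?<!Category A.{0})(?<!Category A.{1}) ... (?<!Category B.{5})
    lookbehinds = "".join(
        f"(?<!Category {letter}.{{{n}}})"
        for letter in "AB"
        for n in range(max_gap + 1)
    )
    return re.compile(lookbehinds + re.escape(TARGET))

pattern = build_pattern()
print(bool(pattern.search("some text preceding Category C " + TARGET)))
print(bool(pattern.search("some text Category A 1. " + TARGET)))
```

Each lookbehind has a fixed width, so re compiles the pattern, and together they cover the same gap lengths as .{0,5}.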
CodePudding user response:
Using Python's built-in re, if you want to match Category followed by any uppercase letter except A and B, you can match C-Z followed by 0-5 arbitrary characters and capture the text in a capture group.
\bCategory [C-Z].{0,5}\b(This is some text I'm aware of)\b
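A minimal sketch of how that pattern might be used, assuming the sample strings from the question; the helper name extract is hypothetical and only there to make the behaviour easy to check.

```python
import re

# Match "Category" followed by a letter C-Z, then up to 5 arbitrary
# characters, and capture the known text in group 1.
pattern = re.compile(
    r"\bCategory [C-Z].{0,5}\b(This is some text I'm aware of)\b"
)

def extract(text):
    # Hypothetical helper: returns the captured text, or None if
    # the line belongs to Category A/B or has no category at all.
    match = pattern.search(text)
    return match.group(1) if match else None

print(extract("some text preceding Category C This is some text I'm aware of belonging to the categories"))
print(extract("some text Category A 1. This is some text I'm aware of belonging to the categories"))
```

Unlike the lookbehind version, this pattern requires a category name to be present, so text with no preceding Category is not matched.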