Negative lookbehind quantifier not fixed

I want to extract data relating to a few categories from a text file, excluding Categories A and B. The substring has a format like this:

some text preceding Category C This is some text I'm aware of belonging to the categories

To handle this while excluding Category A and B data, I use a simple negative lookbehind regex:

(?<!Category A )(?<!Category B )This is some text I'm aware of

However, I also have some limited cases where Category A/B in the text would be followed by a few characters (max 5). For instance:

some text Category A 1. This is some text I'm aware of belonging to the categories

So I tried changing the regex to:

(?<!Category A.{5})(?<!Category B.{5})This is some text I'm aware of

This works when exactly 5 characters follow Category A/B, but when I change {5} to {0,5} the pattern is rejected with:

quantifier not fixed

How can I get this to work?
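
The failure can be reproduced with a minimal sketch, assuming the pattern is compiled with Python's built-in re module (the question does not name the engine; the answers below assume Python). In that case re reports the same problem in its own wording, "look-behind requires fixed-width pattern":

```python
import re

# Assumption: the pattern is compiled with Python's standard re module.
pattern = r"(?<!Category A.{0,5})This is some text I'm aware of"

try:
    re.compile(pattern)
    error_message = None
except re.error as exc:
    # re only accepts lookbehinds whose width is known in advance,
    # so the {0,5} quantifier inside (?<!...) is rejected at compile time
    error_message = str(exc)

print(error_message)
```

With {5} in place of {0,5} the same pattern compiles, because the lookbehind then has a single fixed width.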

CodePudding user response：

The third-party regex module from PyPI to the rescue; unlike the built-in re, it accepts variable-length lookbehinds:

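import regex  # third-party package (pip install regex), not the standard re module
pattern = r"(?<!Category A.{0,5})(?<!Category B.{0,5})This is some text I'm aware of"
# Matches: no Category A/B within five characters before the text
print(regex.findall(pattern, "Category C This is some text I'm aware of"))
# No match: "Category A" plus four filler characters precedes the text
print(regex.findall(pattern, "Category A 1. This is some text I'm aware of"))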

Documentation:

Variable-length lookbehind

A lookbehind can match a variable-length string.


CodePudding user response：

With Python's built-in re module, if you want to match Category followed by any uppercase letter except A and B, you can match [C-Z], then 0-5 arbitrary characters, and capture the known text in a capture group:

\bCategory [C-Z].{0,5}\b(This is some text I'm aware of)\b
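
A minimal sketch of how this pattern could be applied with re.search, reading the text from group 1. The helper name extract and the third sample line (Category D) are made up for illustration; the first two samples come from the question.

```python
import re

# Pattern from the answer above: require "Category" plus a letter C-Z,
# allow up to five filler characters, then capture the known text in group 1.
PATTERN = re.compile(r"\bCategory [C-Z].{0,5}\b(This is some text I'm aware of)\b")


def extract(line):
    """Return the captured text, or None when the line does not match.

    Hypothetical helper, not part of the original answer.
    """
    match = PATTERN.search(line)
    return match.group(1) if match else None


samples = [
    "some text preceding Category C This is some text I'm aware of belonging to the categories",
    "some text Category A 1. This is some text I'm aware of belonging to the categories",
    # Hypothetical line: an allowed category followed by a few filler characters
    "some text Category D 2. This is some text I'm aware of belonging to the categories",
]

for line in samples:
    print(extract(line))
```

Unlike the negative lookbehind, this pattern requires an allowed category to appear right before the text, so a line containing the text with no category at all is not matched.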
