Home > other >  How can I make Python re work like grep for repeating groups?
How can I make Python re work like grep for repeating groups?

Time:03-16

I have the following string:

seq = 'MNRYLNRQRLYNMYRNKYRGVMEPMSRMTMDFQGRYMDSQGRMVDPRYYDHYGRMHDYDRYYGRSMFNQGHSMDSQRYGGWMDNPERYMDMSGYQMDMQGRWMDAQGRYNNPFSQMWHSRQGH'

also saved in a file called seq.dat. If I use the following grep command

grep '\([MF]D.\{4,6\}\)\{3,10\}' seq.dat

I get the following matching string:

MDNPERYMDMSGYQMDMQGRWMDAQGRYN

which is what I want. In words, what I want to match is as many consecutive repeats as the string has of [MF]D.{4,6}. I don't want to match cases where it has less than 3 consecutive repeats, but I want it to be able to capture up to 6.

Now, I'm trying to do this with python. I have

p = re.compile("(?:[MF]D.{4,6}){3,10}")

Trying search() returns

MDNPERYMDMSGYQMDMQGRWM

It is the close to the answer I seek, but is still missing the last MDAQGRYN. I'm guessing this is because .{4,6} matches the M, which in turn prevents {3,10} from capturing this 4th occurence of ([MF]D.{4,6}), but since I asked for at least 3, it's happy and it stops.

How do I make Python regex behave like grep does?

CodePudding user response:

There is a fundamental difference between POSIX ("text-directed") and NFA ("regex-directed") engines. POSIX engines (grep here uses a POSIX BRE regex flavor, it is the flavor used by default) will parse the input text applying the regex to it and return the longest match possible. NFA engine (Python re engine is an NFA engine) here does not re-consume (backtrack) when the subsequent pattern parts match.

See reference on regex-directed and text-directed engines:

A regex-directed engine walks through the regex, attempting to match the next token in the regex to the next character. If a match is found, the engine advances through the regex and the subject string. If a token fails to match, the engine backtracks to a previous position in the regex and the subject string where it can try a different path through the regex... Modern regex flavors using regex-directed engines have lots of features such as atomic grouping and possessive quantifiers that allow you to control this backtracking.

A text-directed engine walks through the subject string, attempting all permutations of the regex before advancing to the next character in the string. A text-directed engine never backtracks. Thus, there isn’t much to discuss about the matching process of a text-directed engine. In most cases, a text-directed engine finds the same matches as a regex-directed engine.

The last sentence says "in most cases", but not all cases, and yours is a good illustration that discrepances may occur.

To avoid consuming M or F that are immediately followed with D, I'd suggest using

(?:[MF]D(?:(?![MF]D).){4,6}){3,10}

See the regex demo. Details:

  • (?: - start of an outer non-capturing container group:
    • [MF]D - M or F and then D
    • (?:(?![MF]D).){4,6} - any char (other than a line break) repeated four to six times, that does not start an MD or FD char sequence
  • ){3,10} - end of the outer group, repeat 3 to 10 times.

By the way, if you only want to match uppercase ASCII letters, replace the . with [A-Z].

  • Related