Allow Python re.findall to find overlapping matches from left to right

My requirement is very simple, but I just could not figure out how to meet it.

This is the original string: ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG. I want to find all substrings that consist only of [ACGT], end with ATGT, and are at least 8 characters long. What I expect is:

GGATGTGGGGGGATGT
GGATGTGGGGGGATGTCCCCCATGT

With the following code:

import re

seq = 'ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG'

matches = re.findall("[ACGT]{4,}ATGT", seq)

if matches:
    for match in matches:
        print(match)

I got only

GGATGTGGGGGGATGTCCCCCATGT

The shorter one is missing. Then I realized that re.findall returns only non-overlapping matches. I found a solution in "How to use regex to find all overlapping matches" and modified the code to:

matches = re.findall("(?=([ACGT]{4,}ATGT))", seq)

Then I got:

GGATGTGGGGGGATGTCCCCCATGT
GATGTGGGGGGATGTCCCCCATGT
ATGTGGGGGGATGTCCCCCATGT
TGTGGGGGGATGTCCCCCATGT
GTGGGGGGATGTCCCCCATGT
TGGGGGGATGTCCCCCATGT
GGGGGGATGTCCCCCATGT
GGGGGATGTCCCCCATGT
GGGGATGTCCCCCATGT
GGGATGTCCCCCATGT
GGATGTCCCCCATGT
GATGTCCCCCATGT
ATGTCCCCCATGT
TGTCCCCCATGT
GTCCCCCATGT
TCCCCCATGT
CCCCCATGT
CCCCATGT

Then I realized that this search seems to run from right to left. So how can I make re.findall search from left to right while still allowing overlapping matches?
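
A note on the lookahead output above: it is consistent with a left-to-right scan in which the lookahead is tried at every start position and the greedy [ACGT]{4,} stretches each match to the last ATGT it can reach. So the start moves one character at a time while the end stays fixed. A minimal sketch to check this, using re.finditer to expose the span of the captured group (the variable name hits is only illustrative):

```python
import re

seq = 'ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG'

# span(1) is the (start, end) of the text captured inside the lookahead.
hits = [m.span(1) for m in re.finditer(r'(?=([ACGT]{4,}ATGT))', seq)]

for start, end in hits:
    # The start offset increases by one per match; the end offset does not change.
    print(start, end, seq[start:end])
```

The start offsets increase from match to match, which suggests the scan itself is left to right and that the greedy quantifier is what pins every match to the same end.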

CodePudding user response:

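You can use the regex module from PyPI, which supports both reversed and overlapped matching. The (?r) flag makes the search run backwards from the end of the string, and overlapped=True permits overlapping matches, so only a small addition to your initial pattern is needed: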

(?r)[ACGT]{4,}ATGT

For example:

import regex as re
seq = 'ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG'
matches = re.findall(r'(?r)[ACGT]{4,}ATGT', seq, overlapped=True)
print(matches)

Prints:

['GGATGTGGGGGGATGTCCCCCATGT', 'GGATGTGGGGGGATGT']
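
If installing a third-party module is not an option, something similar can be sketched with the standard library alone. This assumes that what is wanted is, for each unbroken run of [ACGT] characters, every prefix of that run that ends in ATGT and is at least 8 characters long, which is what the two expected strings in the question look like. The function name prefixes_ending_with and its parameters are made up for this example:

```python
import re

def prefixes_ending_with(seq, suffix="ATGT", min_len=8):
    """Return prefixes of each [ACGT] run that end with `suffix` (hypothetical helper)."""
    results = []
    # Split the sequence into maximal runs of valid bases (N and others break a run).
    for run in re.finditer(r"[ACGT]+", seq):
        text = run.group()
        pos = text.find(suffix)
        while pos != -1:
            end = pos + len(suffix)
            # Keep the prefix only if it satisfies the minimum length.
            if end >= min_len:
                results.append(text[:end])
            # Step by one character so overlapping suffix occurrences are not skipped.
            pos = text.find(suffix, pos + 1)
    return results

seq = 'ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG'
result = prefixes_ending_with(seq)
print(result)
```

Unlike the reversed search, this walks the string from left to right, so the shorter substring comes before the longer one.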