My requirement is very simple, but I could not figure out how to achieve it.
This is the original string: ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG
I want to find all substrings that consist only of [ACGT], end with ATGT, and have a length of at least 8. What I expect is:
GGATGTGGGGGGATGT
GGATGTGGGGGGATGTCCCCCATGT
With the following code:
import re
seq = 'ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG'
matches = re.findall("[ACGT]{4,}ATGT", seq)
if matches:
    for match in matches:
        print(match)
I got only
GGATGTGGGGGGATGTCCCCCATGT
The shorter one is missing. Then I realized that re.findall does not return overlapping matches. I found a solution in "How to use regex to find all overlapping matches" and modified the code to:
matches = re.findall("(?=([ACGT]{4,}ATGT))", seq)
Then I got:
GGATGTGGGGGGATGTCCCCCATGT
GATGTGGGGGGATGTCCCCCATGT
ATGTGGGGGGATGTCCCCCATGT
TGTGGGGGGATGTCCCCCATGT
GTGGGGGGATGTCCCCCATGT
TGGGGGGATGTCCCCCATGT
GGGGGGATGTCCCCCATGT
GGGGGATGTCCCCCATGT
GGGGATGTCCCCCATGT
GGGATGTCCCCCATGT
GGATGTCCCCCATGT
GATGTCCCCCATGT
ATGTCCCCCATGT
TGTCCCCCATGT
GTCCCCCATGT
TCCCCCATGT
CCCCCATGT
CCCCATGT
Then I realized that this search seems to start from right to left. So how can I ask re.findall to search from left to right and also allow overlapping matches?
Answer:
You can use the third-party regex module from PyPI, which supports reverse searching (the (?r) flag) and overlapped matching (the overlapped=True argument). It needs only a small addition to your initial pattern:
(?r)[ACGT]{4,}ATGT
For example:
import regex as re
seq = 'ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG'
matches = re.findall(r'(?r)[ACGT]{4,}ATGT', seq, overlapped=True)
print(matches)
Prints:
['GGATGTGGGGGGATGTCCCCCATGT', 'GGATGTGGGGGGATGT']
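If installing a third-party package is not an option, the same two substrings can probably be obtained with the standard re module alone. The following is a minimal sketch, not part of the answer above. The function name substrings_ending_with and its parameters are hypothetical, and it assumes that what is wanted is, for every occurrence of ATGT, the longest run of [ACGT] characters that ends there.

```python
import re

def substrings_ending_with(seq, suffix="ATGT", min_len=8, alphabet="ACGT"):
    """Hypothetical helper: for each occurrence of `suffix`, return the
    longest run of `alphabet` characters ending at that occurrence."""
    results = []
    # Split the sequence into maximal runs of allowed characters.
    # Assumption: `alphabet` contains no regex-special characters.
    for run in re.finditer(f"[{alphabet}]+", seq):
        text = run.group()
        # A zero-width lookahead reports every position where the suffix
        # starts, including overlapping occurrences.
        for hit in re.finditer(f"(?={re.escape(suffix)})", text):
            end = hit.start() + len(suffix)
            # Keep the match only if it reaches the minimum length.
            if end >= min_len:
                results.append(text[:end])
    return results

seq = "ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG"
for match in substrings_ending_with(seq):
    print(match)
```

Unlike the reverse search above, this sketch lists the results from left to right, so the shorter substring comes first.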