How to isolate protein codons from a DNA strand in Python

I'm working with DNA strands, and this code is meant to find the initiation codon (codaoi) and one of the three stop codons (codaof1, codaof2 or codaof3), then slice the original DNA strand between those positions.

For example: XXXATGYYYYYYTAGXXX

With correct code I would get YYYYYY, but I always get the else answer, "No protein".

def isolarprot(seqDNA):
    codaof1=("TAG")
    codaof2=("TAA")
    codaof3=("TGA")
    codaoi=("ATG")
    i=0
    f=0
    for i in range(0,len(seqDNA),3):
        pi=seqDNA.find(codaoi)
    for f in range(0,len(seqDNA),3):
        if codaof1 in seqDNA[i:(i+3)] and codaoi in seqDNA[i:(i+3)]:
            pf1=seqDNA.find(codaof1)
            prote=slice(pi,pf1+3)
            return seqDNA[prote]
        elif codaof2 in seqDNA[i:(i+3)] and codaoi in seqDNA[i:(i+3)]:
            pf2=seqDNA.find(codaof2)
            prote=slice(pi,pf2+3)
            return seqDNA[prote]
        elif codaof3 in seqDNA[i:(i+3)] and codaoi in seqDNA[i:(i+3)]:
            pf3=seqDNA.find(codaof3)
            prote=slice(pi,pf3+3)
            return seqDNA[prote]
        else:
            return "No protein"

CodePudding user response：

The regular expression below catches multiple occurrences of the DNA section you are looking for. It uses a positive lookbehind and a positive lookahead, coupled with the lazy quantifier *?, so that each match stops at the first stop codon and several occurrences can be found:

seqDNA = "XXXATGYYYYYYTAGXXX XXXATGyyyyyTAAXXX ATGvvvvTGA ATGxxxxGTA"
import re
regex = r"(?<=ATG)(.*?)(?=TAG|TAA|TGA)"
# or: 
#    regex = r"ATG(.*?)(?:TAG|TAA|TGA)"
DNAsliceList = re.findall(regex, seqDNA)
print(DNAsliceList)

gives:

['YYYYYY', 'yyyyy', 'vvvv']
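
To see why the lazy quantifier matters, here is a small comparison, just a sketch. The test string seq is made up for illustration (lower-case letters stand in for the coding region so they can't accidentally form a codon):

```python
import re

# Hypothetical test string containing two ATG...stop regions.
seq = "ATGaaaTAGcccATGbbbTAA"

# Greedy: .* runs to the LAST stop codon, swallowing everything in between.
greedy = re.findall(r"ATG(.*)(?:TAG|TAA|TGA)", seq)

# Lazy: .*? stops at the FIRST stop codon after each ATG.
lazy = re.findall(r"ATG(.*?)(?:TAG|TAA|TGA)", seq)

print(greedy)  # ['aaaTAGcccATGbbb']
print(lazy)    # ['aaa', 'bbb']
```

With the greedy version the two regions are merged into one match, which is usually not what you want when a strand holds more than one protein.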

CodePudding user response：

Python's re module provides a way to search for substrings within complicated strings. Online regex testers are handy for trying patterns out.

import re
def isolarprot(seqDNA):
    # Lazy .*? stops at the first stop codon, so several proteins in one strand stay separate
    re_pattern = r'ATG(.*?)(TAG|TAA|TGA)'
    matches = re.findall(re_pattern, seqDNA)
    return [match[0] for match in matches]
    
dna_str = 'XXXATGYYYYYYTGAXXX'
print(isolarprot(dna_str))

In the sample code above, re_pattern is the pattern being searched for. Anything that matches a part of the pattern enclosed in parentheses () is captured. Here you want the first capture group, which matches everything between the initiation codon and the stop codon (the stop codon itself is captured in the second group). Because the pattern has two groups, re.findall returns one tuple per match, and match[0] picks out the first group.
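
One thing the patterns in this thread don't check is the reading frame: a stop codon only counts if it sits a multiple of 3 bases after ATG. If that matters for your data, something along these lines might work. It is only a sketch: the names ORF_PATTERN and find_orfs are mine, and it assumes the sequence is upper-case and contains only A, C, G and T.

```python
import re

# Assumption: upper-case sequence made up of A, C, G, T only.
# Group 1: zero or more whole codons, matched lazily.
# Group 2: the stop codon that ends the region.
ORF_PATTERN = re.compile(r"ATG((?:[ACGT]{3})*?)(TAG|TAA|TGA)")

def find_orfs(seq_dna):
    """Return (coding_region, stop_codon) pairs for each in-frame match."""
    return [(m.group(1), m.group(2)) for m in ORF_PATTERN.finditer(seq_dna)]

# "TGA" appears out of frame inside CTGACC and is skipped;
# the in-frame stop codon is TAA.
print(find_orfs("GGATGCTGACCTAAGG"))  # [('CTGACC', 'TAA')]
```

Here m.group(1) and m.group(2) are the same two capture groups described above, just read from match objects instead of the tuples that re.findall returns.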