Adjusting the function to find the location of more than one base

I wrote this function, which finds the positions of a single base in a DNA sequence such as dna = ['A', 'G', 'C', 'G', 'T', 'A', 'G', 'T', 'C', 'G', 'A', 'T', 'C', 'A', 'A', 'T', 'T', 'A', 'T', 'A', 'C', 'G', 'A', 'T', 'C', 'G', 'G', 'G', 'T', 'A', 'T']. I need it to find more than one base at a time, like 'A' 'T'. Can anyone help?

def position(list, value):
    pos = []
    for n in range(len(list)):
        if list[n] == value:
            pos.append(n)
    return pos

CodePudding user response：

You can join the DNA sequence into a string and then use a regular expression. Each match gives a (start, end) pair, with end exclusive:

import re

dna_str = ''.join(dna)

pattern = r'AT'

pos = [(i.start(0), i.end(0)) for i in re.finditer(pattern, dna_str)]
print(pos)

[(10, 12), (14, 16), (17, 19), (22, 24), (29, 31)]
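
One caveat: re.finditer only reports non-overlapping matches. That makes no difference for 'AT', but it does for a motif such as 'ATA' that can overlap itself. Below is a minimal sketch of a workaround using a zero-width lookahead; overlapping_starts is a hypothetical helper name, not something from the answer above.

```python
import re

def overlapping_starts(dna_str, motif):
    # A lookahead matches at a position without consuming characters,
    # so every start index is tested, including overlapping ones.
    pattern = re.compile(f"(?={re.escape(motif)})")
    return [m.start() for m in pattern.finditer(dna_str)]

# Plain finditer skips the second, overlapping occurrence.
print([m.start() for m in re.finditer("ATA", "ATATA")])  # [0]
print(overlapping_starts("ATATA", "ATA"))                # [0, 2]
```

Note that this returns start indices only, since a lookahead match has zero width.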

CodePudding user response：

Side note: it's best not to use built-in names as variable names. list is a Python built-in type (not a keyword), and using it as a parameter name shadows it inside the function.

def position(l: list, values: list) -> list:
    pos = []
    for i, val in enumerate(l):
        # keep the index of every element that is one of the wanted values
        if val in values:
            pos.append(i)
    return pos
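
To make the behaviour of this membership-based approach concrete, here is a small sketch with a usage example. It matches each base on its own, so it reports every 'A' and every 'T', not only an 'A' directly followed by a 'T'. The name positions_of_any and the set conversion are my own assumptions, added for illustration.

```python
def positions_of_any(seq, bases):
    # Convert to a set once so each membership test is O(1).
    wanted = set(bases)
    return [i for i, base in enumerate(seq) if base in wanted]

print(positions_of_any(list("GATTACA"), ["A", "T"]))  # [1, 2, 3, 4, 6]
```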

CodePudding user response：

You can lean on Python's built-ins here. For instance, instead of position(list, value) you could use a list comprehension with enumerate:

[n for n,x in enumerate(dna) if x == 'A']

Finding a bigram could be reduced to the above if you consider pairs of letters:

[n for n, x in enumerate(zip(dna[:-1], dna[1:])) if x == ('A', 'T')]

If instead you want to find the positions of either 'A' or 'T', you could just specify that as the condition

[n for n,x in enumerate(dna) if x in ('A', 'T')]
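
The pair trick generalises to motifs of any length if you compare slices instead of zipped pairs. This is only a sketch under that assumption; kmer_positions is a hypothetical name, and dna is the sequence from the question written as a string for brevity.

```python
def kmer_positions(seq, kmer):
    # Compare each window of len(kmer) elements against the motif.
    target = list(kmer)
    k = len(target)
    if k == 0:
        return []  # an empty motif has no meaningful positions
    return [i for i in range(len(seq) - k + 1) if list(seq[i:i + k]) == target]

dna = list("AGCGTAGTCGATCAATTATACGATCGGGTAT")
print(kmer_positions(dna, "AT"))   # [10, 14, 17, 22, 29]
print(kmer_positions(dna, "ATA"))  # [17]
```

Each window is rebuilt as a list, so this costs roughly len(seq) * len(kmer) comparisons, which is fine for short motifs.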

CodePudding user response：

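Python's str.find efficiently locates a substring starting from any index and returns -1 when there is no further match, so you can call it repeatedly, resuming one position past each hit: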

def positions(dnalist, substr):
    dna = "".join(dnalist)  # make single string
    st = 0
    pos = []
    while True:
        a_pos = dna.find(substr, st)
        if a_pos < 0:  # find returns -1 when nothing is left
            return pos
        pos.append(a_pos)
        st = a_pos + 1  # resume just past the last hit
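
For reference, the same str.find loop can be written as a generator, which avoids building the list when you only need to iterate. This is a sketch, and find_all is a hypothetical name. Because the search resumes one character past each hit, overlapping occurrences are reported too.

```python
def find_all(dna_list, substr):
    dna = "".join(dna_list)  # str.find needs a single string
    start = dna.find(substr)
    while start != -1:
        yield start
        # Step forward by one so overlapping matches are not skipped.
        start = dna.find(substr, start + 1)

dna = list("AGCGTAGTCGATCAATTATACGATCGGGTAT")
print(list(find_all(dna, "AT")))             # [10, 14, 17, 22, 29]
print(list(find_all(list("ATATA"), "ATA")))  # [0, 2]
```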