I wrote this function, which finds the positions of a base in a DNA sequence, for example dna = ['A', 'G', 'C', 'G', 'T', 'A', 'G', 'T', 'C', 'G', 'A', 'T', 'C', 'A', 'A', 'T', 'T', 'A', 'T', 'A', 'C', 'G', 'A', 'T', 'C', 'G', 'G', 'G', 'T', 'A', 'T']. I need it to find more than one base at a time, like 'A' 'T'. Can anyone help?
def position(list, value):
    pos = []
    for n in range(len(list)):
        if list[n] == value:
            pos.append(n)
    return pos
CodePudding user response:
You can work with the DNA sequence as a string and then use a regex. Each tuple in the result is the (start, end) index pair of one match:
import re
dna_str = ''.join(dna)
pattern = r'AT'
pos = [(i.start(0), i.end(0)) for i in re.finditer(pattern, dna_str)]
print(pos)
[(10, 12), (14, 16), (17, 19), (22, 24), (29, 31)]
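One caveat with re.finditer: it reports non-overlapping matches only, so a motif that can overlap itself (for example 'AA' inside 'AAA') would be reported once. A minimal sketch of one common workaround is to wrap the pattern in a zero-width lookahead. The helper name overlapping_positions is made up for illustration and is not part of the answer above:

```python
import re

def overlapping_positions(dna, motif):
    """Return the start index of every occurrence of motif, overlaps included."""
    dna_str = ''.join(dna)  # accepts a list of bases or a plain string
    # (?=...) is a zero-width lookahead: it matches without consuming
    # characters, so the scan can continue from the very next base.
    pattern = '(?=' + re.escape(motif) + ')'
    return [m.start() for m in re.finditer(pattern, dna_str)]

print(overlapping_positions(list('AAAT'), 'AA'))  # [0, 1]
```

For a motif like 'AT', which cannot overlap itself, this should give the same start indices as the plain pattern.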
CodePudding user response:
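Side note: it is good practice not to reuse built-in names for variables. list is not a keyword, but it is a built-in type, and using it as a parameter name shadows it inside the function. This version returns every position whose base is any of the given values:
def position(l: list, values: list) -> list:
    pos = []
    for i, val in enumerate(l):
        if val in values:
            pos.append(i)
    return pos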
CodePudding user response:
You should definitely use Python's built-in features. For instance, instead of position(list, value) you could use a list comprehension:
[n for n, x in enumerate(dna) if x == 'A']
Finding a bigram can be reduced to the above if you consider pairs of adjacent letters:
[n for n, x in enumerate(zip(dna[:-1], dna[1:])) if x == ('A', 'T')]
If instead you want to find the positions of either 'A' or 'T', you can just specify that as the condition:
[n for n, x in enumerate(dna) if x in ('A', 'T')]
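The pair trick above extends to runs of any length by comparing a sliding window against the motif. This is only a sketch under the assumption that the sequence is a list (or string) of single-letter bases; the name kmer_positions is hypothetical:

```python
def kmer_positions(dna, motif):
    """Start indices where the consecutive bases in motif occur in dna."""
    dna = list(dna)      # accept a list of bases or a plain string
    motif = list(motif)  # accept 'AT' or ['A', 'T']
    k = len(motif)
    # Slide a window of length k over the sequence and compare it to the motif.
    return [i for i in range(len(dna) - k + 1) if dna[i:i + k] == motif]

# The sequence from the question, written as a string for brevity.
dna = list('AGCGTAGTCGATCAATTATACGATCGGGTAT')
print(kmer_positions(dna, 'AT'))   # [10, 14, 17, 22, 29]
print(kmer_positions(dna, 'TAT'))  # [16, 28]
```

Slicing creates a short list per window, which is fine for sequences of this size; for very long sequences a string-based search is likely to be faster.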
CodePudding user response:
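Python's str.find can efficiently search for a substring starting from any index:
def positions(dnalist, substr):
    dna = "".join(dnalist)  # make single string
    st = 0  # index to resume the search from
    pos = []
    while True:
        a_pos = dna.find(substr, st)
        if a_pos < 0:  # find returns -1 when there are no more matches
            return pos
        pos.append(a_pos)
        st = a_pos + 1  # step one past this match so overlapping hits are also found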
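If several different motifs are needed in one call, the same str.find loop can be wrapped in a small helper that returns one list of positions per motif. The following is a rough sketch, not part of the answer above; find_all and positions_by_motif are hypothetical names:

```python
def find_all(dna_str, motif):
    """Yield every start index of motif in dna_str, overlaps included."""
    start = dna_str.find(motif)
    while start != -1:  # str.find returns -1 once nothing more is found
        yield start
        start = dna_str.find(motif, start + 1)

def positions_by_motif(dnalist, motifs):
    """Map each motif to the list of its start positions in the sequence."""
    dna_str = "".join(dnalist)  # join once, search many times
    return {m: list(find_all(dna_str, m)) for m in motifs}

# The sequence from the question, written as a string for brevity.
dna = list('AGCGTAGTCGATCAATTATACGATCGGGTAT')
print(positions_by_motif(dna, ['AT', 'CG']))
# {'AT': [10, 14, 17, 22, 29], 'CG': [2, 8, 20, 24]}
```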