I am trying to create two lists containing the 'start' and 'end' indices of the stretches where two strings match position by position. The two strings have equal length. For example:
str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'
Here, the matched stretches are: GG, CG, CG
I want the following kind of outputs:
list = [2,3,6,7,10,11] #list of the matched indices
start = [2,6,10] #start indices of the matched lengths
end = [3,7,11] #end indices of the matched lengths
My current code is below. It compares the strings character by character, but I want the indices that locate the matched sequences.
str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'
result1 = ''
result2 = ''
#handle the case where one string is longer than the other
maxlen=len(str2) if len(str1)<len(str2) else len(str1)
#loop through the characters
for i in range(maxlen):
    letter1=str1[i:i+1]
    letter2=str2[i:i+1]
    if ((letter1 == letter2) and letter1 in ['A','T','C','G'] and letter2 in ['A','T','C','G']):
        result1 +=letter1
        result2 +=letter2
CodePudding user response:
This is practically crying out for zip:
str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'
matches = []
for i,(a,b) in enumerate(zip(str1,str2)):
    if a == b:
        # start a new run unless the previous run ended at i-1
        if not matches or matches[-1][1] != i-1:
            matches.append([i,i])
        else:
            matches[-1][1] += 1
print(matches)
starts = [k[0] for k in matches]
ends = [k[1] for k in matches]
Output:
[[2, 3], [6, 7], [10, 11]]
This will also capture single character matches. You can filter those out in a quick loop after, if need be.
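A minimal sketch of that filtering step, assuming matches holds [start, end] pairs with inclusive ends as in the output above. The sample list and the name min_len are made up for illustration, not part of the answer.

```python
# Hypothetical sample: inclusive [start, end] pairs,
# including one single-character match at index 5.
matches = [[2, 3], [5, 5], [6, 7], [10, 11]]

min_len = 2  # assumed threshold: keep runs of at least two characters
filtered = [m for m in matches if m[1] - m[0] + 1 >= min_len]

starts = [s for s, _ in filtered]
ends = [e for _, e in filtered]
print(filtered)
```

A list comprehension does the same job as the "quick loop" and keeps the pairs in their original order.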
CodePudding user response:
You could also do something similar with a regex.
import re
str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'
pat = 'GG|CG|CG'
matches = [[(m.span()[0],m.span()[1]-1) for m in re.finditer(pat,x)] for x in [str1,str2]]
# sort the intersection: a set has no guaranteed order
m = sorted(set(matches[0]) & set(matches[1]))
starts= [x[0] for x in m]
ends= [x[1] for x in m]
print(m,starts,ends, sep='\n')
Output:
[(2, 3), (6, 7), (10, 11)]
[2, 6, 10]
[3, 7, 11]
CodePudding user response:
You can also use numpy.split to split on nonconsecutive indices:
import numpy as np
lst = [i for i, (s1,s2) in enumerate(zip(str1, str2)) if s1==s2]
# slice boundaries of the runs in lst (pure-Python alternative, unused below)
splits = [0] + [idx+1 for idx, (i,j) in enumerate(zip(lst, lst[1:])) if j-i != 1] + [len(lst)]
start, end = zip(*[[arr[0], arr[-1]] for arr in np.split(lst, np.where(np.diff(lst) != 1)[0]+1)])
Output:
((2, 6, 10), (3, 7, 11))
CodePudding user response:
There are a few corrections to your code: 1) max() is a built-in, so there is no need for an if statement; 2) strings are already sequences, so "a" in "bbbbabb" already returns True and there is no need to put each letter in a list.
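As a rough illustration of those two points, here is a small sketch that reuses the strings from the question. The variable names letter and is_base are only for this example.

```python
str1 = 'ATGGATCGATCG'
str2 = 'CGGGCGCGCGCG'

# max() replaces the conditional expression from the question
maxlen = max(len(str1), len(str2))

# membership test directly against a string instead of a list
letter = 'G'
is_base = letter in 'ATCG'

# Caveat: `in` on a string is a substring test, so the empty string
# (what str1[i:i+1] returns past the end) also counts as "in" it.
empty_is_in = '' in 'ATCG'

print(maxlen, is_base, empty_is_in)
```

Because of that caveat, the list form and the string form are not fully interchangeable when a slice can come back empty.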
It appears you need a function that determines how long a prefix two strings have in common.
import itertools as it
def f(s,t):
    return sum(it.takewhile(bool,map(lambda z:z[0]==z[1],zip(s,t))))
With such a function, we can now do as you describe and find all simultaneous matches of any length between strings:
str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'
matches = [(i,i+l-1) for i,(a,b) in enumerate(zip(str1,str2)) if (l:=f(str1[i:],str2[i:]))>=2]
print(matches)
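To see how this behaves, the sketch below wraps the same comprehension in a function. The name runs and the min_len parameter are hypothetical, and the walrus operator needs Python 3.8 or later. The second call suggests that a run longer than two characters is also reported from its later starting positions, so the results can overlap.

```python
import itertools as it

def f(s, t):
    # length of the common prefix of s and t
    return sum(it.takewhile(bool, map(lambda z: z[0] == z[1], zip(s, t))))

def runs(s, t, min_len=2):
    # hypothetical wrapper around the comprehension above
    return [(i, i + l - 1) for i in range(min(len(s), len(t)))
            if (l := f(s[i:], t[i:])) >= min_len]

print(runs('ATGGATCGATCG', 'CGGGCGCGCGCG'))  # [(2, 3), (6, 7), (10, 11)]
print(runs('AGGGA', 'CGGGC'))                # [(1, 3), (2, 3)]
```

If overlapping pairs are not wanted, one option is to skip any pair whose start falls inside the previously accepted pair.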
CodePudding user response:
Let's start with a helper function that calculates the length of the common prefix of two strings starting at a given index:
def helper(index, str1, str2):
    length = 0
    try:
        while str1[index] == str2[index]: #and other needed conditions
            length += 1
            index += 1
    except IndexError:
        pass
    return length
Now we want to use it while iterating over the strings:
index = 0
result = []
while index < min(len(str1), len(str2)):
    length = helper(index, str1, str2)
    if length > 0:
        result.append((index, index + length - 1)) # inclusive end index
        index += length + 1 # We can omit one character as it was checked in helper
    else:
        index += 1
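The question asked for three lists: the flat list of matched indices plus the start and end lists. A minimal sketch of deriving them, assuming result holds (start, end) pairs with inclusive ends. The sample values are the ones from the question, and the name indices is illustrative.

```python
# Assumed input: inclusive (start, end) pairs for the sample strings
result = [(2, 3), (6, 7), (10, 11)]

start = [s for s, _ in result]
end = [e for _, e in result]
# expand every pair into the individual indices it covers
indices = [i for s, e in result for i in range(s, e + 1)]

print(indices, start, end, sep='\n')
```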