Start and End indices of the matched strings after comparison

I am trying to create two lists containing the 'start' and 'end' indices of the stretches where two strings match. In this case, the two strings are of equal length. For example:

str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'

Here, the matched stretches are: GG, CG, CG
I want the following kind of outputs:

list = [2,3,6,7,10,11] #list of the matched indices
start = [2,6,10] #start indices of the matched stretches
end = [3,7,11] #end indices of the matched stretches

My code currently looks like the following, but it only collects the matched characters; I want the indices that locate the matched sequences.

str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'

result1 = ''
result2 = ''

#handle the case where one string is longer than the other
maxlen=len(str2) if len(str1)<len(str2) else len(str1)

#loop through the characters
for i in range(maxlen):
    letter1=str1[i:i+1]
    letter2=str2[i:i+1]
    if ((letter1 == letter2) and letter1 in ['A','T','C','G'] and letter2 in ['A','T','C','G']):
        result1+=letter1
        result2+=letter2

CodePudding user response：

This is practically crying out for zip:

str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'

matches = []
for i,(a,b) in enumerate(zip(str1,str2)):
    if a == b:
        # start a new [start, end] pair unless this index extends the previous one
        if not matches or matches[-1][1] != i-1:
            matches.append([i,i])
        else:
            matches[-1][1] += 1

print(matches)
starts = [k[0] for k in matches]
ends   = [k[1] for k in matches]

Output:

[[2, 3], [6, 7], [10, 11]]

This will also capture single character matches. You can filter those out in a quick loop after, if need be.
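
A minimal sketch of that filtering step, assuming `matches` holds inclusive [start, end] pairs as above. The sample pairs and the `min_len` threshold are made up for illustration:

```python
# Hypothetical input: inclusive [start, end] pairs, including a
# one-character match at index 0
matches = [[0, 0], [2, 3], [6, 7], [10, 11]]
min_len = 2  # assumed threshold: keep runs of at least two characters

# An inclusive pair [s, e] covers e - s + 1 characters
filtered = [[s, e] for s, e in matches if e - s + 1 >= min_len]
starts = [s for s, _ in filtered]
ends = [e for _, e in filtered]

print(filtered)
print(starts, ends)
```

A list comprehension does the same job as the loop mentioned above; either form works.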

CodePudding user response：

You could also do something similar w/regex.

import re
str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'

pat = 'GG|CG|CG'

matches = [[(m.span()[0],m.span()[1]-1) for m in re.finditer(pat,x)] for x in [str1,str2]]

m = sorted(set(matches[0]) & set(matches[1]))  # sort, since set order is not guaranteed
starts = [x[0] for x in m]
ends = [x[1] for x in m]

print(m, starts, ends, sep='\n')

Output

[(2, 3), (6, 7), (10, 11)]
[2, 6, 10]
[3, 7, 11]
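
The pattern above is hardcoded to the stretches already known to match. One possible variation, sketched here rather than taken from the answer, first builds a mask string ('1' where the characters agree, '0' otherwise) and then lets `re.finditer` locate the runs. The name `mask` and the `1{2,}` pattern (runs of two or more) are assumptions for illustration:

```python
import re

str1 = 'ATGGATCGATCG'
str2 = 'CGGGCGCGCGCG'

# '1' where the two strings agree at a position, '0' where they differ
mask = ''.join('1' if a == b else '0' for a, b in zip(str1, str2))

# Each run of two or more '1's is one matched stretch; m.end() is exclusive,
# so subtract 1 to get an inclusive end index
spans = [(m.start(), m.end() - 1) for m in re.finditer(r'1{2,}', mask)]
starts = [s for s, _ in spans]
ends = [e for _, e in spans]

print(spans, starts, ends, sep='\n')
```

Because `finditer` scans left to right, the spans come out in order without sorting.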

CodePudding user response：

You can also use numpy.split to split on nonconsecutive indices:

import numpy as np

lst = [i for i, (s1,s2) in enumerate(zip(str1, str2)) if s1==s2]
# pure-Python version of the split points (not used below)
splits = [0] + [idx+1 for idx, (i,j) in enumerate(zip(lst, lst[1:])) if j-i != 1] + [len(lst)]
start, end = zip(*[[int(arr[0]), int(arr[-1])] for arr in np.split(lst, np.where(np.diff(lst) != 1)[0] + 1)])

Output:

((2, 6, 10), (3, 7, 11))

CodePudding user response：

A few corrections to your code: 1) max() is a built-in, so there is no need for an if expression; 2) strings already support membership tests, so "a" in "bbbbabb" returns True and there is no need to put each letter in a list.
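
Both points can be sketched on the question's loop as follows. This is one possible tidy-up, not the only one; the single `result` string and the extra emptiness check are assumptions added here:

```python
str1 = 'ATGGATCGATCG'
str2 = 'CGGGCGCGCGCG'

# max() replaces the conditional expression
maxlen = max(len(str1), len(str2))

result = ''
for i in range(maxlen):
    letter1 = str1[i:i + 1]  # slicing past the end gives '' instead of raising
    letter2 = str2[i:i + 1]
    # Membership is tested against a string directly; the emptiness check is
    # needed because '' in 'ATCG' is also True
    if letter1 and letter1 == letter2 and letter1 in 'ATCG':
        result += letter1

print(result)  # GGCGCG
```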

It appears you need a function that determines how long the common prefix of two strings is.

import itertools as it
def f(s,t): 
    return sum(it.takewhile(bool,map(lambda z:z[0]==z[1],zip(s,t))))
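
To make the behaviour of `f` concrete, here is a small sketch that repeats the definition so it runs on its own and calls it on a few made-up inputs:

```python
import itertools as it

def f(s, t):
    # Count the leading positions where s and t agree;
    # takewhile stops at the first mismatch
    return sum(it.takewhile(bool, map(lambda z: z[0] == z[1], zip(s, t))))

print(f('GGAT', 'GGCG'))  # 2
print(f('ATGG', 'CGGG'))  # 0
print(f('CG', 'CG'))      # 2
```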

With such a function, we can now do as you describe and find all simultaneous matches of any length between strings:

str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'

matches = [(i,i+l-1) for i,(a,b) in enumerate(zip(str1,str2)) if (l:=f(str1[i:],str2[i:]))>=2]
print(matches)

CodePudding user response：

Let's start with a helper function that calculates the length of the common prefix of two strings starting at a given index:

def helper(index, str1, str2):
    length = 0
    try:
        while str1[index] == str2[index]: #and other needed conditions
            length += 1
            index += 1
    except IndexError:
        pass
    return length

Now we use it while iterating over the strings:

index = 0
result = []
while index < min(len(str1), len(str2)):
    length = helper(index, str1, str2)
    if length > 0:
        # end is exclusive here; use index + length - 1 for an inclusive end
        result.append((index, index + length))
        index += length + 1 # We can omit one character as it was checked in helper
    else:
        index += 1
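
To get from such spans to the three lists the question asks for, something like the following sketch could be used. It assumes `result` holds (start, end) tuples with an exclusive end, which is an interpretation of the code above; the sample values are what the question's strings would produce under that assumption, and the name `matched` is chosen here to avoid shadowing the built-in `list`:

```python
# Assumed shape: (start, exclusive_end) tuples for the question's example strings
result = [(2, 4), (6, 8), (10, 12)]

start = [s for s, _ in result]
end = [e - 1 for _, e in result]  # convert to inclusive end indices
matched = [i for s, e in result for i in range(s, e)]  # every matched index

print(matched)
print(start)
print(end)
```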