I am wondering if there is a fast Python way to check if a subsequence exists in a string, while allowing for 1-3 mismatches.
For example, given the string "ATGCTGCTGA", the subsequence "ATGCC" would be acceptable and return True. Note that there is 1 mismatch: the final C of the subsequence, where the string has a T.
I have tried to use the pairwise2 functions from the Bio package but it is very slow. I have 1,000 strings, and for each string I want to test 10,000 subsequences.
Computation speed would be prioritized here.
Note: I don't mean gaps, but one nucleotide (the letter A, T, C, or G) being substituted for another one.
CodePudding user response:
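One can zip the two strings, compare the characters in each pair, and count the False results. Here the count is just 1. Note that zip stops at the shorter string, so this only compares the subsequence against the start of the string. It should not be slow.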
st = "ATGCTGCTGA"
s = "ATGCC"
[ x==y for (x,y) in zip(st,s)].count(False)
1
CodePudding user response:
Try:
ALLOWED_MISMATCHES = 3
s = "ATGCTGCTGA"
subsequence = "ATGCC"
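# Slide the subsequence across every possible start position in s
for i in range(len(s) - len(subsequence) + 1):
    # zip stops at the end of subsequence, so only one window is compared
    if sum(a != b for a, b in zip(s[i:], subsequence)) <= ALLOWED_MISMATCHES:
        print("Match")
        break
else:
    # Runs only if the loop finished without hitting break
    print("No Match")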
Prints:
Match
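Since speed is the priority, one possible refinement of the loop above is to stop counting as soon as a window exceeds the allowed number of mismatches, and to wrap the check in a function so it can be called for each string and subsequence pair. This is only a minimal sketch, assuming substitutions are the only kind of mismatch (no gaps); the name has_approx_match and its parameters are illustrative and not from any library.

```python
def has_approx_match(text, pattern, max_mismatches=3):
    """Return True if pattern occurs somewhere in text with at most
    max_mismatches substituted characters (no gaps)."""
    m = len(pattern)
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # this window cannot match, try the next start
        else:
            # Inner loop finished without break: window is within the limit
            return True
    return False


print(has_approx_match("ATGCTGCTGA", "ATGCC", max_mismatches=1))
```

Whether the early exit actually beats the sum() version depends on the data, so it is worth timing both on a sample of the real input.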