Fast Python way to check if subsequence exists in a string and allow for 1-3 mismatches?

I am wondering if there is a fast Python way to check if a subsequence exists in a string, while allowing for 1-3 mismatches.

For example, given the string "ATGCTGCTGA", the subsequence "ATGCC" would be acceptable and return True. Note that there is 1 mismatch: the final C of the subsequence lines up with a T in the string.

I have tried to use the pairwise2 functions from the Bio package but it is very slow. I have 1,000 strings, and for each string I want to test 10,000 subsequences.

Computation speed would be prioritized here.

Note: I don't mean gaps, but one nucleotide (the letter A, T, C, or G) being substituted for another one.

CodePudding user response：

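One can zip the two strings, compare the characters pairwise, and count the False results, which are the mismatches. Here the count is 1. This should not be slow. Note that zip stops at the end of the shorter string, so this compares the subsequence only against the start of the string.

st = "ATGCTGCTGA"
s = "ATGCC"
[x == y for (x, y) in zip(st, s)].count(False)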

1
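
As a rough sketch, the same zip-and-count idea can be wrapped in a helper that also takes a start position, so it can be reused for alignments other than the start of the string. The name count_mismatches and its offset parameter are hypothetical, made up for illustration only.

```python
def count_mismatches(text, pattern, offset=0):
    """Count substitutions between pattern and the window of text starting at offset."""
    window = text[offset:offset + len(pattern)]
    if len(window) < len(pattern):
        # The pattern would run past the end of the text at this offset.
        raise ValueError("pattern does not fit at this offset")
    # True counts as 1 in sum(), so this adds up the mismatching positions.
    return sum(a != b for a, b in zip(window, pattern))


print(count_mismatches("ATGCTGCTGA", "ATGCC"))     # 1 (ATGCT vs ATGCC)
print(count_mismatches("ATGCTGCTGA", "ATGCC", 3))  # 2 (CTGCT vs ATGCC)
```

Only substitutions are counted, with no gaps, which matches the note in the question.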

CodePudding user response：

Try:

ALLOWED_MISMATCHES = 3

s = "ATGCTGCTGA"
subsequence = "ATGCC"

for i in range(len(s) - len(subsequence) + 1):
    if sum(a != b for a, b in zip(s[i:], subsequence)) <= ALLOWED_MISMATCHES:
        print("Match")
        break
else:
    print("No Match")

Prints:

Match
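
Since the question involves many strings and many subsequences, the loop above could be turned into a reusable function. The sketch below is one possible variant, not a benchmarked solution: it stops counting a window as soon as the mismatch limit is exceeded, which may save work when most windows are poor matches. The name has_approx_match and its parameters are assumptions made for illustration.

```python
def has_approx_match(text, pattern, max_mismatches=3):
    """Return True if pattern fits somewhere in text with at most
    max_mismatches substitutions (no gaps)."""
    m = len(pattern)
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # this window already fails; move to the next start
        else:
            # The inner loop ended without a break, so this window is acceptable.
            return True
    return False


print(has_approx_match("ATGCTGCTGA", "ATGCC", max_mismatches=1))  # True
print(has_approx_match("ATGCTGCTGA", "CCCCC", max_mismatches=1))  # False
```

For the full workload, this would be called once per (string, subsequence) pair, for example inside a nested loop or a comprehension. Whether it is fast enough for 1,000 strings and 10,000 subsequences would need to be measured.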