Fast python way to check if subsequence exists in a string and allow for 1-3 mismatches?

Time:08-20

I am wondering if there is a fast Python way to check if a subsequence exists in a string, while allowing for 1-3 mismatches.

For example, given the string "ATGCTGCTGA", the subsequence "ATGCC" would be acceptable and return true. Note that there is 1 mismatch: the final "C" of the subsequence, where the string has "T".

I have tried to use the pairwise2 functions from the Bio package but it is very slow. I have 1,000 strings, and for each string I want to test 10,000 subsequences.

Computation speed would be prioritized here.

** Note I don't mean gaps, but one nucleotide (the letter A, T, C, G) being substituted for another one.

CodePudding user response:

You can zip the two strings, compare the characters pairwise, and count how many comparisons are False. Here the count is just 1. It should not be slow...

st = "ATGCTGCTGA"
s = "ATGCC"

[x == y for (x, y) in zip(st, s)].count(False)

1
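
Note that the snippet above compares the subsequence only against the start of the string, because zip stops at the shorter input. As a minimal sketch, the same zip-based count can be wrapped in a helper that also accepts a starting position. The name count_mismatches and the offset parameter are illustrative assumptions, not part of the original answer.

```python
def count_mismatches(text, pattern, offset=0):
    """Count substituted positions between pattern and the window
    text[offset:offset + len(pattern)] (hypothetical helper)."""
    window = text[offset:offset + len(pattern)]
    if len(window) < len(pattern):
        # The pattern would run past the end of the text.
        raise ValueError("pattern does not fit at this offset")
    # True counts as 1, so sum() gives the number of differing positions.
    return sum(a != b for a, b in zip(window, pattern))


print(count_mismatches("ATGCTGCTGA", "ATGCC"))     # 1
print(count_mismatches("ATGCTGCTGA", "ATGCC", 3))  # 2
```

Calling it once per offset gives the mismatch count for every possible alignment, which is what the question needs when the subsequence may occur anywhere in the string.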

CodePudding user response:

Try:

ALLOWED_MISMATCHES = 3

s = "ATGCTGCTGA"
subsequence = "ATGCC"

for i in range(len(s) - len(subsequence) + 1):
    if sum(a != b for a, b in zip(s[i:], subsequence)) <= ALLOWED_MISMATCHES:
        print("Match")
        break
else:
    print("No Match")

Prints:

Match
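
Since the question tests many subsequences against many strings, the loop above may be easier to reuse as a function. The sketch below is one possible variant, not the answerer's code: the name find_with_mismatches and its return convention (first matching offset, or -1) are assumptions made for illustration. It stops counting a window as soon as the mismatch budget is exceeded. Whether that early exit actually beats the sum() version in CPython depends on the data and should be measured.

```python
def find_with_mismatches(text, pattern, max_mismatches=3):
    """Return the first offset where pattern matches text with at most
    max_mismatches substitutions (no gaps), or -1 if there is none."""
    m = len(pattern)
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # this window is already over budget
        else:
            # Inner loop finished without break: window is acceptable.
            return i
    return -1


print(find_with_mismatches("ATGCTGCTGA", "ATGCC", max_mismatches=1))  # 0
print(find_with_mismatches("ATGCTGCTGA", "CCCCC", max_mismatches=0))  # -1
```

A boolean check is then just find_with_mismatches(s, subsequence) != -1.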