Home > Net >  Efficient way to check if string A is contained in string B with at most k errors
Efficient way to check if string A is contained in string B with at most k errors

Time:05-01

Given a string A and a string B (A shorter or the same length as B), I would like to check whether B contains a substring A' such that the Hamming distance between A and A' is at most k.

Does anyone know of an efficient algorithm to do this? Obviously I can just run a sliding window, but this is not feasible for the amount of data I'm working with. The Knuth-Morris-Pratt algorithm (https://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm) would work when k=0, but I don't know whether it's modifiable to account for k>0.

Thanks!

CodePudding user response:

This is what you are looking for : https://en.wikipedia.org/wiki/Levenshtein_distance

CodePudding user response:

If you use the Levenshtein distance and k=1, then you can use the fact that if the length of A is 2n 1 or 2n 2, then either the first or the last n characters of A must be in B.

So you can use strstr to find all places in B where the first or last n characters match exactly and then check the Levenshtein distance.

Special case A = 1 characters: matches everywhere with one error. Special case where A = 2 characters ab: call strchr(a), if it fails call strchr(b).

  • Related