I'm looking for an algorithm for finding all strings in a set of substrings that are contained within a string.
There are a lot of algorithms for this obviously, but in my case most of the time I expect none of the substrings to be found within the string.
So I need an algorithm that is very fast in the case that none of the substrings are contained, even at the cost of the case where they are (since there is additional code at that point anyway).
If it helps, the size of the full strings are usually between 500 bytes - 2 kilboytes, and the size of the substrings are a few bytes ( examples: "%sub", "hh'", "Tracing"), and there could be 5-20 of them.
I'm writing in C# so I'd prefer an answer in that, but even just an abstract algorithm will do.
CodePudding user response:
Consider using Aho–Corasick algorithm. It is intended for the case of muliple patterns and has linear complexity (plus number of matches, low in your case).
CodePudding user response:
If you perform multiple search of a few short strings, I guess that a trie is ideal for you. https://en.wikipedia.org/wiki/Trie
As a secondary optimization, in case of a failure the trie could tell you how many characters you can skip to continue the search. (But I have no idea how to build the skip table.)