Implementations for Pattern/String mining using Suffix Arrays/Trees-CodePudding

I am trying to solve a pattern mining problem for strings and I think that suffix trees or arrays might be a good option to solve this problem.

I will quickly outline the problem:

I have a set strings of different lengths (quotation are just to mark repetitions for the explanation):

C"BECB"ECCECCECEEB"BECB"FCCECCECCECCECCFCBFCBFCC
DCBBDCDDCCECBCEC"BECB""BECB"BECB"BECB"BECB"EDB"BECB""BECB"BEC
etc.

I now would like to find repeated patterns within each string and repeated patterns that are common between the strings. A repeated pattern in string (1) would be "BECB". Moreover the pattern "BECB" is also repeated in string (2). Of course there are several criteria that need to be decided on as for example the minimum number of repetitions or the minimum number of symbols in a pattern.

From the book by Dan Gusfield "Algorithms on Strings, Trees and Sequences" I know that it is possible to find such repeats (e.g. maximal pairs, maximal repetitive structures etc.) using certain algorithms and a suffix tree data structure. This comes in handy as I would like to use probabilistic suffix trees to also calculate some predictions on these sequences. (But this is not the topic of this post.)

Unfortunately I can't find any implementations of these algorithms. Hence, I am wondering if I am even on the right path using suffix trees to solve the mentioned problem. It seems very strange to me that for such a well described problem no packages are available (in R or Python for example).

Are there any alternatives (with existing packages) that solve my problem?

Or do you know any implementation of algorithms for suffix trees?

CodePudding user response：

Here is an implementation in C that follows the approach of Dan Gusfield's book : https://cp-algorithms.com/string/suffix-tree-ukkonen.html

You could rewrite it in python. Such algorithms are quite specialised for high performance applications, so it's quite normal that they don't appear in any standard library; nonetheless they are still well known so you can usually find good implementations on the net.

CodePudding user response：

Both suffix trees and suffix arrays are good data structures to help in solving the kinds of problems you want to solve.

Building a (multi-string) suffix tree in python is probably not a good idea -- it involves a lot of operations on individual characters and the resulting data structure consumes a lot of memory unless you spend a lot of code avoiding that.

Building a suffix array in python is more approachable, and the resulting data structure (probably just an array of integers) is reasonably compact.

It doesn't seem too difficult to find suffix array code in python on the web, and there's a nice explanation here: https://louisabraham.github.io/articles/suffix-arrays

It would be more difficult to find one that already supports multiple strings, so you would have to decide how you want to do that. In any case, the prefix doubling algorithm is easy to implement if you leverage the standard built-in sort(), and in python that will probably produce the fastest result.