How would I go about finding the most common substring in a file-CodePudding

To preface, I am attempting to create my own compression method, wherein I do not care about speed, so lots of iterations over large files is plausible. However, I am wondering if there is any method to get the most common substrings of length of 2 or more (3 most likely), as any larger would not be plausible. I am wondering if you can do this without splitting, or anything like that, no tables, just search the string. Thanks.

CodePudding user response：

You probably want to use something like collections.Counter to associate each substring with a count, e.g.:

>>> data = "the quick brown fox jumps over the lazy dog"
>>> c = collections.Counter(data[i:i 2] for i in range(len(data)-2))
>>> max(c, key=c.get)
'th'
>>> c = collections.Counter(data[i:i 3] for i in range(len(data)-3))
>>> max(c, key=c.get)
'the'