Home > Blockchain >  How would I go about finding the most common substring in a file
How would I go about finding the most common substring in a file

Time:04-16

To preface, I am attempting to create my own compression method, wherein I do not care about speed, so lots of iterations over large files is plausible. However, I am wondering if there is any method to get the most common substrings of length of 2 or more (3 most likely), as any larger would not be plausible. I am wondering if you can do this without splitting, or anything like that, no tables, just search the string. Thanks.

CodePudding user response:

You probably want to use something like collections.Counter to associate each substring with a count, e.g.:

>>> data = "the quick brown fox jumps over the lazy dog"
>>> c = collections.Counter(data[i:i 2] for i in range(len(data)-2))
>>> max(c, key=c.get)
'th'
>>> c = collections.Counter(data[i:i 3] for i in range(len(data)-3))
>>> max(c, key=c.get)
'the'
  • Related