To preface, I am attempting to create my own compression method, wherein I do not care about speed, so lots of iterations over large files is plausible. However, I am wondering if there is any method to get the most common substrings of length of 2 or more (3 most likely), as any larger would not be plausible. I am wondering if you can do this without splitting, or anything like that, no tables, just search the string. Thanks.
CodePudding user response:
You probably want to use something like collections.Counter
to associate each substring with a count, e.g.:
>>> data = "the quick brown fox jumps over the lazy dog"
>>> c = collections.Counter(data[i:i 2] for i in range(len(data)-2))
>>> max(c, key=c.get)
'th'
>>> c = collections.Counter(data[i:i 3] for i in range(len(data)-3))
>>> max(c, key=c.get)
'the'