I have a text file that's 664KB in size (according to ls -l). To my understanding, a file cannot be compressed to anything smaller than the Shannon source coding limit without incurring information loss. I used a program (linked here) to calculate the average Shannon entropy of the text file (4.36 bits per character), multiplied it by the number of characters, and converted bits to bytes, which gives about 371KB.
Then I used bzip2, which to my understanding is lossless, and found that it compressed the file to 171KB. From what I understand, nothing can be compressed below the Shannon limit without losing information, so how can bzip2 losslessly compress the file to a size smaller than that limit? Am I missing something about how the OS encodes the file, perhaps?
The text file I used for this experiment is the MIT Classics archive's copy of The Republic by Plato.
The program I used to calculate the Shannon entropy is this one. It gave me the same result as another program I used to cross-check it.
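For reference, here is a minimal sketch of the order-0 entropy calculation described above (the filename is a placeholder, not the actual file used): it computes the per-character Shannon entropy from symbol frequencies and converts it to an implied size in bytes.

```python
import math
from collections import Counter

# Read the file as raw bytes; "republic.txt" is a placeholder name.
with open("republic.txt", "rb") as f:
    data = f.read()

counts = Counter(data)
n = len(data)

# Order-0 Shannon entropy in bits per symbol: H = -sum(p * log2(p)).
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

# Implied size if every symbol were coded independently at H bits each.
print(f"entropy: {entropy:.2f} bits/char")
print(f"order-0 estimate: {entropy * n / 8 / 1024:.0f} KiB (actual file: {n / 1024:.0f} KiB)")
```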
CodePudding user response:
It is true that, in general, we cannot compress better than the Shannon entropy (assuming no loss), and it is true that zip-style encodings are lossless.
However, a few points must be considered.
Shannon entropy (unlike some other notions of entropy) is defined with respect to a statistical model of the information.
In some particular cases (data that are not totally random and obey certain rules), it may happen that no statistical model can perfectly capture all of the a priori knowledge we have.
However, that is not the most important issue here. Looking at the code you used, the only statistical information it considers is the frequency of each character. This implicitly assumes that there is no correlation between the characters.
That is a very restrictive assumption, and certainly not a valid one for a text file.
Your compression algorithm is clearly able to exploit the correlation between adjacent characters, as the sketch below illustrates.
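To make that concrete, here is a rough sketch (not the code from the question) that compares the order-0 entropy with an order-1 estimate, i.e. the entropy of each byte conditioned on the byte before it. For English text the conditional figure is noticeably lower, which is exactly the correlation a real compressor exploits. The filename is again a placeholder.

```python
import math
from collections import Counter

def order0_entropy(data: bytes) -> float:
    """Bits per symbol, assuming independent symbols."""
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def order1_entropy(data: bytes) -> float:
    """Bits per symbol when each byte is conditioned on the previous byte."""
    pair_counts = Counter(zip(data, data[1:]))  # counts of (previous, current) pairs
    prev_counts = Counter(data[:-1])            # counts of the conditioning byte
    n = len(data) - 1
    h = 0.0
    for (prev, _), c in pair_counts.items():
        p_pair = c / n                      # P(previous, current)
        p_cond = c / prev_counts[prev]      # P(current | previous)
        h -= p_pair * math.log2(p_cond)
    return h

# "republic.txt" is a placeholder filename.
with open("republic.txt", "rb") as f:
    data = f.read()

print(f"order-0: {order0_entropy(data):.2f} bits/char")
print(f"order-1: {order1_entropy(data):.2f} bits/char")
```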
CodePudding user response:
It all depends on the model.
The model used in your calculation is that each byte value has some independent probability. This is called "Order 0", since the probability at each byte location depends on zero preceding bytes.
A more sophisticated model would use information from the preceding bytes to generate a probability distribution for the current byte. bzip2 makes use of the context of other bytes, as do all general-purpose lossless compressors such as gzip and xz.
By the way, pigz has a -H or --huffman option which does Order 0 compression (Huffman coding only), and will get close to the Order 0 Shannon limit you are computing.