It's not clear to me what the dict does when it's used with zlib. Does anyone know what its purpose is or how it works? I've been searching Google and YouTube with very little luck learning what it does. My assumption was that it filters the inputs and outputs, but that doesn't seem to be it. It looks like it's used as some kind of key for compression and decompression. Is that correct? Any help is appreciated.
CodePudding user response:
At every point in the uncompressed data, zlib uses the previous 32K of uncompressed data in which to search for a sequence of bytes that matches the data at the current position. Much of the compression comes from coding the distance back and the length of the matching sequence, instead of the bytes themselves.
When zlib starts at the beginning of the uncompressed data, there is no previous 32K! And for the first 32K, zlib is operating at somewhat of a disadvantage, without a full 32K of history.
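To make the idea of matching against history concrete, here is a toy sketch in Python. It is not zlib's actual algorithm (zlib uses hash chains and is far faster); the function name `longest_match` is made up for illustration, and the only values taken from deflate are the 32K window and the match length limits of 3 to 258 bytes.

```python
def longest_match(data: bytes, pos: int, window: int = 32 * 1024,
                  min_len: int = 3, max_len: int = 258):
    """Brute-force search of the previous `window` bytes for the longest
    sequence matching data[pos:]. Returns (distance, length) or None."""
    best = None
    start = max(0, pos - window)
    for cand in range(start, pos):
        length = 0
        # Extend the match byte by byte; it may run past `pos` (overlap).
        while (length < max_len and pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        # ">=" keeps the nearest candidate when lengths tie.
        if length >= min_len and (best is None or length >= best[1]):
            best = (pos - cand, length)
    return best


example = b"hello world, hello there"
print(longest_match(example, 13))  # (13, 6): "hello " seen 13 bytes back
print(longest_match(example, 0))   # None: no history at the very start
```

The second call is the situation described above: at position 0 there is nothing to look back at, so nothing can be coded as a distance and length.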
Providing a dictionary gives zlib a head start by giving it a "previous" 32K of data that it doesn't have to compress. You would try to populate that dictionary with sequences of bytes that you might expect to see in the data that you're compressing.
The bargain you make with zlib is that you will provide that exact same 32K of dictionary on the decompression end, so that zlib doesn't have to include it in the compressed data. zlib will, however, encode a check value of that dictionary (its Adler-32) in the header, so that you can verify (to some extent) that you have the right dictionary at the other end, and perhaps even use that check value to select among several dictionaries that may be used.
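As a minimal sketch of that bargain, assuming Python's `zlib` module and its `zdict` argument: the dictionary and message below are made-up sample data, and the header layout (a flag bit followed by a four-byte Adler-32 of the dictionary) is the one from the zlib stream format.

```python
import zlib

# Hypothetical dictionary: byte sequences we expect to see in the messages.
dictionary = b'{"user_id": , "status": "active", "role": "admin"}'
message = b'{"user_id": 42, "status": "active", "role": "admin"}'

# Compress with the preset dictionary.
compressor = zlib.compressobj(zdict=dictionary)
compressed = compressor.compress(message) + compressor.flush()

# Decompression must be given the exact same dictionary.
decompressor = zlib.decompressobj(zdict=dictionary)
restored = decompressor.decompress(compressed) + decompressor.flush()

# The zlib header records that a dictionary is required (bit 0x20 of the
# second byte), followed by the dictionary's Adler-32, big-endian.
has_dict_flag = bool(compressed[1] & 0x20)
dict_id = int.from_bytes(compressed[2:6], "big")

# Without the dictionary, decompression is refused.
try:
    zlib.decompressobj().decompress(compressed)
    failed_without_dict = False
except zlib.error:
    failed_without_dict = True

print(restored == message)                    # round trip succeeded?
print(has_dict_flag, dict_id == zlib.adler32(dictionary))
print(failed_without_dict)
```

Since `dict_id` can be read from the stream before decompressing, a receiver holding several dictionaries could index them by `zlib.adler32(d)` and pick the matching one.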
If you're compressing large input, this head start on the first 32K really won't make much difference. However, if you're trying to compress short sequences of bytes, and you know what to expect in those short sequences, then a dictionary can make a huge difference.
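A rough way to see this for yourself, again assuming Python's `zlib` module: the helper `compressed_size` and the sample records are invented for the comparison, and the exact byte counts will depend on your data and zlib build, so none are quoted here.

```python
import zlib
from typing import Optional


def compressed_size(data: bytes, zdict: Optional[bytes] = None) -> int:
    """Size in bytes of the zlib stream for `data`, with or without a dictionary."""
    if zdict is None:
        comp = zlib.compressobj()
    else:
        comp = zlib.compressobj(zdict=zdict)
    return len(comp.compress(data) + comp.flush())


# Hypothetical dictionary built from the structure we expect in each record.
dictionary = b'{"user_id": , "status": "active", "role": "admin"}'

# One short record versus many records concatenated into a large input.
short = b'{"user_id": 42, "status": "active", "role": "admin"}'
large = b"".join(
    b'{"user_id": %d, "status": "active", "role": "admin"}\n' % i
    for i in range(20000)
)

short_without = compressed_size(short)
short_with = compressed_size(short, dictionary)
large_without = compressed_size(large)
large_with = compressed_size(large, dictionary)

print("short:", short_without, "->", short_with)
print("large:", large_without, "->", large_with)
```

On the short record the dictionary should remove a large share of the output, while on the large input the saving is a handful of bytes out of many thousands, because after the first record the data itself serves as the history.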