I am trying to decompress a raw DEFLATE data stream in order to understand, how the DEFLATE decompression works. I don't care for performance at the moment. I just want to understand the decompression algorithm in a detailed way so I searched online for examples and stumbled upon the program puff.c
by Mark Adler.
I want to decompress this piece of raw DEFLATE data:
05 C1 01 01 00 00 08 C3 20 6E FF CE 13 44 D2 4D C2 03
It is a single block compressed with dynamic huffman codes. RFC-1951 really gives a good overview about the overall structure of the block. After the block header, there are 4-19 3-bit integers, that define the code lengths for the code length alphabet; after that there are the code lengths for the literal/length and the distance alphabet and finally there is the actual compressed data. So far so good...
I took a look at puff.c
to see how an implementation of the huffman decoding process should look like.
In the function construct
(lines 340-379), the symbol table for an alphabet is created which is then used in the decoding proccess. In the function decode
(lines 235-255) a single symbol is decoded using the symbol table and the table of code length frequencies. This function gives my headaches. I do not understand the relationship between the symbols, the code length frequencies and the actual input bit stream. Also the check
if (code - count < first) /* if length len, return symbol */
on line 247 is pure witchcraft to me. I searched the internet for detailed explanations of the decoding process but I found nothing that explains it so deeply.
Could you please provide an explanation of the process/the decode function?
Here is a link to puff.c
in case you want to take a look.
CodePudding user response:
The key is to understand that the Huffman codes used in deflate are canonical, which means that the sequence of zeros and ones assigned to the symbols are determined entirely just from the bit lengths of the codes and an ordering of the symbols. The ordering is specified in the RFC, where, for example, literal/length codes start with the literal bytes in order from 0 to 255, then a single end-of-block code, and then the length codes in order from shortest to longest. For any given, say, literal/length code described in a dynamic header, usually only a subset of those symbols will actually be used in the block and have codes defined for them.
The convention in deflate is to assign the first ordered symbol of the shortest code to be all zeros. Then the code is incremented as an integer (that's key), for the remaining symbols of that length, in order. To step up to the next length, a zero bit is appended to the result of incrementing after the last symbol of the previous length (i.e., it is doubled).
For decoding, all I need is a) a list of the symbols that are coded sorted in order of bit length, and within each bit length, sorted by their symbol order, and b) the number of symbols for each bit length, which has to add up to the number of symbols in the list.
The decoding scheme used in puff.c is to start with the shortest bit length in the list of bit lengths, and get that many bits. Now I check to see if the value of those bits, as an integer, is less than or equal to the last code of that length. If so, I have all of the bits of the current code, and I can return the corresponding symbol by indexing into the list of symbols in a). If the value of the integer is greater than the last code, then I know I need more bits for this code. I append one more bit from the stream to the integer, and I repeat the process for the next code length.
This is complicated a little by the fact that the bits are stored in reverse order in the stream. I need to flip the order as I build up the integer from successive bits, so that I can interpret the result as an integer that can be compared.
So, if (code - count < first)
can be read as if (code < first count)
, or if (code <= first count - 1)
, which is whether code
is less than or equal to the last code of that length. If so, we have the code, and the corresponding symbol is h->symbol[index (code - first)]
, where index
is the position in the list of symbols of the first symbol of the current length.