Python - Including frequency table at the beginning of a Huffman-compressed file-CodePudding

I am trying to implement Huffman compression and decompression of files, where all the information needed to decompress must be included in the compressed file. For this implementation I want to include the frequency table in the compressed file, such that the decompression program can rebuild the Huffman codes from this frequency table and then decompress the file. The frequency table looks something like this, where each index maps to an ASCII-character's decimal representation:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 847, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4183, 13, 0, 0, 0, 6, 0, 0, 26, 26, 0, 107, 84, 598, 124, 36, 72, 66, 42, 21, 8, 16, 9, 11, 10, 10, 46, 0, 0, 7, 0, 3, 0, 21, 30, 4, 20, 19, 30, 5, 34, 35, 0, 9, 19, 15, 7, 10, 9, 0, 8, 15, 19, 1, 9, 8, 2, 1, 8, 24, 29, 24, 23, 8, 0, 439, 189, 40, 252, 1514, 226, 241, 82, 462, 62, 353, 346, 306, 521, 436, 212, 0, 977, 512, 663, 100, 176, 24, 10, 53, 9, 23, 374, 23, 2, 0, 197, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 65, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 90, 0, 124, 0, 0, 75, 14, 0, 0, 49, 0, 33, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 66, 0, 0, 34, 0, 0, 0, 0, 0, 0, 157, 154, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 49, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 200, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

I.e., index 32 of the list is 4183, which tells me that SPACE (ASCII# 32) appears 4183 times in the compressed file.

I also have code in place to create the Huffman codes and convert each character into its Huffman code and append it to a long bitstring. The following code is functional, and it converts the bitstring into a byte-array and saves it as a binary-file:

byte_array = bytearray()
for i in range(0, len(bitstring), 8):
    byte = bitstring[i:i   8]
    byte_array.append(int(byte, 2))

with open(output_file_path, "wb") as compressed_file:
    compressed_file.write(bytes(byte_array))

The resulting binary file is compressed from 17 KB to 10 KB successfully.

My problem is trying to include the frequency table at the beginning of this compressed file. I have tried several solutions but I run into problems and feel quite stuck.

Is there a simple way to include a frequency table such as above to the beginning of a compressed file in Python? Any tips for methods or functions that can be used to achieve this would be greatly appreciated.

I would want to achieve this with the frequency-table as-is and not using a Canonical Huffman code. And again, the compressed file alone and no other information must be enough to decompress the file without loss.

I have tried several function and methods that I find, but I am quite new to working with bytes and every method I have tried, such as converting the list to a bytearray, have failed. Because the list includes integers > 255 it won't convert to a byte-array like the bitstring does.

CodePudding user response：

No, you do not want to include the frequency table in your compressed data. You are trying to compress, so you want to use as few bits as possible to provide the information required to decompress. Sending the frequency table is the worst way to do that. The frequency table contains extraneous information that is not needed to reconstruct the Huffman codes. Many, many different frequency tables will produce the same Huffman code.

You instead want to send a representation of the Huffman code that was computed from the frequency table. Two of the most common ways are to send the tree, or to send the code lengths.

You can send the Huffman tree very easily by simply traversing the tree recursively, as you must have done to create the Huffman codes, and sending a 0 bit for each node encountered, and a 1 bit followed by eight bits for the symbol encoded for each leaf encountered. That's it. Nothing could be easier. Then you can reconstruct the tree directly on the other end with recursion, and use the tree for decoding. This tree representation is self-terminating, and so is immediately followed by the codes for your data.

In your example, you are encoding 100 different symbol. Then the tree will have 99 nodes and 100 leaves, and so will take 99 900 = 999 bits. For comparison, your frequency table, if represented as two bytes per frequency, would take 4096 bits. Or if four bytes per frequency as shown in another answer here, then 8192 bits! I could get fancy with encoding up to frequency 127 with one byte and higher with two bytes and get it down to 2148 bits. Still more than double 999 bits.

Though you exclude it, one could do better still by using a Canonical Huffman code, where you build the code only from the code lengths for each symbol, not from the tree. Then you can just send the code lengths and that same build process followed on the decoding end. You would then use Huffman coding on those lengths, preceding it with a very small representation of that Huffman code. This is what is done in Deflate compression. Deflate represents the code from your example in 608 bits.

CodePudding user response：

You can write your frequency table at the beginning of your binary file, converting the integers to bytes:

FREQ_TABLE_LEN = 256

def write_frequency_table(f, table):
    assert len(table) == FREQ_TABLE_LEN
    for e in table:
        f.write(e.to_bytes(4, byteorder='little', signed=False))

def read_frequency_table(f):
    read_table = []
    for _ in range(FREQ_TABLE_LEN):
        data = f.read(4)
        number = int.from_bytes(data, 'little', signed=False)
        read_table.append(number)
    return read_table

Here is an example of how can you use the previous code:

with open('compressed_file.bin', 'wb') as f:
    write_frequency_table(f, freq_table)  # freq_table is the list of integers in your question
    # write the real content of your file here


with open('compressed_file.bin', 'rb') as f:
    freq_table = read_frequency_table(f)
    # read the rest of your file