Huffman: Can't decompress characters like 'æ', 'ø', 'å' and '

Time:11-09

I'm working on my Huffman compression (or rather decompression, at this point), and I can't get characters like 'æ', 'ø', 'å' and '•' to decompress correctly. The character 'æ' is decompressed to two symbols, 'ᅢᆭ'. Any idea what I should be doing?

EDIT: I think it might have to do with the BufferedWriter and the InputStream (and the others). I probably need to read and write in UTF-8 or something? How do I do that?

EDIT 2: With the help of some helplines I've found that 'ᅢ' and 'ᆭ' are written to the file as individual characters. Is 'ø' larger than 1 byte, and have I maybe assumed somewhere that every character is 1 byte?

public static void decompressFile() throws IOException {

    
    byte[] compressedBytes = //somecode
    int[] frequencyTable = //somecode

    HuffmanNode root = //some code

    //Generating code table
    String[] codeTable = new String[256];
    Huffman.getCodeTable(codeTable, root, "");

    DataInputStream inputStream = new DataInputStream(new BufferedInputStream(new FileInputStream("[//thecompressedfile]")));
    BitInputStream bitInputStream = new BitInputStream(inputStream, compressedBytes.length);

    BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(newFileName));


    HuffmanNode node = root;
    int bit;


    while ((bit = bitInputStream.readBit()) != -1) {
        //int bit = bitInputStream.readBit();
        System.out.print(bit + "");

        if (bit == 0) {
            node = node.getLeft();
            if (node.isLeaf()) {
                bufferedWriter.write(node.getAByte());
                node = root;
            }
        } else if (bit == 1) {
            node = node.getRight();
            if (node.isLeaf()) {
                bufferedWriter.write(node.getAByte());
                node = root;
            }
        }
    }

    bufferedWriter.close();
}

CodePudding user response:

You read using an InputStream and write using a Writer. The first is for reading binary data; the second is for writing text. You are doing an implicit conversion when you call bufferedWriter.write(node.getAByte()).

In other words, you're interpreting the binary data as ISO-8859-1, because you're basically casting a byte to a char (technically an int for ... weird reasons). Then you're writing it back with whatever the platform default encoding is.

This will mess up your text, unless it happens to be ISO-8859-1 encoded and the platform default encoding is the same.
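To make the conversion concrete, here is a minimal sketch. It assumes the original file was UTF-8 and that getAByte() returns a signed Java byte; neither is stated in the question, though the 'ᅢᆭ' output suggests both. ByteVsCharDemo and writeBytesAsChars are hypothetical names used only for illustration. In UTF-8, 'æ' is stored as the two bytes 0xC3 0xA6, and Writer.write(int) keeps the low 16 bits of its argument as a single char, so each byte becomes its own (wrong) character.

```java
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

class ByteVsCharDemo {
    // Hypothetical helper: pushes each raw byte through Writer.write(int),
    // the way the decompression loop in the question does.
    static String writeBytesAsChars(byte[] bytes) {
        StringWriter writer = new StringWriter();
        for (byte b : bytes) {
            // The byte is widened to an int (sign-extended), and the Writer
            // then treats the low 16 bits of that int as one char.
            writer.write(b);
        }
        return writer.toString();
    }

    public static void main(String[] args) {
        // "\u00e6" is 'æ'; the escape avoids depending on the source file encoding.
        byte[] utf8 = "\u00e6".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 2: 'æ' takes two bytes in UTF-8

        String mangled = writeBytesAsChars(utf8);
        for (char c : mangled.toCharArray()) {
            // Each byte has turned into a separate char.
            System.out.println(Integer.toHexString(c));
        }

        // Decoding the two bytes together as UTF-8 recovers the single character.
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(decoded.length()); // 1
    }
}
```

So yes, 'ø' and friends are larger than one byte in UTF-8, which is fine for Huffman coding as long as the bytes are written back out unchanged.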

A better approach would be to simply treat it as binary data throughout (it's fine if it's really text, as long as you don't care about interpreting the text in your code, which you don't seem to do). Since Huffman coding acts on byte streams, that also more closely matches what you're doing with the data.

To do so, replace the Writer with an OutputStream (e.g. a FileOutputStream, possibly wrapped in a BufferedOutputStream for performance reasons).
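A rough sketch of what that could look like, not tied to your actual classes: BinaryOutputDemo, writeDecodedBytes and roundTrips are hypothetical names, and the byte array stands in for the symbols your tree walk would produce via node.getAByte(). The point is only that bytes written through an OutputStream come back unchanged, whatever text encoding they happen to represent.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class BinaryOutputDemo {
    // Stand-in for the decompression loop: every decoded symbol is written
    // as one raw byte, with no character encoding involved.
    static void writeDecodedBytes(Path target, byte[] decodedSymbols) throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(target.toFile()))) {
            for (byte symbol : decodedSymbols) {
                out.write(symbol); // OutputStream.write(int) keeps only the low 8 bits
            }
        }
    }

    // Writes the UTF-8 bytes of the text to a temp file and checks that
    // reading the file back yields exactly the same bytes.
    static boolean roundTrips(String text) {
        byte[] original = text.getBytes(StandardCharsets.UTF_8);
        try {
            Path tmp = Files.createTempFile("huffman-demo", ".bin");
            try {
                writeDecodedBytes(tmp, original);
                return Arrays.equals(original, Files.readAllBytes(tmp));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 'æ', 'ø', 'å' and '•' written as escapes to avoid source-encoding issues.
        System.out.println(roundTrips("\u00e6\u00f8\u00e5\u2022"));
    }
}
```

In your decompressFile() the equivalent change would be swapping the BufferedWriter for a BufferedOutputStream and leaving the write(node.getAByte()) calls as they are.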
