Not a gzip format for an obvious gzip text in Java

I have been trying to decompress text that was compressed in GZIP format. Below is the method I implemented:

private byte[] decompress(String compressed) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ByteArrayInputStream in = new 
        ByteArrayInputStream(compressed.getBytes(StandardCharsets.UTF_8));
    GZIPInputStream ungzip = new GZIPInputStream(in);
    byte[] buffer = new byte[256];
    int n;
    while ((n = ungzip.read(buffer)) >= 0) {
        out.write(buffer, 0, n);
    }
    return out.toByteArray();
}

And now I am testing the solution with the following compressed text:

H4sIAAAAAAAACjM0MjYxBQAcOvXLBQAAAA==

This throws a "Not in GZIP format" exception. I have tried different approaches, but the error persists. Does anyone have an idea what I am doing wrong?

CodePudding user response：

That string is not gzip-formatted data. In general, compressed data cannot be a string: compressed data is bytes, and a string isn't bytes. (Some languages, tutorials, and 1980s thinking conflate the two, but it's the 2020s and we don't do that anymore. There are more characters than those used in English.)
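A quick way to check what a blob of bytes really is: gzip data always starts with the two magic bytes 0x1f 0x8b. The sketch below tests the sample string from the question twice, once as raw UTF-8 text (what the question's method feeds to GZIPInputStream) and once after Base64 decoding. The class and method names are made up for illustration, and treating the text as Base64 is an assumption at this point, suggested by its alphabet and the trailing ==.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class MagicBytesCheck {
    // True if the data starts with the two gzip magic bytes, 0x1f 0x8b (RFC 1952).
    static boolean looksLikeGzip(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) {
        String text = "H4sIAAAAAAAACjM0MjYxBQAcOvXLBQAAAA==";

        // The characters themselves, encoded as UTF-8: these start with 'H', '4'.
        byte[] asText = text.getBytes(StandardCharsets.UTF_8);
        // The bytes the characters stand for, assuming the text is Base64.
        byte[] decoded = Base64.getDecoder().decode(text);

        System.out.println(looksLikeGzip(asText));  // prints false
        System.out.println(looksLikeGzip(decoded)); // prints true
    }
}
```

The first check fails because 'H' is 0x48, not 0x1f, which is exactly the condition GZIPInputStream complains about.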

It looks like perhaps the following has occurred:

Someone has some data.
They gzipped it.
They then turned the gzipped stream (which is bytes) into characters using Base64 encoding.
They sent it to you.
You now want to get back to the data.

Given that 2 transformations occurred (first gzip, then base64), you also need to apply 2 transformations, in reverse order. You need to:

Take the input string and de-base64 it, giving you bytes.
Take these bytes and decompress them.
Now you have the original data back.

Thus:

byte[] gzipped = java.util.Base64.getDecoder().decode(compressed);
// try-with-resources closes the stream even if decompression fails
try (var in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
    return in.readAllBytes();
}
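For completeness, here is a self-contained sketch of the same two-step reversal, run against the sample string from the question. The class name GzipBase64Demo is hypothetical, and printing the result as UTF-8 text is an assumption about what the sender originally compressed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;

class GzipBase64Demo {
    // Reverses the two transformations: Base64 text -> gzip bytes -> original bytes.
    static byte[] decompress(String compressed) throws IOException {
        byte[] gzipped = Base64.getDecoder().decode(compressed);
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            return in.readAllBytes(); // requires Java 9 or later
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = decompress("H4sIAAAAAAAACjM0MjYxBQAcOvXLBQAAAA==");
        // Assumption: the sender's data was UTF-8 text.
        System.out.println(new String(original, StandardCharsets.UTF_8)); // prints 12345
    }
}
```

If the input is not valid Base64, decode throws IllegalArgumentException rather than an IOException, so callers may want to handle both.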

Note:

Copying the data from the input stream to an output stream by hand, as the method in the question does, is a waste of resources and a bunch of finicky code. There is no need to write this; just call readAllBytes (available on InputStream since Java 9).

If the incoming Base64 is large, there are ways to do this in a streaming fashion. That would require this method to take a Reader (instead of a String, which cannot be streamed) and to return an InputStream instead of a byte[]. Of course, if the input is not particularly large, there is no need. The above approach is somewhat wasteful: the base64-ed data, the un-base64-ed data, and the decompressed data are all in memory at the same time. You can't avoid this, nor can the garbage collector collect any of it in between (most likely because the caller still holds a reference to the base64-ed string).
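A minimal sketch of the streaming idea, under one simplifying assumption: the Base64 text arrives as an InputStream of ASCII bytes rather than as a Reader, because the JDK's Base64.Decoder.wrap accepts an InputStream. The class name StreamingGunzip and the method name open are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;

class StreamingGunzip {
    // Wraps a stream of Base64 characters (as ASCII bytes) so that reading from
    // the result yields the decompressed data, decoded lazily as it is consumed.
    static InputStream open(InputStream base64Source) throws IOException {
        InputStream gzipped = Base64.getDecoder().wrap(base64Source);
        return new GZIPInputStream(gzipped); // reads the gzip header right away
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a large source such as a socket or file stream.
        byte[] base64 = "H4sIAAAAAAAACjM0MjYxBQAcOvXLBQAAAA=="
                .getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = open(new ByteArrayInputStream(base64))) {
            in.transferTo(System.out); // Java 9+; copies chunk by chunk
            System.out.flush();
        }
    }
}
```

With this shape, only small internal buffers are held at any time, so memory use no longer grows with the size of the input.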

In other words, if the compression ratio is, say, 50%, and the total uncompressed data is 100MB in size, this method takes MORE than:

100MB (uncompressed) + 50MB (compressed) + 50*4/3 = 67MB (compressed but base64ed) = ~217MB of memory.
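The same arithmetic as a small sketch, in case you want to plug in your own numbers. The names are hypothetical, and the estimate assumes one byte per Base64 character, so a JVM that stores strings as UTF-16 would need more for the Base64 text.

```java
class MemoryEstimate {
    // Length of padded Base64 text for n input bytes: 4 characters per 3 bytes, rounded up.
    static long base64Length(long n) {
        return 4 * ((n + 2) / 3);
    }

    // Lower bound on bytes held at once: decompressed output + gzip bytes + Base64 text.
    // Assumes one byte per Base64 character; buffers and object overhead are ignored.
    static long lowerBound(long uncompressedBytes, double compressionRatio) {
        long compressed = Math.round(uncompressedBytes * compressionRatio);
        return uncompressedBytes + compressed + base64Length(compressed);
    }

    public static void main(String[] args) {
        // 100MB of data at a 50% compression ratio.
        System.out.println(lowerBound(100_000_000L, 0.5)); // prints 216666668
    }
}
```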

You know better than we do how much heap your VM is running on, and how large the input data is likely to ever get.

NB: Base64 transfer is extremely inefficient, taking 4 bytes of base64 content for every 3 bytes of input data, and if the data transfer is in UTF-16, it's 8 bytes per 3, even. Ouch. Given that the content was gzipped, this feels a bit daft: first we painstakingly reduce the size of this thing, and then we casually inflate it by 33% for probably no good reason. You may want to check the 'pipe' that leads you to this; possibly you can just... eliminate the base64 aspect of this.

For example, if you have a wire protocol and someone decided that JSON was a good idea, then.. simply.. don't. JSON is not a good idea if you have the need to transfer a bunch of raw data. Use protobuf, or send a combination of JSON and blobs, etc.