Is GZIP compression output stable?


I need to store some chunks of data remotely and compare them to see if there are duplicates. I will compile a specific C program, and I would like to compress these chunks with GZIP.

My question is: if I compress the same chunk of data with the same C program using a gzip library on different computers, will it give the exact same result or could it give different compressed results?

The target PCs/servers may run different Linux distributions, such as Ubuntu, CentOS, Debian, etc.

Can I force the same result by statically linking a specific gzip library?

CodePudding user response:

if I compress the same chunk of data with the same C program using a gzip library on different computers, will it give the exact same result or could it give different compressed results?

While it may be true in the majority of cases, I don't think you can safely make this assumption. The compressed output can differ depending on the default compression level and the coding used by the library. For example, the GNU gzip tool uses LZ77, while OpenBSD's gzip is based on compress (according to Wikipedia). I don't know whether this difference comes from different libraries or from different configurations of the same library, but in any case I would avoid assuming that a given chunk of data compresses to exactly the same bytes under different implementations.

Can I force the same result by statically linking a specific gzip library?

Yes, this could be a solution. Using the same version of the same library, with the same configuration (compression level and any other options), across different systems should give you the same compressed output.
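
As a concrete illustration, here is a minimal sketch using zlib (the most common gzip library for C; this assumes zlib is what you would link). Two details matter for byte-identical output: pass every parameter to deflateInit2() explicitly instead of relying on defaults, and pin the gzip header fields with deflateSetHeader(), because the gzip wrapper contains an OS byte (and can carry a timestamp and file name) that would otherwise vary by platform. The function name `gzip_deterministic` is made up for this example:

```c
/* deterministic_gzip.c — compress a buffer to gzip format with every
 * variable header field pinned, so the same input and the same zlib
 * version always produce the same bytes.
 *
 * Build: cc deterministic_gzip.c -lz
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Returns the compressed size, or -1 on error. `out` must be at least
 * deflateBound() bytes for the single-shot deflate below to succeed. */
static long gzip_deterministic(const unsigned char *in, size_t in_len,
                               unsigned char *out, size_t out_cap)
{
    z_stream strm;
    memset(&strm, 0, sizeof strm);

    /* windowBits = 15 + 16 asks zlib for a gzip (not zlib) wrapper.
     * Pin level, memLevel and strategy explicitly: relying on defaults
     * is exactly what could change between library builds. */
    if (deflateInit2(&strm, 6, Z_DEFLATED, 15 + 16, 8,
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;

    /* Fix the gzip header fields that normally vary by machine. */
    gz_header head;
    memset(&head, 0, sizeof head);
    head.time = 0;   /* no modification time */
    head.os   = 255; /* 255 = "unknown" per RFC 1952 */
    if (deflateSetHeader(&strm, &head) != Z_OK) {
        deflateEnd(&strm);
        return -1;
    }

    strm.next_in   = (Bytef *)in;
    strm.avail_in  = (uInt)in_len;
    strm.next_out  = out;
    strm.avail_out = (uInt)out_cap;

    int rc = deflate(&strm, Z_FINISH); /* single shot: out_cap is large enough */
    long n = (rc == Z_STREAM_END) ? (long)strm.total_out : -1;
    deflateEnd(&strm);
    return n;
}

int main(void)
{
    const unsigned char data[] = "the same chunk of data";
    unsigned char a[256], b[256];

    long na = gzip_deterministic(data, sizeof data, a, sizeof a);
    long nb = gzip_deterministic(data, sizeof data, b, sizeof b);

    if (na > 0 && na == nb && memcmp(a, b, (size_t)na) == 0)
        puts("byte-identical output");
    return 0;
}
```

Note that this pins the output for a given zlib build. zlib's output has historically been very stable across versions, but the gzip format only guarantees that the decompressed data is identical, not the compressed bytes, so statically linking one specific zlib version (as you suggest) is the safest way to keep it fixed.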


You could also avoid this problem in other ways:

  1. Perform the compression on the server, and only send uncompressed data (this is probably not a good solution as sending uncompressed data is slow).
  2. Use hashes of the uncompressed data: store them on the server and have the client send a hash first, followed by the compressed data only if the server says the hash doesn't match (i.e. the chunk is not a duplicate). This also has the advantage of only needing to check the hash (and avoiding compression altogether when the hash matches).
  3. Similar to option 2, use hashes of the uncompressed data, but always send compressed data to the server. The server then decompresses the chunk (which can easily be done in memory with a relatively small buffer) and hashes the uncompressed data to check whether the received chunk is a duplicate before storing it (see the sketch after this list).
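
To make options 2 and 3 concrete, here is a sketch of the server-side check from option 3: it streams a received gzip chunk through zlib's inflate() with a small fixed buffer and feeds the uncompressed bytes into a SHA-256 digest (using OpenSSL's EVP API here; any cryptographic hash library would do). The resulting digest is what you would compare against the hashes already stored on the server. The function name `gzip_chunk_sha256` is illustrative, not from any existing API:

```c
/* dedup_check.c — sketch of option 3: inflate a received gzip chunk in
 * memory and hash the *uncompressed* bytes, so duplicate detection does
 * not depend on which compressor produced the chunk.
 *
 * Build: cc dedup_check.c -lz -lcrypto
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>
#include <openssl/evp.h>

/* Streams the gzip data through inflate() with a small fixed buffer,
 * feeding each decompressed piece into a SHA-256 context.
 * Returns 0 on success and fills md[32]. */
static int gzip_chunk_sha256(const unsigned char *gz, size_t gz_len,
                             unsigned char md[32])
{
    z_stream strm;
    memset(&strm, 0, sizeof strm);
    if (inflateInit2(&strm, 15 + 16) != Z_OK) /* 15 + 16: expect a gzip wrapper */
        return -1;

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    if (ctx == NULL || EVP_DigestInit_ex(ctx, EVP_sha256(), NULL) != 1) {
        if (ctx != NULL)
            EVP_MD_CTX_free(ctx);
        inflateEnd(&strm);
        return -1;
    }

    unsigned char buf[4096]; /* the "relatively small buffer" */
    strm.next_in  = (Bytef *)gz;
    strm.avail_in = (uInt)gz_len;

    int rc;
    do {
        strm.next_out  = buf;
        strm.avail_out = sizeof buf;
        rc = inflate(&strm, Z_NO_FLUSH);
        if (rc != Z_OK && rc != Z_STREAM_END)
            break; /* corrupt or truncated input */
        EVP_DigestUpdate(ctx, buf, sizeof buf - strm.avail_out);
    } while (rc != Z_STREAM_END);

    inflateEnd(&strm);
    unsigned int md_len = 0;
    int ok = (rc == Z_STREAM_END) &&
             EVP_DigestFinal_ex(ctx, md, &md_len) == 1;
    EVP_MD_CTX_free(ctx);
    return ok ? 0 : -1;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s chunk.gz\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (f == NULL)
        return 1;
    static unsigned char gz[1 << 20]; /* enough for one chunk in this demo */
    size_t n = fread(gz, 1, sizeof gz, f);
    fclose(f);

    unsigned char md[32];
    if (gzip_chunk_sha256(gz, n, md) != 0)
        return 1;
    for (int i = 0; i < 32; i++)
        printf("%02x", md[i]); /* this hex digest is the dedup key */
    putchar('\n');
    return 0;
}
```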

CodePudding user response:

Can I force the same result by statically linking a specific gzip library?

That's not enough: you also need the same compression level at the very least, as well as any other options your particular library might expose (usually it's just the level).

If you use the same version of the library and the same compression level, then it's likely that the output will be identical (or stable, as you call it). That's not a very strong guarantee, however. I'd recommend using a hash function instead; comparing data for equality is exactly what they're meant for.
