Get gzip decompressed file size as fast as gunzip (no seek)

As some StackOverflow answers show, you can get the exact decompressed size of a gzip file with decompressedSize = gzipFile.seek(0, io.SEEK_END). For files smaller than 4 GiB, some people also suggest .seek(-4, 1). However, because this seeks through the file to the end, it is very time-consuming for bigger files (for approximately 1 GiB of decompressed data, it took a few seconds to seek to the end).

I then tried gunzip -l somefile.gz on the same file, and it managed to output both the compressed file size and the decompressed file size instantly.

How can I get the decompressed size of a gzip file as fast as gunzip does, if not faster?

(P.S. I want the decompressed size for a CLI progress bar shown while decompressing.)

CodePudding user response：

The uncompressed input size is stored in the last 4 bytes [1], so the advice to start at -4 was correct.

The problem, however, is that the offset is counted from the position named by the second argument. You want 4 bytes before the end of the file, not 4 bytes before the current position. Hence, 1 (SEEK_CUR) should be replaced with 2 (SEEK_END).

Once the position is set, you can read() just the last 4 bytes and then convert them to an int [2]; the byte order is little-endian.

import os

with open("yourfile", "rb") as f:
  # move the cursor to 4 bytes before the end of the file
  f.seek(-4, os.SEEK_END)

  # the last 4 bytes hold the uncompressed size, little-endian
  size = int.from_bytes(f.read(4), "little")

CodePudding user response：

gzip -l is in fact seeking to and reading the last four bytes of the file. Your comment "because it's seeking through the file till the end, it's very time consuming for bigger files" suggests that you don't understand what seeking is. Seeking is not reading the entire file until you get to the end. Seeking is moving the read pointer of the file to the desired point, and reading from there. It takes O(1) time, not O(n) time (where n is the size of the file). What is slow in your case is seek() on the gzip file object, which works on the decompressed stream and so has to decompress everything up to the target position; seeking on the raw compressed file opened with open() does not. @crissal's answer shows how to do this correctly.
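To make the difference concrete, here is a minimal sketch assuming a small, single-member gzip file. The helper names trailer_size and seek_size are made up for this illustration. The first seeks on the raw compressed file, the second seeks on the decompressed stream as in the question; both should report the same number, but only the second has to decompress the data.

```python
import gzip
import os
import struct
import tempfile


def trailer_size(path):
    """Read the 4-byte little-endian length field at the end of the raw file."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)  # only moves the file offset, no data is read
        (size,) = struct.unpack("<I", f.read(4))
    return size


def seek_size(path):
    """Seek on the decompressed stream; this decompresses the whole file."""
    with gzip.open(path, "rb") as f:
        return f.seek(0, os.SEEK_END)


# Demo on a throwaway file (hypothetical data: one member, well under 4 GiB).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gz")
    with gzip.open(demo_path, "wb") as f:
        f.write(b"hello world\n" * 1000)
    fast = trailer_size(demo_path)
    slow = seek_size(demo_path)
    print(fast, slow)
```

On a file of around 1 GiB uncompressed, the gap between the two calls should be easy to see with a timer.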

Those last four bytes are the uncompressed length of the last gzip member, modulo 2³², assuming that there is no junk at the end of the gzip file.

You will notice three caveats in that sentence. First, as you have already noted, the uncompressed size needs to be less than 2³² bytes for that number to be meaningful. However, you can't necessarily tell by looking at the compressed file if that's true or not. gzip can compress to more than a factor of 1024, so the gzip file could be, say, only 2²² bytes in length, 4 MB, but decompress to over 4 GB.
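As a rough illustration of the first caveat, the sketch below compresses 4 MiB of zero bytes, an artificial best case chosen for this example, and then shows the modulo arithmetic for a hypothetical 5 GiB payload. The exact compressed size depends on the zlib build and compression level, so treat the printed ratio as indicative only.

```python
import gzip

# 4 MiB of zero bytes: an extreme, artificial best case for deflate.
raw = bytes(2**22)
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)
print(f"{len(raw)} bytes -> {len(packed)} bytes, ratio about {ratio:.0f}:1")

# What the length field would hold for a hypothetical 5 GiB payload:
true_size = 5 * 2**30
stored = true_size % 2**32
print(stored == 2**30)  # the field would claim 1 GiB
```

So a small compressed file does not prove that the stored length is the real one.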

The second caveat is that the gzip file must have only one member. The gzip format permits concatenated gzip members, for which the last four bytes represent the length of only that last member. There is no reliable way to find the other members, other than decoding the entire gzip file.
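A small sketch of the second caveat, using two tiny made-up payloads: concatenating two gzip members gives a valid gzip stream, but the last four bytes describe only the second member.

```python
import gzip

first = gzip.compress(b"a" * 1000)
second = gzip.compress(b"b" * 10)
combined = first + second  # a valid two-member gzip stream

# The length field at the very end belongs to the last member only.
trailer = int.from_bytes(combined[-4:], "little")

# Decompressing everything yields the data of both members.
total = len(gzip.decompress(combined))
print(trailer, total)
```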

The third caveat is that the gzip file must not have any junk at the end. In general I haven't seen that in the wild, but it is possible for there to be padding at the end of the gzip file, which would again confound finding the length.
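The effect of trailing junk can be sketched the same way. The 512 zero bytes appended here are a hypothetical stand-in for padding; any extra bytes after the last member would shift the length field away from the end of the file.

```python
import gzip

packed = gzip.compress(b"x" * 1000)
padded = packed + b"\x00" * 512  # hypothetical zero padding after the member

clean = int.from_bytes(packed[-4:], "little")  # real length field
wrong = int.from_bytes(padded[-4:], "little")  # reads padding instead
print(clean, wrong)
```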

Bottom line: if it is important to you to reliably determine the uncompressed size, then you can use the last four bytes only if you are in control of the generation of the gzip files, and you can ensure that the content is < 4 GB, there is only one member, and there is no junk at the end.

For your application, you do not need to know the length of the uncompressed data. You should instead base your progress bar on the fraction of compressed data processed so far. You know the compressed size of the file from the file system, and you know how much compressed data you have consumed so far. If the data is approximately homogeneous, the compression ratio will be approximately constant throughout the decompression. For a constant compression ratio, a compressed-data progress bar will show exactly the same thing as an uncompressed-data progress bar.
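One way this could look in Python is sketched below. The function decompress_with_progress and its report callback are hypothetical names for this example, not part of the gzip module. It assumes the raw file's tell() is a good enough measure of compressed input consumed; because the decompressor reads ahead in buffered chunks, the fraction runs slightly ahead of the output actually written.

```python
import gzip
import os
import tempfile


def decompress_with_progress(src_path, dst_path, report, chunk_size=1 << 16):
    """Decompress src_path to dst_path, calling report(fraction) after each chunk."""
    compressed_size = os.path.getsize(src_path)
    with open(src_path, "rb") as raw, open(dst_path, "wb") as out:
        with gzip.GzipFile(fileobj=raw, mode="rb") as gz:
            while True:
                chunk = gz.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
                # raw.tell() is how far the decompressor has read into the
                # compressed file, including some read-ahead buffering.
                report(raw.tell() / compressed_size)


# Usage on a throwaway file; the list stands in for a real progress bar.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.gz")
    dst = os.path.join(tmp, "demo.out")
    payload = os.urandom(1 << 20)  # 1 MiB of test data
    with gzip.open(src, "wb") as f:
        f.write(payload)
    fractions = []
    decompress_with_progress(src, dst, fractions.append)
    with open(dst, "rb") as f:
        restored = f.read()
```

In a real CLI, report would update the progress bar instead of appending to a list. This approach also keeps working for multi-member files and for content over 4 GB, where the trailing length field cannot be trusted.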