I copied a large number of gzip files from Google Cloud Storage to AWS S3 using s3DistCp (as this AWS article describes). When I compare the files' checksums, they differ (md5, sha-1, and sha-256 all show the same mismatch).
If I compare the sizes (in bytes) or the decompressed contents of a few files (with `diff` or another checksum), they match. (In this case, I'm comparing files pulled directly down from Google via `gsutil` vs. my distcp'd files pulled down from S3.)
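That decompressed-content comparison can be scripted. A minimal sketch (using bash process substitution; the helper name is mine, and the filenames are the placeholders from above):

```shell
#!/usr/bin/env bash
# gz_contents_match A B: succeed if the decompressed payloads of two
# gzip files are byte-identical, ignoring all header metadata
# (mtime, filename, OS byte). cmp -s is silent; exit status carries
# the answer.
gz_contents_match() {
  cmp -s <(gunzip -c "$1") <(gunzip -c "$2")
}

# Usage, with the filenames from the question:
# gz_contents_match file1-gs-direct.gz file1-via-s3.gz && echo "contents match"
```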
Using `file`, I do see a difference between the two:
file1-gs-direct.gz: gzip compressed data, original size modulo 2^32 91571
file1-via-s3.gz: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 91571
My Goal/Question:
My goal is to verify that my downloaded files match the original files' checksums, but I don't want to have to re-download or analyze the files directly on Google. Is there something I can do on my s3-stored files to reproduce the original checksum?
Things I've tried:
Re-gzipping with different compression levels: While I wouldn't expect s3DistCp to change the original file's compression, here's my attempt at recompressing:
target_sha=$(shasum -a 1 file1-gs-direct.gz | awk '{print $1}')
for i in {1..9}; do
  cur_sha=$(gunzip -c file1-via-s3.gz | gzip -n -$i | shasum -a 1 | awk '{print $1}')
  echo "$i. $target_sha == $cur_sha ? $([[ $target_sha == $cur_sha ]] && echo 'Yes' || echo 'No')"
done
1. abcd...1234 == dcba...4321 ? No
2. ... ? No
...
9. ... ? No
CodePudding user response:
While typing out my question, I figured out the answer:
S3DistCp is apparently rewriting the "OS" byte in the gzip header, which explains the "FAT filesystem" label I'm seeing with `file`. (Note: to rule out S3 itself causing the issue, I copied my "file1-gs-direct.gz" up to S3, and after pulling it back down, the checksum remains the same.)
Here's the diff between the two files:
$ diff <(cat file1-gs-direct.gz | hexdump -C) <(cat file1-via-s3.gz | hexdump -C)
1c1
< 00000000 1f 8b 08 00 00 00 00 00 00 ff ed 7d 59 73 db 4a |...........}Ys.J|
---
> 00000000 1f 8b 08 00 00 00 00 00 00 00 ed 7d 59 73 db 4a |...........}Ys.J|
It turns out the 10th byte of a gzip file (the OS field) "identifies the type of file system on which compression took place" (gzip RFC 1952):
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
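Per that layout, the OS field sits at offset 9 (the 10th byte). A quick way to inspect it without a hex editor (a sketch, not from the original answer; the helper name is mine):

```shell
#!/usr/bin/env bash
# gz_os_byte FILE: print the gzip OS byte (offset 9) as two hex digits.
# RFC 1952 assigns 0x00 = FAT filesystem, 0x03 = Unix, 0xFF = unknown.
gz_os_byte() {
  dd if="$1" bs=1 skip=9 count=1 2>/dev/null | od -An -tx1 | tr -d ' '
}

# Usage:
# gz_os_byte file1-via-s3.gz     # expected to print 00 per the diff above
# gz_os_byte file1-gs-direct.gz  # expected to print ff per the diff above
```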
Using `hexedit`, I'm able to change my "via-s3" file's OS byte from `00` to `FF`, and then the checksums match.
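The same edit can be scripted instead of done interactively in hexedit. A sketch for batches of files (the helper name is mine; run it on copies, given the caveat that follows):

```shell
#!/usr/bin/env bash
# set_gz_os_unknown FILE: overwrite the gzip OS byte (offset 9) with
# 0xFF ("unknown"), matching the gsutil-downloaded originals.
# printf '\377' emits the single byte 0xFF (octal escape, POSIX-safe);
# conv=notrunc patches in place without truncating the file.
# The OS byte is not covered by the gzip CRC, so the stream stays valid.
set_gz_os_unknown() {
  printf '\377' | dd of="$1" bs=1 seek=9 count=1 conv=notrunc 2>/dev/null
}

# Usage:
# set_gz_os_unknown file1-via-s3.gz
# shasum -a 1 file1-via-s3.gz   # should now match file1-gs-direct.gz
```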
Caveat: Editing a file that will later be decompressed may cause unexpected issues, so use with caution. (In my case, I'm only comparing file checksums, so worst case a file shows as mismatching even though the uncompressed contents are the same.)