git not detecting single-bit error in a tracked file


I recently discovered a single-bit error in a binary file that was included in a Git repository:

$ diff <(xxd old-clone/file) <(xxd new-clone/file)
< 00251230: 0412 c2bd 2e61 efeb 21b4 d904 3388 2539
---
> 00251230: 0412 c0bd 2e61 efeb 21b4 d904 3388 2539

Concerningly, git had not detected that anything was awry. I only found the problem because one of our tests had mysteriously begun to fail, without any committed changes.

When I did a fresh clone of the repository from the server, the test passed again. So I now have two copies of the repo, both checked out at the same commit and both reporting a clean working tree, yet with clearly different versions of this file:

$ cd old-clone && git status && cd ..
HEAD detached at 251265a4
nothing to commit, working tree clean
$ cd new-clone && git status && cd ..
HEAD detached at 251265a4
nothing to commit, working tree clean
$ diff old-clone/file new-clone/file
Binary files old-clone/file and new-clone/file differ

I've confirmed the following (the commands I used are shown after the list):

  1. This file is definitely being tracked (not in .gitignore, not marked --assume-unchanged)
  2. git fsck reports no issues in either repo
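
For reference, these are roughly the commands I ran to check (using the path from the diff above):

$ git check-ignore -v file   # prints nothing: the file is not ignored
$ git ls-files -v file       # prints 'H file': tracked, and not assume-unchanged
$ git fsck                   # no issues reported in either clone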

My understanding was that this is supposed to be impossible:

Git Has Integrity: Everything in Git is checksummed before it is stored and is then referred to by that checksum. This means it’s impossible to change the contents of any file or directory without Git knowing about it. This functionality is built into Git at the lowest levels and is integral to its philosophy. You can’t lose information in transit or get file corruption without Git being able to detect it.
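
(As I understand it, that guarantee comes from git's content-addressing: every object's name is the SHA-1 of its contents, so changing any byte changes the name. For example:

$ echo 'test content' | git hash-object --stdin
d670460b4b4aece5915caf5c68d12f560a9fe3e4
)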

My belief in git's ability to maintain data integrity has been severely shaken.

How can this happen? And how can I ensure that it doesn't happen again (ideally without having to do a fresh clone of the entire repo every time)?

CodePudding user response:

Git has the correct data stored in its object database. But once a file has been checked out into the working tree, git does not routinely re-hash it: git status and git diff trust the stat information (size, mtime, inode) cached in the index, so a change that leaves that metadata untouched, such as on-disk bit rot, goes unnoticed.
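
If you want to verify a checked-out file against what git has stored, one approach is to hash the on-disk bytes yourself and compare (using the path from your question):

$ cd old-clone
$ git rev-parse HEAD:file    # the blob id git recorded for this path
$ git hash-object file       # the blob id of the bytes actually on disk

If the two ids differ, the working-tree copy is corrupt. You can also force git to notice by invalidating the cached stat data:

$ touch file && git status   # git re-hashes the file and reports it as modified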

There are a number of ways this can happen, but most boil down to on-disk bit rot. If your filesystem checksums data (e.g. ZFS), you can be notified of this when you read the file (if you have no redundancy) or when you perform periodic scrubs (with redundancy).
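
For example, on ZFS a scrub re-reads every block and verifies it against its checksum (the pool name "tank" here is just a placeholder):

$ sudo zpool scrub tank
$ zpool status tank          # lists any checksum errors the scrub found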

You shouldn't have been able to download the corruption (even with a TCP/networking error that slips past the transport checksums), because git verifies object checksums as they are received.
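
If you want git to be extra strict about this, you can tell it to fsck every object it transfers:

$ git config --global transfer.fsckObjects true   # applies to both fetch.fsckObjects and receive.fsckObjects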

CodePudding user response:

Use the --full option of git fsck

This question is related: somebody deliberately modified file contents in the object database (as a test) and expected git to find it. git fsck did not report the corruption until they added the --full argument. See this comment below the accepted answer for explicit verification.
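
Concretely, the check would be (run in both of your clones):

$ git fsck --full   # also checks packed objects and alternate object stores, not just loose objects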

What puzzles me is that the official docs state that "full" is now the default, which is why I didn't suggest this earlier. I did not look to see when that changed, but I suppose it's possible that you're using a version of git that does not default to "full" mode.

I suppose it is also possible that git treats implied "full" differently than explicit "full," although I consider that very unlikely.
