Git fails to keep the integrity of a .vmdk file

I've been trying to keep track of my VM state so that I can always revert to an older version in case I mess up during a test.

Let's say I got into the directory, did git init; git add .; git commit -m 'restore point'.

Originally, I had a file named Tiny10.vmdk that weighs around 19 GB, which I then renamed to Tiny10_old.vmdk. Now I suppose we can assume that, as far as Git is concerned, Tiny10.vmdk is no longer in the working directory. So I tried:

git restore Tiny10.vmdk

to revert the changes

However, I found that the new Tiny10.vmdk now weighs only 2 GB and is surely corrupted.

How does this happen? Is it a bug?

Does Git have a size limit on the files it can track?

How do I fix this?

Is it a good idea to track a .vmdk file with Git to begin with?

P.S.: .vmdk is short for VMware virtual disk file

Script used to reproduce bug

git init
git add .
git commit -m 'restore point'
mv Tiny10.vmdk Tiny10_old.vmdk
git restore Tiny10.vmdk

CodePudding user response：

Git in general does not corrupt files, for various reasons, but Git can hit its own limitations. The 2 GB and 4 GB sizes in particular are rather magic on some 32-bit machines, possibly including your Windows system. If that's the case, you'll need a different build of Git and/or a different OS.

Gory details

Git is written mostly in C.¹ C has some simple fundamental data types: char, short, int, long, and, since 1999, long long, plus signed and unsigned qualifiers that can be applied to these types. There's no prescribed mapping from each of these C types to machine-level hardware instructions, but there's a very common set of conventions used to avoid surprises here: char (and its signed and unsigned variants) maps to an 8-bit byte, which can only store values from 0 to 255 inclusive (or -128 to 127 inclusive when signed); short maps to a 16-bit "short word" with range 0 to 65535 or -32768 to 32767; int maps to a 16-, 32-, or 64-bit type; long is at least a 32-bit type, whose 32-bit form has range 0 to 4294967295 or -2147483648 to 2147483647; and long long, if it exists in your implementation, is a 64-bit type with range 0 to 2⁶⁴-1 or -2⁶³ through 2⁶³-1.²

The C code that Git uses was, for a long time, pre-C99 and avoided direct use of long long. I think this has finally been relaxed now (though other C99 imports such as declarations inside for loops are still being avoided), but if we do avoid long long and max out at long, this may max out at 4 GB (4294967295 or 0xFFFFFFFF). When signed, it may max out at 2 GB. Adding 1 to an unsigned long that holds the maximum possible value produces a variable holding zero (the usual modular arithmetic result we'd expect, in other words). When using signed numbers, C doesn't prescribe whether we get this kind of wraparound, or an overflow trap, or "sticky" arithmetic, or whatever, but for compatibility with popular implementations we generally see the same kind of wraparound, so that 2147483647 + 1 equals -2147483648 (0x7FFFFFFF + 1 = 0x80000000, which is then treated as the most negative two's complement value).

When a C-based Git implementation has these 32-bit limitations, the largest possible file size is either 2 GB or 4 GB (minus one), depending on whether the file size is stored in a signed or unsigned integer. Ideally, C Git on a Windows system should at least notice that some file is bigger than it can store, and give you an error, rather than taking its size mod 2³¹ or 2³² and using that and pretending all is well. You might consider filing this as a bug against your particular Windows version of Git.

¹There is a JGit Java version, and a version in Go, and apparently one happening in Rust and probably other languages as well. But Git-for-Windows and unqualified "Git" tend to refer to the C version. Every version has its own quirks, so if you're hitting a quirk and you might be using something other than C Git, find out what specific implementation you're using. Even if you're using C Git, it has a long history, with various versions having various bugs, so check which version you're using.

²I've typed these in from memory without looking closely as I typed so beware typos. Use a binary calculator to find 2³² and so on to double check exact values.