Why does git ignore HEAD when figuring out a minimal delta to push? Any way to fix this (without creating a named ref on the remote)?


tl;dr: here's a repro of the issue:

#!/usr/bin/env bash
set -euo pipefail

git -c advice.detachedHead=false clone --depth=1 -b v2.38.0 https://github.com/git/git.git dummy_git1
cp -arT dummy_git1 dummy_git2

git -C dummy_git2 tag -d v2.38.0        # now dummy_git2's detached HEAD is the only pointer to the commit
git -C dummy_git1 branch dummy_branch1  # a branch in dummy_git1 at that same commit

echo "The command below should finish immediately, but it actually slowly copies all the blobs:" 1>&2
(set -x && git -C dummy_git1 push ../dummy_git2 dummy_branch1:dummy_branch1)

rm -rf dummy_git1 dummy_git2  # clean up

When git push is deciding which blobs to send, it appears to take all(?) of the remote's existing refs into account so that it transfers the minimum amount of data, except for HEAD.

When the remote HEAD is attached to some branch, this isn't an issue, because that branch is still considered as a base for the delta. However, when the remote HEAD is detached and is therefore the only pointer to the relevant commit, git doesn't take it into account. Instead, it pushes everything as if the remote didn't have any of the blobs, even when the remote in fact already has all of them.

Basically, this means that git starts from named refs rather than from the commits the remote actually has when computing deltas, resulting in unnecessarily large and slow data transfers even when the necessary blobs already exist on the remote.

I have 2 questions:

  1. Why does this occur?

  2. Is there any way to work around this without creating a named ref on the remote?

CodePudding user response:

When the remote HEAD is attached to some branch, then this isn't an issue, because that branch is still considered as a base for the delta.

This is kind of close, but not quite right. It's not a question of whether the remote repository's HEAD is attached to a branch name. It is, instead, a question of whether your local Git can figure out an appropriate "thin pack".

  1. Why does this occur?
  2. Is there any way to work around this without creating a named ref on the remote?

The answer to 1 is complicated, but the answer to 2 is easy enough: "no".

There's a huge amount of detail here and it's easy to miss (or purposely gloss over) some item, so the description I'm about to give might miss things, but you need to remember that Git does all its work locally. Moreover, even when using file:// URLs or equivalent, the "local" and "remote" Git instances still talk to each other using the usual protocols (mostly).

Let's start out with this:

Object IDs (OIDs) are universal

A Git repository consists of two databases, one holding branch and tag and other names (mapping them to OIDs, one OID per name) and an objects database. Both are simple key-value stores and we mostly don't care about the first one here (we use it only to find the OIDs), so we concentrate on the second database.

This second database uses the OIDs as the keys; the data associated with each OID is the commit, tree, blob, or annotated-tag object data (including the type and length prefix). At the object-access level, the object data are always complete: a large object, e.g., 100 MBytes, is always the full 100 MB. The OID itself for the object is simply the result of running some checksum algorithm over the full data (including the header).
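To make the two databases concrete before we go on, here's a small sketch; any ordinary repository will do, and these are all standard plumbing commands:

git show-ref                # the names database: each ref and the OID it maps to
oid=$(git rev-parse HEAD)   # resolve a name to an OID
git cat-file -t "$oid"      # the objects database: the type stored under that key ("commit")
git cat-file -p "$oid"      # and the object data itself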

Git currently uses SHA-1, with the option of using SHA-256 instead. SHA-256 support is a bit spotty as yet, and there's no way to interconvert, so in practice the OIDs are all SHA-1 hashes. This isn't all that important here, but it helps as a concrete example: two different Git software implementations will, at various times, exchange OIDs so that one Git can tell if another Git has some object, or not.
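If you want to convince yourself that an OID really is just a checksum over header-plus-data, here's a sketch (it assumes the repository uses SHA-1 and that a sha1sum binary is on your PATH; "hello world" plus a newline is 12 bytes, hence the "blob 12" header):

echo 'hello world' | git hash-object --stdin
printf 'blob 12\0hello world\n' | sha1sum | cut -d' ' -f1
# both commands print the same 40-hex-digit OID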

Smart vs dumb transports; Git commit graphs

Before we get any further, we need to mention that transports (ssh, https, etc) come in two flavors, in Git: a dumb transport works with single objects at a time, and always sends or receives the entire object. You never get any delta compression at all here. For this reason, dumb transports aren't used very much.

A smart transport lets two Git implementations interact more tightly. The server (receiving for git push, sending for git fetch) and client (reverse these roles) can send "have" and "want" messages. (Before this, they can also agree on capabilities, but we don't need to worry about that here.)
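If you're curious, you can watch this exchange for the repro at the top of the question (while dummy_git1 and dummy_git2 still exist, i.e., before the final rm -rf). GIT_TRACE_PACKET is a standard Git debugging variable, though the exact output format varies across versions:

GIT_TRACE_PACKET=1 git -C dummy_git1 push ../dummy_git2 dummy_branch1:dummy_branch1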

Next, we need to take a look at the Git commit graph. The graph consists of objects, of the four object types, but three of those four object types can refer to other objects:

  • an annotated tag has a target object: it stores the hash ID of the object it is tagging (which may be any of the other object types, but is usually a commit);
  • a commit has parent commits and one tree object: it stores the hash IDs of some previous set of commits, and of one tree;
  • a tree object contains a flat list of <mode, hash-ID, name> tuples, where the mode defines the type of object the hash ID refers to, and the name is a name component (the parts that go between the forward slashes, in path/to/file, for instance); and
  • a blob object contains uninterpreted data.

Trees refer to sub-trees, and also to files (mode 100644 and mode 100755), symbolic links (mode 120000), and gitlinks (mode 160000). Sub-trees represent more tree objects; files and symlinks use blob objects to hold the file data or symlink target; and gitlinks are terminal data (don't connect to anything in this repository—they are hash IDs that are just assumed to exist in some other repository).
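You can dump any tree object to see these <mode, hash-ID, name> tuples for yourself; a sketch for an arbitrary repository:

git cat-file -p 'HEAD^{tree}'
# each line has the form: <mode> <type> <hash-ID><TAB><name>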

By starting at the appropriate names (branch names, tag names, replacement object names, etc) and traversing this graph to produce a transitive closure, we can find out which objects are required to make that set of names "work". Adding depth limiters (--depth n for some integer n) lets us limit the graph traversal.

To copy all of a repository, we simply do a full traversal of all reachable objects to get the full transitive closure from all names. If we want to fetch or push from a smaller set of names, we can do a full traversal from those names only. This produces a set of OIDs that must be present.

This is not super-efficient, obviously, but it's our basic starting point.
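The raw traversal is available as plumbing, if you want to see it (a sketch; on a big repository it is, as noted, not fast):

git rev-list --objects HEAD | wc -l    # every commit, tree, and blob reachable from HEAD
git rev-list --objects --all | wc -l   # the transitive closure from all refs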

Compression, and loose and packed objects

Because committed files are often very similar to (while not being 100% exactly the same as) previous committed files, we'd like Git to do some kind of fancy compression. There are many options for compression; the two Git uses are zlib and delta encoding.

Zlib compression is usually pretty fast, so Git always does it for every object. This includes what Git calls loose objects, which are otherwise stored as the full object data. Regardless of how small or large the object is, Git compresses it (including its header) during or after computing the hash ID of the uncompressed data to find the OID, assuming this is a new object, and then stores that object in the computer's file system as a file whose name is made from the OID.
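A quick sketch of where a loose object ends up; run this inside a scratch repository you don't mind writing an object into (-w actually stores it):

oid=$(echo 'hello world' | git hash-object -w --stdin)
ls -l .git/objects/"${oid:0:2}/${oid:2}"   # the zlib-deflated loose object file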

Loose objects are therefore not delta-encoded, and if we have made ten versions of some moderately large source file, creating ten separate loose objects, Git might be able to store them more efficiently. Git therefore will run git pack-objects now and then to create a pack file from various objects. To make such a file, Git takes similar objects and groups them together and does a lossless, binary-data delta compression, taking runs of bytes from "earlier" objects to use in "later" objects. These deltas can be chained, e.g., the "latest" object might say "take 4000 bytes from earlier object" but the packed earlier object starts with "take 500 bytes from still-earlier object".

Note that "earlier" and "later" here are kind of arbitrary: in terms of delta compression, there's no particular reason to use an object created in April before one created in May. In practice, we tend to use newer commits more often than older ones, so Git tries to keep the newer objects towards the head of the delta chain, and older ones towards the back. This is tricky since objects themselves have no date information: instead, Git propagates commit and tag date information backwards into the objects during the graph traversal, as a sort of best-effort thing. It actually works pretty well in practice.

Now, within any one given repository, there's also a secondary rule: if a packed object (e.g., hash ID P3) refers to another packed object (hash ID P2) that refers to another object (hash ID P1), all three of those packed objects must be present in that pack file. This means that if you have a pack file open, and are trying to read an object out of it, you won't have to open any other objects: everything you need is here, in this pack file.

This is bad for fetch/push, so Git adds an exception to this rule: a thin pack may refer to objects by their hash ID, yet not contain those objects.

Smart transports use thin packs

We can now consider how git push works for your case. Assuming a smart transport (which is the usual case, and is in effect here), the sending Git acts as the client: it starts up and calls the server, which is the receiving Git.

The sender now asks the receiver to list out their branch and tag and other names, plus the corresponding OIDs. For each OID, we—as the sender—can assume that they have those objects.
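You can see that advertisement yourself with ls-remote (a sketch against the repro's repositories, before the final rm -rf). After the tag has been deleted, dummy_git2 should advertise nothing but HEAD, which, per the question, the sender doesn't end up using as a delta base:

git -C dummy_git1 ls-remote ../dummy_git2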

There's a further assumption we can make if they tell us that they're not a shallow repository:1 if they have some commit object, they have every earlier commit as well. They have all the objects (tree and blob objects) that go with those commits and all earlier commits.

If they are a shallow repository, though, we can only assume that they have those particular commits.

We now look in our own repository to see which commit(s) and/or other objects we want to send. We need to send the latest commit that we are explicitly git push-ing, of course; but we also need to send every earlier commit we have up until the point where we hit a commit that they have. If they say that they have commit X, for some X, and our graph traversal hits X and we're not doing the shallow clone case, we can stop here!
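For the repro, you can do this traversal by hand (again, before the final rm -rf). Here we manually supply the commit that the remote's detached HEAD points at (the v2.38.0 commit) as a "have"; if git push treated that ref as a "have", the set of objects to send would be empty:

git -C dummy_git1 rev-list --objects dummy_branch1 --not v2.38.0 | wc -l   # prints 0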

We now have a list of objects, locally, that we think they need. We make a thin pack out of these objects, delta-compressing against objects that we've determined they have, provided we ourselves have those objects. If we are a shallow clone, we might not have those objects in the first place, in which case we can't do good delta compression.

One way or another, we build a thin pack and send it over. The server, receiving the thin pack as part of the git push, "fixes" the thin pack by "fattening it" with any delta base objects required by the deltas, using the objects it already has locally.
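The plumbing involved looks roughly like this (a sketch, not the exact invocation git push uses; --thin and --fix-thin are real options of pack-objects and index-pack, but <commit-they-have> below is just a placeholder for whatever "have" OIDs the negotiation produced):

# sender: pack everything reachable from dummy_branch1 but not from the
# placeholder <commit-they-have>, allowing deltas against objects we assume
# the receiver already has
printf '%s\n' dummy_branch1 '^<commit-they-have>' |
  git pack-objects --revs --thin --stdout > thin.pack

# receiver: index the pack and "fatten" it by adding the missing delta bases
git index-pack --stdin --fix-thin < thin.pack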

(The receiver now runs any pre-receive and update hooks as usual, and if all goes well, updates name(s) in its names database as the sender requests at the end of the git push. In modern Git, the pack, now fattened, gets moved into the receiver's object database, out of the quarantine area. In old Git, it's already there: there was no quarantine area.)


1Git's newfangled partial clones may blow a big hole in this: it's a design choice as to whether to treat a partial clone like a shallow clone, or not, or perhaps even just forbid pushing to partial clones. I have not yet looked to see which choice was made here.


This is the answer to your first question

The lack-of-delta, very-fat "thin pack" occurs because our local Git doesn't realize that the receiving repository already has a sufficient set of delta base objects.

Note that when we (the client) have a shallow clone, we often need a depth-2 or deeper clone to get proper detection, even if the server has the right names and hash IDs. Git's algorithms here could be better (it's possible to go beyond looking at commit hash IDs alone), but the Git authors made some deliberate choices in the graph traversal to favor lower CPU usage for rare-ish (shallow clone) cases.
