In a git push --force:
- does it push ALL objects in the repo, regardless of which of the 4 Git object types they are?
- what goes on inside the Git engine?
Thanks
CodePudding user response:
git push --force does exactly the same thing that git push does. The only difference is that the remote Git is not allowed to reject the push.
[Also: git push does not, in general, push "objects". It pushes commits, with whatever that entails.]
CodePudding user response:
(I'm not sure you really wanted to ask the Big Question you did ask, but I'm answering that one. The little question, of whether --force changes how git push does the rest of its work, is easy: no, it just changes the final requests made to the other Git.)
1. No, or more precisely, it doesn't have to, though it could.
2. It's Complicated.
To answer point #2 properly, we need to look at many things. The first is the overall structure of any Git repository, as seen by some other Git repository (because we're interested in the results of connecting a pair of Git repositories).
A repository is composed of two databases. The first, and usually largest, is a simple key-value database, indexed by hash ID, that holds a collection of Git objects. The most interesting kind of object to most humans is the commit, but since we're going deeper, let's look at the four kinds of objects:
- Tag objects (also called annotated tag objects) hold annotations, plus one object hash ID.
- Commit objects hold structured header text, then a blank line, then general header/subject line data, then another blank line, then a body. The structured headers begin with various keywords such as tree, parent, author, committer, and encoding and end at newlines. The blank lines are required if there are subsequent data, but can be omitted otherwise. The tree line must give the hash ID of some valid tree object; the parent lines must give the hash ID of some valid commit object, one per parent.
- Tree objects hold any number of data tuples (possibly zero, although there's only one empty tree and it's pre-allocated and hence present even if unused). Each tuple holds a mode, a component name, and a hash ID (in that order: see, e.g., this Python code). The component name must meet certain constraints,[1] and the hash ID should generally be that of a valid tree or blob object, with the exception that entries of type "gitlink" (mode 160000) will contain an uncheckable commit hash ID corresponding to some commit that we will simply assume exists in some other Git repository.
- Blob objects hold data bytes. The data bytes are generally uninterpreted and are arbitrary binary data, but when a blob object hash ID stored in some tree object has mode 120000 in that tree object, that blob object is used in a symlink system call, to create a symbolic link, and the OS may constrain this in some way.[2]
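To make the tree tuple layout concrete, here is a minimal Python sketch of a parser for the body of a tree object. It assumes a SHA-1 repository (20 raw hash bytes per entry) and that you already have the uncompressed body without its object header; parse_tree and entry_kind are names I made up for illustration, not anything Git provides.

```python
def parse_tree(body: bytes):
    """Split a raw (uncompressed, header-less) tree body into its tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)    # the mode ends at the first space
        nul = body.index(b"\0", space)   # a zero byte terminates the name
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul]
        hash_id = body[nul + 1:nul + 21].hex()  # 20 raw bytes, assuming SHA-1
        entries.append((mode, name, hash_id))
        pos = nul + 21
    return entries


def entry_kind(mode: str) -> str:
    """Classify a tree entry by its mode string."""
    if mode == "160000":
        return "gitlink"   # commit hash ID in some other repository
    if mode == "120000":
        return "symlink"   # blob whose bytes are the link target
    if mode == "40000":
        return "tree"      # sub-tree (stored without a leading zero)
    return "blob"          # e.g. 100644 or 100755


# A hand-built two-entry tree body: one empty file and one gitlink.
sample = (
    b"100644 README\0" + bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
    + b"160000 sub\0" + bytes.fromhex("a1" * 20)
)

for mode, name, hash_id in parse_tree(sample):
    print(entry_kind(mode), name.decode(), hash_id)
```

The gitlink hash in the sample is a placeholder; as noted above, Git would not try to verify it either.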
Putting this together, we see that three kinds of objects (tree, commit, and tag) refer to other objects. These references, in Git, form a Directed Acyclic Graph or DAG (except for gitlinks, which Git doesn't verify or even attempt to follow at this time). The git fsck command verifies that the DAG is in fact correct.
The second database in a Git repository is also a simple key-value store, but this time the keys are names (generally beginning with refs/heads/, refs/tags/, and so on, which represent branch and tag names and other such names) and the values are hash IDs. Each key maps to exactly one hash ID, and that hash ID has to be the hash ID of some object within the objects database.
These names act as entry points into the DAG. Then, since objects in the DAG contain other object hash IDs, performing a transitive closure operation will find all the objects that are reachable from this name.
This transitive closure trick is how git push works, except that it's more complicated than that.
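As a rough illustration of the transitive closure idea, here is a small Python sketch over a toy object database. The object IDs and the dictionary layout are invented for the example; a real repository stores hash IDs and finds the outgoing references by parsing each object.

```python
def reachable(objects, start_ids):
    """Return every object ID reachable from start_ids.

    objects maps an object ID to the list of object IDs it refers to.
    """
    seen = set()
    stack = list(start_ids)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen


# Toy objects database: commits refer to a tree and parents, trees to blobs.
objects = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["blobA", "blobB"],
    "tree1": ["blobA"],
    "blobA": [],
    "blobB": [],
    "orphan": [],   # not reachable from any name
}

# Toy names database: each name is an entry point into the DAG.
refs = {"refs/heads/main": "commit2"}

print(sorted(reachable(objects, refs.values())))
```

Note that the "orphan" object never shows up: nothing reachable from a name refers to it.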
[1] In particular, it cannot contain a zero byte (this terminates the component name) and it must not match .git in any mix of upper or lower case. It should not contain a slash either, although I have not checked what happens if you feed Git such a tree entry.
[2] The OS's constraints are up to the OS, and some OSes lack symbolic links in the first place, so Git doesn't make up its own constraints.
git push syntax
The simplified syntax for a git push operation is:
git push remote-or-URL refspec ...
where the ... indicates that the refspec can be repeated.
The remote-or-URL is a way to locate the other Git repository, for instance by invoking a network connection. Using a named remote, such as origin, enables several useful features that we can ignore here.
Ignoring numerous special cases, the refspec in a git push consists of two parts: a source and a destination. These are separated by a colon: e.g., git push origin mybranch:yourbranch. It's common, but not necessary, to use branch names on both "sides" of the colon, and when we do, we can omit the refs/heads/ part.
Our Git commands, on our side of the git push, use the source part to locate specific objects to send:
git push origin a123456:refs/heads/somebranch
is a valid way to locate commit a123456 in our repository, if a123456 is in fact a shortened but valid commit hash ID. However, now we can't omit the refs/heads/ part.
We don't have to supply a commit hash ID as the source. Commit or tag hash IDs are the most common, though, and if we use a branch or tag name for our end of the git push, our Git resolves that name to its underlying object automatically. This also enables us to drop the refs/heads/ or refs/tags/ from the destination part of the refspec, and even drop the colon as well:
git push origin mybranch
is short for git push origin mybranch:mybranch, which in turn is short for git push origin refs/heads/mybranch:refs/heads/mybranch, assuming mybranch is a branch name in my Git repository.
In the end, then, given a src:dst style refspec, the src part locates an object in our repository, and the dst gives a name. This name is the name we'll ask the other Git repository to create, update, or delete. Omitting the :dst means "use the same name as we used for our src", which requires us to use a name, not a raw hash ID.
Note: we can supply a tree or blob hash ID as the source. This is rarely—maybe even never—necessary. If we do that, though, the destination acquires additional constraints (some of which are enforced by the other Git repository). In particular, branch names are only ever allowed to store commit hash IDs, so if we ask them to set a branch name to point to a tag, tree, or blob, they'll refuse. But if we ask them to create a new tag name, they'll generally obey this request.
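The shorthand rules described above can be sketched in Python. This is a deliberately simplified model, not how Git implements it: expand_refspec is a hypothetical helper, local_refs stands in for our names database, and the many special cases (HEAD, deletions, ambiguous names, push.default, and so on) are ignored.

```python
def expand_refspec(refspec, local_refs):
    """Expand a push refspec into (force, source_hash_id, full_dst_name).

    local_refs maps fully qualified names (refs/heads/...) to hash IDs.
    """
    force = refspec.startswith("+")      # a leading + means "force this one"
    if force:
        refspec = refspec[1:]
    src, colon, dst = refspec.partition(":")
    if not colon:
        dst = src                        # no :dst means "same name as src"

    # Try to resolve src as a name: as given, then as a branch, then as a tag.
    full_src = None
    for prefix in ("", "refs/heads/", "refs/tags/"):
        if prefix + src in local_refs:
            full_src = prefix + src
            break

    if full_src is None:
        # Treat src as a raw hash ID: there is no name to borrow from,
        # so the destination has to be fully qualified.
        if not dst.startswith("refs/"):
            raise ValueError("raw hash ID needs a full destination name")
        return force, src, dst

    if not dst.startswith("refs/"):
        # Borrow refs/heads/ or refs/tags/ from the resolved source name.
        namespace = full_src[: full_src.index("/", len("refs/")) + 1]
        dst = namespace + dst
    return force, local_refs[full_src], dst


local_refs = {"refs/heads/mybranch": "a123456"}
print(expand_refspec("mybranch", local_refs))
print(expand_refspec("+mybranch:yourbranch", local_refs))
print(expand_refspec("a123456:refs/heads/somebranch", local_refs))
```

The third call works because the destination is spelled out in full; with a bare somebranch as the destination, this sketch raises an error, mirroring the "now we can't omit the refs/heads/ part" rule.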
We can finally answer the --force part, but not yet the other two parts
The force flags, when used, come in late in the git push operation. After we've sent some set of objects, that's when we send over some name(s) and ask-or-command the other Git to set those names.
If the names we send are branch names, and those branch names already exist, the other Git will apply graph topology rules to the name request. The fact that we're asking them to set an existing branch name guarantees that the name did point to some existing commit in their repository before, and that we're now asking them to point to some other commit that will be in their repository if they accept our git push.[3] They now check to see if the new commit is a descendant of the commit the name currently points to (a successor node in the DAG). If so, the push is allowed. If not, the outcome depends on whether a force flag was used.
There are at this point two different kinds of force flag, though: there's the full-blown --force (also represented by a leading + in a refspec), and there's --force-with-lease, called compare-and-swap internally. For the latter, the sending Git sends over a more complicated update request. Instead of either:
- if it's OK, please set _____ (branch name) to _____ (hash ID) (regular push);
- set _____ (branch name) to _____ (hash ID)! (forced push)
we send them a request of the form:
- I believe _____ (branch name) is currently _____ (hash ID). If so, set it to _____ (new hash ID). If not, tell me I'm wrong. (--force-with-lease, or compare-and-swap)
The use of this last form enables us to run git fetch to get commits from them, verify which commits we're asking them to discard, and then specifically direct them to discard certain commits if their status is unchanged from our earlier fetch. If the lease-force is rejected, we know that they have acquired some update since then, and we should run git fetch again.
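The three kinds of update request can be modeled with a short Python sketch. This is only a toy version of the receiver's branch-name check, under the assumption that the history is a simple parents map; decide_update and the mode names are invented here, and real receivers also run hooks and configuration checks that can still refuse a push.

```python
def is_ancestor(parents, old, new):
    """True if commit old is reachable from commit new by following parents."""
    stack, seen = [new], set()
    while stack:
        commit = stack.pop()
        if commit == old:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def decide_update(parents, current, new, mode="normal", expected=None):
    """Decide whether a branch name may move from current to new."""
    if mode == "force":
        return True                      # "set it!"
    if mode == "lease":
        return current == expected       # compare-and-swap on the old value
    return is_ancestor(parents, current, new)   # regular push: fast-forward only


# Toy history: A <- B <- C on their side; we rewrote C into D on top of B.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}

print(decide_update(parents, "C", "D"))                              # regular push
print(decide_update(parents, "C", "D", mode="force"))                # forced push
print(decide_update(parents, "C", "D", mode="lease", expected="C"))  # lease holds
print(decide_update(parents, "C", "D", mode="lease", expected="B"))  # lease is stale
```

In the last call, our belief about their branch is out of date, which is the situation where we should fetch again before retrying.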
[3] In the old days of Git, git push introduced the objects to their repository immediately, so that by the time they got around to inspecting any name updates and/or running any Git hooks and so on, the objects were already there. This created some serious issues for hosting sites like GitHub, which reject a lot of git push operations. So someone put in a bunch of work so that incoming objects are placed in a "quarantine zone" of sorts. The receiving Git then checks the name-updates. For those that are accepted, the underlying objects then migrate into the "real" database. For those that are rejected, the underlying objects are left in the quarantine zone. When the push is complete (with whatever success or failure status it has) the quarantine zone is erased.
What actually gets sent
First, we should note that there are no constraints on the sending Git. The sending Git can send whatever objects, with whatever hash IDs, it would like. If we are the receiver, and the sender sends us objects we already have, we simply ignore the excess objects. So they can send too much.
If we're the sender, though, it probably makes sense for us to avoid sending 10 gigabytes of Git repository data if we have only one new commit with one new file. So we should inspect the commits and other objects that we propose to send, and minimize the number of objects we send.
As we noted above, the transitive closure trick tells us what we would have to send, if the receiver has nothing at all. And if the receiver does have nothing, that's what we'll send: we'll use the src part of each of the refspecs involved in this git push to do a graph traversal, find all the objects reached in this traversal, pack them (into a so-called pack file) in any modern Git transport, and then send the pack file.
Aside: pack files. An object is a stand-alone entity in Git. Objects can be loose, in which case they're found in the Git repository as a file whose path name is built up from its hash ID. For instance, a blob object with hash ID 746c40be01c06f856ad828062678636ab2a04733 would be found in .git/objects/74/6c40be01c06f856ad828062678636ab2a04733, as a loose object. Loose objects are zlib-compressed and (because it's the content bytes that form the hash ID) automatically de-duplicated, but that's it for compression. But many files are modified over time, with the modifications being small insertions and deletions. Using a delta encoding would allow Git to store multiple objects in a way that takes less space, and Git does exactly that. The deltified objects are stored in a pack file.
A pack file, then, contains multiple objects. Git places some extra constraints on normal pack files, then removes those constraints for what git push and git fetch send, which are then called thin packs. We can pretty much ignore the details here, and just think of a pack file as "a bunch of objects, delta-compressed against other objects we know are in the repository" since that covers both kinds of pack files.
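For the loose-object half of this aside, here is a small Python sketch of how the hash ID and the path name relate. It assumes a SHA-1 repository, where the hash covers a short header (type, size, zero byte) followed by the content bytes; the function names are mine, chosen for the example.

```python
import hashlib
import zlib


def object_bytes(kind: str, data: bytes) -> bytes:
    """Header plus content: the bytes that get hashed and compressed."""
    return f"{kind} {len(data)}".encode("ascii") + b"\0" + data


def hash_object(kind: str, data: bytes) -> str:
    """Hash ID of an object, assuming SHA-1."""
    return hashlib.sha1(object_bytes(kind, data)).hexdigest()


def loose_path(hash_id: str) -> str:
    """Where a loose object with this hash ID would live."""
    return f".git/objects/{hash_id[:2]}/{hash_id[2:]}"


def loose_file_contents(kind: str, data: bytes) -> bytes:
    """What would be written to that path: zlib-compressed header + content."""
    return zlib.compress(object_bytes(kind, data))


empty_tree = hash_object("tree", b"")
print(empty_tree)              # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
print(loose_path(empty_tree))
print(loose_path("746c40be01c06f856ad828062678636ab2a04733"))
```

Because the hash depends only on the type and the bytes, two identical files always map to the same blob, which is where the automatic de-duplication comes from.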
Our goal as a sender, then, is to discover which objects are already in their repository. As it turns out, with certain exceptions,[4] we have a really good, and really simple, way to figure that out.
When we start a git push or git fetch, we call up the other Git repository (the one we're going to get objects from, or give objects to) and they tell us about their branch and tag names and other names. Since they are a Git repository, they have a DAG of objects too. Their DAG is just as complete as every DAG (with those exceptions in footnote 4). So if they have, say, commit a123456 and we're going to send them commit b789abc, we can look at our commit graph:
- We have b789abc and they don't: we need to send this commit object;
- it uses some tree and blob objects too, so we need to send those, but
- the parent of b789abc is a123456, which they already have, so we don't need to send that commit or any earlier (really, predecessor) commit, and
- furthermore, since they have a123456, they have all the trees and blobs that go with that commit and every predecessor commit ...
so we can assume they have all those objects! This gives us the opportunity to compress our pack file by not sending any of those objects and by using delta-encoding against all those objects. So we can figure out that, say, we do have to send blob feedbac, but even though that file is 10 MB, it's just one line added to blob babb1e5 that they already have, so let's make a pack file that says: to make object feedbac, add this one line to object babb1e5.
Having built up this pack file—specifically a "thin" pack—of all the objects we need to send, we send over the thin pack. They will later "fix" the "thin pack" by adding to it all the referenced objects (that's the extra constraint Git has on non-thin pack files, that they be self-contained), but we get to send them just a few thousand bytes, perhaps, even though we're sending a commit that contains a 10 MB file.
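The "what do we need to send" reasoning boils down to a set difference between two reachability walks. Here is a toy Python sketch using the hash IDs from the example above; the tree names and the unchanged blob are made up, and real Git works from what the other side advertises rather than from a full walk of everything they have.

```python
def reachable(objects, start_ids):
    """Every object ID reachable from start_ids in the toy database."""
    seen = set()
    stack = list(start_ids)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen


def objects_to_send(objects, want, have):
    """Objects reachable from what we push, minus what they must already have."""
    return reachable(objects, want) - reachable(objects, have)


# Our repository: b789abc is the new commit, a123456 is its parent.
objects = {
    "b789abc": ["tree-new", "a123456"],
    "a123456": ["tree-old"],
    "tree-new": ["feedbac", "blob-same"],   # feedbac: the changed 10 MB file
    "tree-old": ["babb1e5", "blob-same"],   # babb1e5: its previous version
    "feedbac": [],
    "babb1e5": [],
    "blob-same": [],                        # an unchanged file
}

print(sorted(objects_to_send(objects, want=["b789abc"], have=["a123456"])))
```

Only the new commit, its new tree, and the changed blob come out; the changed blob is then a candidate for delta-encoding against babb1e5, which we know they have.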
[4] The main exception used to be shallow clones; now there are also partial clones, which are handled very differently and, I believe, currently don't count here. (In some cases, maybe they should count, but it's not clear which ones.)
Getting this completely right and minimal is computationally expensive
Actually getting a totally-minimal thin pack to another Git would in some cases take hours of computation. The object weeding and pack compression that Git does with git push and git fetch (these two are pretty similar internally, except for the transfer direction, and the lease stuff that's all push-only) is a compromise, where Git tries to be reasonably fast about it, and reasonably minimal.
If you poke at the edges of Git, especially shallow clones, you'll find some cases where Git does poorly. One of the more interesting cases is that pushing from a shallow clone always does poorly if it's a --depth 1 clone. Doing a --depth 2 clone before working and pushing will often give you a much better git push experience later, regardless of any --force.