I have an application that needs to run `git add`/`git commit`/`git push` on every single file I'd like to push, in order to trigger a GitLab job for each one.
My problem is that git is taking a long time to run the `git push` command.
Here are the commands I'm using:
git add myFile.json
git commit myFile.json -m "commitMessage"
variable1=$(git rev-parse HEAD) # Storing the last commit hash in a variable
# Pushing only one specific commit to (maybe) make it faster
git push origin "$variable1:master"
What I'd like to do is to make the whole process faster. What I've thought about:
- Triggering multiple pipelines with only one `git push` (maybe by running the pipeline on each commit instead of each push), but it doesn't seem possible.
- Doing multiple pushes in one `git push` command, so it doesn't have to redo some of the `git push` setup operations before each file pushed (I have no idea what is happening during the `git push` process, so that idea may be wrong).
Does anyone have an idea on how to make this process faster, using one of my ideas or even a brand new one of yours?
Note: I'm using HTTPS, so SSH solutions probably won't fit here.
Output of push command:
Enumerating objects: 191, done.
Counting objects: 100% (191/191), done.
Delta compression using up to 2 threads
Compressing objects: 100% (109/109), done.
Writing objects: 100% (162/162), 12.20 KiB | 1.74 MiB/s, done.
Total 162 (delta 49), reused 2 (delta 0)
remote: Resolving deltas: 100% (49/49), completed with 1 local object.
To https://localhost/root/xxx.git
14a71aaf..720100ac 720100ac47678fa31f0844a413f05bd0305d179f -> master
Thanks in advance!
CodePudding user response:
One way to increase the speed is to use the ssh `ControlMaster` option. It keeps ssh from reopening a connection each time.
The `ControlMaster` option from the `ssh_config` man page:
Enables the sharing of multiple sessions over a single network connection. When set to yes, ssh(1) will listen for connections on a control socket specified using the ControlPath argument. Additional sessions can connect to this socket using the same ControlPath with ControlMaster set to no (the default). These sessions will try to reuse the master instance's network connection rather than initiating new ones, but will fall back to connecting normally if the control socket does not exist, or is not listening.
For example in my ~/.ssh/config I have:
ControlMaster auto
ControlPath ~/.ssh/mux-%r@%h_%p
ControlPersist 15m
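With that in place, the first ssh connection to a host becomes the master, and later git operations reuse its socket instead of doing a fresh handshake each time. A quick way to confirm that multiplexing is actually in effect (the hostname here is just a placeholder for your GitLab host):

ssh -O check git@gitlab.example.com   # reports whether a master connection is running for this host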
CodePudding user response:
TL;DR
My guess is that you're letting GitLab use a shallow clone, which normally makes things faster, but in this case it's making things much slower.
Long
This is probably the key comment:
The long parts seem to be before Enumerating Objects, and during Writing Objects. The rest is almost instant.
A lot of this gets into the weeds of Git internals, which may change at any time without warning. It's therefore unwise to depend too much on these. Still, here's what's going on:
A Git repository is mostly made up of two databases. One database holds Git objects, and the other holds names: refs or references. The objects are numbered (by hash ID or object ID, OID). The refs turn human-usable names, such as `refs/heads/main` for branch `main`, into OIDs, with just one OID being stored per name.
The OIDs are universal: each object has a unique OID, and all Gits everywhere number identical objects identically. This means that any two Git repositories can meet—whether it's for the first time, or for the nth with a large value of n—and yet one can nearly instantly find out whether the other repository has some object, just by handing over the object's ID. The sending Git lists out certain key OIDs, and the receiving Git responds with an "I have that" or "I want that" response.
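You can poke at both databases directly from any repository. A small illustration (the branch name `master` is an assumption; use whatever branch you have):

git rev-parse refs/heads/master   # names database: turn a ref into its OID
git cat-file -t HEAD              # object database: the type of the object HEAD resolves to ("commit")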
The next thing we need is complicated technically, but possibly easy enough to visualize or understand. A commit object contains known metadata values, including a list of parent commits and one (single) tree object. A tree object consists of repeated tuples giving component names, object types (or "modes"), and object IDs. The objects listed in a tree are generally either another tree, or a blob object. A blob object represents a file's content: the name of that file is produced by stringing together the names of tree objects that led from a commit to that blob.
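You can see this structure for yourself with `git cat-file -p`, which pretty-prints any object (exact OIDs will of course differ in your repository):

git cat-file -p HEAD            # commit: tree <oid>, parent <oid>(s), author, committer, message
git cat-file -p 'HEAD^{tree}'   # tree: mode, type, OID, and component name for each entry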
The parent(s) of any given commit are commits that existed at the time the commit itself was made. Commits are made one at a time. This means there cannot be cycles in the connections from commit to commit: if commit `H` links backwards to commit `G`, commit `G` cannot link forwards to commit `H` because `H` did not exist when `G` was made. `G` can link back to still-earlier commits, but those cannot link forwards to `G` or `H`.
Similarly, a tree object may not refer to itself, nor may any sub-tree within the tree object refer back to the tree object or any sub-tree that's "above" the sub-tree that is doing the referring. That is, if tree OID `9876543` exists, none of its entries can refer to object `9876543`, and none of its sub-trees—say, `5566778`—can refer to `9876543` either. So there cannot be loops in any set of trees found by starting at a commit. These rules mean that a tree is literally a tree, which is a subset of a DAG: see What's the difference between the data structure Tree and Graph?
(Blob objects represent file contents, which are opaque to Git at this level: Git does not have to examine them, and does not do so.)
The end result of all of this is that the commits themselves form a Directed Acyclic Graph or DAG. Meanwhile, the top level tree object within each commit forms a tree of tree objects. So we have a DAG of trees, or DAG of DAGs, however you would like to refer to it; such a composition is itself a DAG. (Note that commits can re-use top level or sub-level trees from earlier commits: that's perfectly fine here as it does not break the DAG rules.)
(Atop all of this, we can have annotated tag objects, which store the hash ID of one target object. Because they're limited to a single target, and Git's hash ID computation rules forbid loops, these just add a few leads-into-an-object nodes to the overall DAG-of-DAGs. They add a little bit of complexity to a visualization, but do nothing to mess with the overall DAGginess.)
What all this boils down to, in the end, is that we have this overall graph structure with constraints: directionality and a lack of cycles. Any such DAG has a reachability property: that is, starting from some node in the graph, there may be other nodes in the graph that we can reach, and there may be other nodes in the graph that we cannot reach, by following the one-way connections: commit `b789abc` has parent `a123456`, so `a123456` is reachable from `b789abc`. As there are no cycles, this by definition means that `b789abc` is not reachable from `a123456`. (You cannot, however, infer the reverse: if node X is not reachable from node Y, that does not mean that Y is reachable from X. Perhaps W or Z reaches both X and Y, but X and Y are merely siblings in a tree, for instance.)
To this, we normally add one more constraint: a Git repository never has a "hole" in it. By this, I mean that if we have some node in the graph, we must always have every node reachable from that node. If `a123456` is the parent of `b789abc`, and we have `b789abc`, we must also have `a123456`. This in turn means that we must have the entire snapshot of `a123456`. If `a123456` has a parent commit, we must have the entire snapshot of that commit too, and so on.
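You can watch the "everything reachable" rule at work with `git rev-list`, which walks the graph from any starting point (the counts you get will obviously differ):

git rev-list --count HEAD             # how many commits are reachable from HEAD
git rev-list --objects HEAD | wc -l   # roughly every object (commits, trees, blobs) reachable from HEAD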
Note the emphasis on the word normally above. When this is the case, if we are the sender and we're doing a `git push`, we can often tell, just by knowing which commits are the latest commits in the receiving Git repository, everything about those commits. That is, if we have new commit `b789abc` and they have its parent `a123456`, we already have `a123456` ourselves. We also have everything reachable from `a123456`. So we know everything about every file they have, at least as far as `a123456` and all of its ancestry is concerned.
This gives a sending Git a huge leg up: it tells the receiving Git I can send you commit `b789abc`, would you like it? The receiving Git might answer with I already have `b789abc`, in which case we know everything we need to know about the receiving Git, or it might say Yes I'd like that. If it says the latter, we, as the sending Git, must now offer the parent `a123456`. They will either respond with I already have it or please send it, after which we'll offer its parent(s), and so on.
At some point, we either run out of commits to send—they have nothing and we must send every object—or we hit some commit that they have, which means that they have that commit and every earlier commit and we, the sender, now know precisely what files they have as well. So we can do a great job of sending them just the commits they need, and just the files they need for those commits, and we can compress those files knowing what earlier versions of those files that they already have.
Note that there's a big overall assumption here, that CPU time is cheap, but network bandwidth is expensive. We use this OID-exchange process to find what they already have, then we prepare the new objects and compress them against the known old objects. This ("compressing objects") part can take a lot of time, depending on how fast our own computation is. But it's usually pretty quick because typically we're sending just one or a few commits, with just one or a few new or modified files each, so there's not much to compress. We then send those objects, and that part is as slow as the network is, but if we did a good job of compressing, we don't have to send many objects and we've compressed them very well against other objects they already have.
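If you're curious, you can watch this have/want negotiation happen during your own pushes by turning on Git's packet tracing; it only adds debug output and doesn't change the push itself (remote and branch names are examples):

GIT_TRACE_PACKET=1 git push origin master   # dumps the wire-protocol exchange to stderr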
Note, though, that if we `git push` the current (`HEAD`) commit, we must send them all parent commits back to the point where our history and their history join up. This maintains the "completeness" or "lack of holes" property. So your code here:
variable1=$(git rev-parse HEAD) # Storing the last commit hash in a variable
# Pushing only one specific commit to (maybe) make it faster
git push origin "$variable1:master"
does no good; you could just `git push origin HEAD:master`, or if your current branch name is `master` on your (sender's) "side", you could just `git push origin master`.
I mentioned above that a repository is made up of two databases. The process I've described so far is all about updating the object database. That's the really important one, because the names database is mainly there to help puny humans: all the machine needs is the raw hash IDs. Git won't go nuts trying to tell the difference between `720100ac47678fa31f0844a413f05bd0305d179f` and `720100ac47678fae1f0844a413f05bd0305d179f`, the way humans would (I made one tiny change here: can you spot it?). But we need to update the names database too, so following the above, the sending Git will send either a polite request, or a forceful command (or perhaps more than one of one or both, with different names in the blanks):
- Please, if it's OK, create or update your name _______ to hold ID _______; or
- Set your name _______ to hold _______!
(There's a third kind, a conditionally forceful update, where the sender says I think your name _______ holds _______. If so, set it to _______! That's the `--force-with-lease` option, and it just enables a safer way to do the command.)
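In command form, the three flavors look like this (remote and branch names are just examples):

git push origin master                      # polite request: accepted only if it's a fast-forward
git push --force origin master              # forceful command: set the name, no questions asked
git push --force-with-lease origin master   # conditional: force only if their ref is where we last saw it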
The receiving Git obeys, or doesn't, at its pleasure according to various rules (hosting servers generally add a bunch of control rules atop the super-simple ones that out-of-the-box Git provides). The sender's job is just to provide the names and hash IDs, along with the polite requests or forceful commands, and take back the reply from the receiving Git: "OK" or "rejected", along with any side messages that might come from a hosting server, e.g., telling a user that he does not have access rights to set that name. The sender then reports the result(s) to the person or process that ran `git push`, and updates any remote-tracking names for the `git push` operations that actually succeeded.
So, if you have several things to send, you can send them all together:
git push origin master mybranch:theirbranch
for instance. This has your sending Git collect the OIDs for `master` and `mybranch` on your side, send any commits and supporting objects required to their Git, then ask (politely) that they set their `master` and `theirbranch` to the OIDs your Git found for your `master` and `mybranch`.
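Applied to your one-job-per-file setup, that suggests a way to pay the push overhead only once: make one commit per file as before, but point a throwaway branch at each commit and push all of those refs in a single `git push`. This is only a sketch of the idea, not something guaranteed by the above; the `per-file/` branch-naming scheme is made up, and whether each new branch actually triggers its own GitLab pipeline depends on your GitLab configuration:

refspecs=()
for f in *.json; do
    git add "$f"
    git commit -m "Add $f" -- "$f"
    # Remember this commit's OID and give it its own remote branch name.
    refspecs+=("$(git rev-parse HEAD):refs/heads/per-file/$f")
done
# One connection, one negotiation, one pack: all refs pushed at once.
git push origin "${refspecs[@]}"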
Things that go wrong: too many names and shallow clones
That's the normal process. Now let's see what might work to confound it.
First, some (not all) existing sending and receiving processes will sometimes go through every name they have in their names database. For repositories with tens of thousands of tag names, for instance, this can take a lot of time. This happens well before the "counting objects" phase even starts. If you have a network monitor or tracer, you will see a lot of data coming from the receiver, listing out all their branch and tag names and corresponding hash IDs, even though the sender doesn't really need all of that. There are some technical improvements here that are being worked on in the C version of Git, but they've been in progress for a long time (multiple years now, I think) and aren't there yet. If this is a problem, the simplest solution is to prune back a lot of the names, but this generally requires archiving, or at least renaming, the existing repository and making a new one (because some hash IDs will be findable only through the old names, and you probably don't want to lose those forever).
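You can get a feel for how big this initial ref advertisement is for your own server with `git ls-remote`, which performs exactly that listing (the remote name is an assumption):

git ls-remote origin | wc -l   # how many refs the server advertises at the start of each push/fetch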
More importantly, go back to the word normally I put in bold-italics above. We can make a kind of Git clone, which Git calls a shallow clone, in which the usual constraints—that there are no "holes" in the graph—are deliberately violated. To implement this, Git writes certain commit hash IDs into a file, saying this commit is assumed to exist, but we don't have it, so we don't know anything about it.
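You can see this marker file in any shallow clone; a quick demonstration (the URL is a placeholder, and note that GitLab CI makes shallow clones for you via its GIT_DEPTH setting):

git clone --depth 1 https://gitlab.example.com/group/repo.git
cd repo
cat .git/shallow        # the "assumed to exist" commit hash IDs
git fetch --unshallow   # fetch the missing history, turning this into a full clone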
When a receiving Git is shallow, the sending Git has a problem, and when a sending Git is shallow, the sending Git also has a problem.