Git commits are snapshots, not diffs. Then why is rebase necessary to remove old commits?

Time:09-09

My understanding is that for all intents and purposes, Git commits are snapshots, not diffs. So even though Git will internally "delta-compress" snapshots to eliminate redundancy, in theory each commit is a full representation of a code base at any one point in time, and it doesn't need a previous commit to make sense. (I'm aware that I could be mistaken at this point but this is my current understanding.)

So if that's the case, say I have five commits:

A - B - C - D - E

And I decide that I don't care about B to D any more - E is my canonical commit.

In this case, I guess I would use git rebase -i HEAD~4 and then either drop or squash commits B, C and D.

But if commits are snapshots, why do I need to do this? To my mind, this implies that commit E depends on the history contained in B, C and D, and that if Git was naive and allowed me to just "delete" B C and D, then all hell would break loose. And this suggests a diff system, not a snapshot system. Why doesn't Git let me just delete these commits without complaining, rather than asking me to "rebase" them onto A? Wouldn't a fully snapshot-based system allow me to use E as the "canonical" commit and not care about what came before it?

I would appreciate any corrections in my mistaken understanding here.

CodePudding user response:

tl;dr: Yes, changing history in git means rewriting commits, because commits always contain links to their predecessor(s).

But if commits are snapshots, why do I need to do this? To my mind, this implies that commit E depends on the history contained in B, C and D, and that if Git was naive and allowed me to just "delete" B C and D, then all hell would break loose.

Yes, exactly - commit E does depend on the history.

However, this is not because Git uses diffs, but because each commit contains a link to its predecessor (aka "parent").

This is a central design decision in Git: commits are linked into a history via "backlinks". Each commit points to its predecessor (or to multiple predecessors, for merge commits). Crucially, this "pointing" is done by commit ID, which is a hash over the commit's entire content, including the parent IDs. So you cannot just "delete a commit from history": doing so would mean changing the parent of some later commit, and changing the parent changes that commit's ID, so that commit (and every descendant of it) has to be rewritten.

This was done deliberately by Linus Torvalds, to make sure that a specific commit ID always refers to the same code and the same history - important in the case of the Linux kernel, where many people will provide commits.
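To see the parent link for yourself, here is a minimal sketch in a throwaway repository. The repository location, the file name file.txt and the user identity are assumptions made up for the illustration. It prints the raw commit object and then re-hashes those same bytes to show that the commit ID is derived from them, parent line included.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "A"
echo two > file.txt
git commit -q -am "B"

# Print the raw commit object for B: tree, parent, author, committer, message.
git cat-file -p HEAD

# The parent line carries the full ID of A.
parent_line=$(git cat-file -p HEAD | grep '^parent ')
a_id=$(git rev-parse HEAD~1)

# Hashing exactly those bytes as a commit object reproduces B's ID,
# so a different parent line would necessarily give a different ID.
b_id=$(git rev-parse HEAD)
rehash=$(git cat-file commit HEAD | git hash-object -t commit --stdin)
echo "$b_id"
echo "$rehash"
```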

CodePudding user response:

The premise of your question: you have

A - B - C - D - E

But you want

A - E

Fine. Let's start by clearing up some misconceptions in your question.

Commits are immutable

Commits are not just snapshots. They are also immutable. A commit different in any way from E would not be E. It might have the same commit message as E, but if it differs at all from E it is a different commit.

Parentage is part of a commit

Well, I just said it, but I'll say it again. Commits are not just their contents. They contain a bunch of other stuff, including, in particular, the information as to who their parent is.

Ok. Now, in your diagram, D is the parent of E. A commit that did not have D as its parent would have a different parent. And we have just said that commits are immutable. Therefore, if we imagine E with A as its parent, that would not be E. Again, we might call it "E" in a kind of informal way; it might have the same commit message as E did; but it would be a different commit.
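A small sketch of that point, assuming a scratch repository with three made-up commits tagged A, B and C. It uses the plumbing command git commit-tree to wrap the very same snapshot in commits that differ only in their parent. The timestamps are pinned so that nothing else varies.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

for name in A B C; do
  echo "$name" >> history.txt
  git add history.txt
  git commit -q -m "$name"
  git tag "$name"
done

# Pin both timestamps so that the parent is the only thing that varies.
export GIT_AUTHOR_DATE="946684800 +0000"
export GIT_COMMITTER_DATE="946684800 +0000"

tree=$(git rev-parse 'C^{tree}')
on_a=$(git commit-tree -p A -m "same message" "$tree")
on_b=$(git commit-tree -p B -m "same message" "$tree")
on_a_again=$(git commit-tree -p A -m "same message" "$tree")

# Same snapshot, same message, different parent: different commit.
echo "C's snapshot with parent A: $on_a"
echo "C's snapshot with parent B: $on_b"
```

Identical inputs give back the identical ID (on_a_again equals on_a), which is the other face of immutability: a commit ID names one exact combination of snapshot, parents and metadata.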

No cutting and pasting

The above is sufficient to explain why you cannot merely "cut" B, C, and D, and then "paste" E to go after A. You cannot "paste" E anywhere! There is only one E — not just in your repo but in the entire universe — and D is its parent, and that's the end of that.

Therefore in order to eliminate B, C, and D from the visible history, we must somehow rewrite E to make a different commit, one that has A as its parent. And that is the most basic reason why rebasing (or something similar) is necessary in order to generate the history you want. The history

A - E

is impossible. What you can have is

A - E'

where the second commit, whose name is read "e-prime", bears some similarity to E in your mind, but is not E. And you need a coherent way to make that. That is why you must do some kind of dance in order to change history.

But that dance need not be rebasing — as we shall now see. Sip your coffee and let's go on.

No rebase necessary

Now let's say that what you mean by your question is this: B, C, and D were steps along the way to the golden truth, but E contains that golden truth and I don't really need to "show my work" to my colleagues and the world. So I'd like to hide the intermediate steps and just go from A to E — sorry, to E' — as a pure and simple statement of what happened.

Then you don't actually need to rebase. Just say

git reset --soft <SHA of A>
git commit -m 'message identical to the E message'

That will result in your desired

A - E'

What did we just do? We started with your own stated fact: a commit contains a snapshot. With E checked out, the working tree and the index already held the contents of E's snapshot. When we said reset --soft, we moved HEAD (and the current branch) back to A while leaving both the working tree and the index untouched. So then, when we made a new commit, that commit was a snapshot of the project in exactly the state that E had described it. But at the time we made this new commit, the HEAD was A. So the parent of this new commit is A! Problem solved.
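Here is the same thing as a runnable sketch in a scratch repository. The five commits, the file history.txt and the tags A to E are stand-ins invented for the illustration. The checks at the end compare the new commit's parent and snapshot against A and E.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

for name in A B C D E; do
  echo "$name" >> history.txt
  git add history.txt
  git commit -q -m "$name"
  git tag "$name"
done

git reset --soft A      # HEAD moves to A; index and working tree still hold E's snapshot
git commit -q -m "E"    # the new commit E', whose parent is A

new_parent=$(git rev-parse HEAD~1)
new_tree=$(git rev-parse 'HEAD^{tree}')
echo "parent of E': $new_parent"
echo "tree of E':   $new_tree"
echo "tree of E:    $(git rev-parse 'E^{tree}')"
```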

So yes, we could have rebased and squashed the intermediate commits to get the same result. But that is just a convenient way for people who don't know how, or can't be bothered, to accomplish the same thing more directly. Git interactive rebase is an intelligent and convenient shorthand for certain common kinds of history transformation; but it does nothing you could not have done yourself, in some other way.

No rebase necessary, ever

I just want to impress upon you that you never need rebase. Everything it does can be done by a series of sometimes tedious and elaborate steps, much more basic and direct and (probably) inconvenient.

For example, suppose you wanted to eliminate B and C but keep D and E (well, actually D' and E', as we already know — D would be replaced by D' because the new commit's parent is A, not C, and E would be replaced by E' because the new commit's parent is D', not D).

You can do it without interactive rebase. Start a new branch at D. Reset that branch soft back to A and commit with D's commit message, just as we did before. Now cherry pick E onto the end of this new branch and give the resulting commit E's old message. Now clean up the branch situation, and you're done.
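Spelled out as commands, those steps might look like the sketch below. It assumes a scratch repository in which each of the commits A to E adds one file of its own (so the cherry-pick applies cleanly). The branch name rewritten is hypothetical. "Cleaning up the branch situation" is taken here to mean moving the original branch onto the new history and deleting the helper branch.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

for name in A B C D E; do
  echo "$name" > "$name.txt"
  git add "$name.txt"
  git commit -q -m "$name"
  git tag "$name"
done
orig=$(git symbolic-ref --short HEAD)   # whatever the default branch is called

git checkout -q -b rewritten D   # start a new branch at D
git reset --soft A               # HEAD moves to A; D's snapshot stays staged
git commit -q -m "D"             # D': D's snapshot with parent A
git cherry-pick E >/dev/null     # E': E's change replayed on top of D'

# Clean up: point the original branch at the new history, drop the helper.
git branch -f "$orig" rewritten
git checkout -q "$orig"
git branch -q -D rewritten

git log --format='%h %s'
```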

I'm not saying that's easier than interactive rebase; it's obviously not. But that's not the point. The point is that interactive rebase is just a crutch. A really great crutch! But it isn't magic.

Drop is not squash

Finally, you said something in your question that was very wrong: you said "drop or squash". There is no "or" here: those are totally different things! Squashing maintains the contents of the last commit in the series of squashes. Dropping does not! If you dropped B, C, and D, the resulting E' would contain a snapshot that looks nothing at all like the current E!

For example, suppose B includes a new file myfile that A does not have. And suppose C and D and E all have that file too. Then dropping B would result in an E' that lacks the file myfile.

This is because, although commits are not diffs, diffs do exist (in Git's mind, not written into the history) and they are used in the process of merging. And rebasing is actually a form of merging (I don't want to get into that just now). So by dropping B, you are leaving out the diff that got you from A to B. And since part of that diff is the creation of myfile, leaving it out is a way of saying not to create myfile. And so myfile would not appear in E' when the rebase was over, even though it did appear in E.
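A sketch of the myfile scenario, assuming a scratch repository where B adds myfile and the other commits each add an unrelated file. Instead of editing a todo list by hand, it uses git rebase --onto A D, which replays only the commits after D (here just E) on top of A. That should be the non-interactive equivalent of dropping B, C and D.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "A"
git tag A

echo hello > myfile
git add myfile
git commit -q -m "B"
git tag B

for name in C D E; do
  echo "$name" > "$name.txt"
  git add "$name.txt"
  git commit -q -m "$name"
  git tag "$name"
done

# Replay only E (the commits after D) on top of A: B, C and D are dropped.
git rebase -q --onto A D

ls   # files in the snapshot of E'
```

The old E is still reachable through its tag and still contains myfile. The new E' does not, because the diff that created myfile was never replayed.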

What to squash

Last but not least: this phrase in your question was wrong too: "squash commits B, C and D". No. To get the result you are after with an interactive rebase, you would squash C, D, and E. In other words, the pick list would look like this originally:

pick f343cc4 B
pick f750aa9 C
pick 0105b79 D
pick 46fe327 E

and you would edit it to look like this:

pick f343cc4 B
squash f750aa9 C
squash 0105b79 D
squash 46fe327 E

You would then select E's commit message as the commit message for the resulting new commit.
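If you want to try that edit without an editor popping up, the todo list can be rewritten by a script. This is only a sketch: the scratch repository and its commits are invented, and the sed -i call assumes GNU sed (BSD/macOS sed wants a different -i syntax). GIT_SEQUENCE_EDITOR and GIT_EDITOR are the environment variables Git consults for the todo list and for the commit message.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

for name in A B C D E; do
  echo "$name" > "$name.txt"
  git add "$name.txt"
  git commit -q -m "$name"
  git tag "$name"
done

# Keep the first todo line (B) as "pick" and turn every later "pick" into "squash".
# GIT_EDITOR=true accepts the proposed combined commit message unchanged.
GIT_SEQUENCE_EDITOR="sed -i -e '2,\$s/^pick/squash/'" GIT_EDITOR=true \
  git rebase -i A >/dev/null 2>&1

# Then give the single resulting commit E's message.
git commit -q --amend -m "E"

git log --format='%h %s'
```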

CodePudding user response:

Well... rebase is just a tool that makes it easier to get what you want, which is to have A <- E. Is that right?

If you want A <- E with contents exactly like those in E, run git rebase -i A, keep B as pick, and set C, D and E to squash.

If you drop B, C and D instead, Git starts from A and then attempts to apply the changes that E introduced relative to D, so you might get conflicts... or you might not. It depends on what A, D and E look like.

Coming back to my point, you can do the same stuff with commands.... the first option:

git checkout E
git reset --soft A
git commit -m "The new E"

Or you could also run:

git commit-tree -p A -m "The new E" E^{tree}

But we are going into rather deep stuff now as that is a plumbing command, not supposed to be used by us laypeople (though it's still possible, as you can see).
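One thing to keep in mind with the plumbing route: git commit-tree only writes the new commit object and prints its ID; it does not move any branch. A sketch, assuming a scratch repository with made-up commits tagged A to E, where the branch is moved afterwards with git reset --hard:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

for name in A B C D E; do
  echo "$name" > "$name.txt"
  git add "$name.txt"
  git commit -q -m "$name"
  git tag "$name"
done

# Create the new commit: E's tree, A as the only parent.
new=$(git commit-tree -p A -m "The new E" 'E^{tree}')
before=$(git rev-parse HEAD)     # the branch has not moved yet

# Point the current branch at it. The working tree does not change,
# because the new commit holds the same snapshot as E.
git reset -q --hard "$new"

git log --format='%h %s'
```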

The second option would be:

git checkout A
git cherry-pick E

So there's not reeeeally any need to use rebase; it's just very simple to use.

And it's definitely based on snapshots.... take a look at https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

CodePudding user response:

See matt's answer for most of the answer. As an answer for a slightly different "why does git rebase exist in the first place" question, though, let's look at what I call the snapshot-diff duality.

Let's suppose we have a "base snapshot" and a diff or patch:

$ git checkout v1.0.7
$ patch -p1 < /tmp/patch-to-v1.0.7

If this patch applies cleanly—and we might expect it to, if it's generated from v1.0.7 of this software—we now have, in our Git repository, the patched source, ready to be used as a new snapshot:

$ git switch -c patch-branch
$ git add .
$ git commit -m "temporary: save the patch I got from Fred"

or whatever.

Meanwhile, once we make this snapshot, we can convert from snapshots to patch:

$ git format-patch --stdout patch-branch^ > /tmp/new-patch

and the new patch should be functionally identical to the original, even if the diff is a bit different for some reason.

Ultimately, given any two snapshots we can form a patch, and given any correct base and a patch we can form a new snapshot. This produces a simple Patch Algebra in which B + P = S, or P = S - B.
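As a rough check of that algebra, here is a sketch with two made-up snapshots tagged base and snap in a scratch repository. It forms P with git diff, goes back to the base, applies P, and compares the resulting tree with the original snapshot's tree.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

printf 'line 1\nline 2\n' > file.txt
git add file.txt
git commit -q -m "base"
git tag base

printf 'line 1\nline 2 changed\n' > file.txt
echo new > extra.txt
git add file.txt extra.txt
git commit -q -m "snapshot"
git tag snap

patch=$(mktemp)
git diff base snap > "$patch"    # P = S - B
git checkout -q --detach base    # back to B
git apply --index "$patch"       # B + P, applied to index and working tree

rebuilt_tree=$(git write-tree)
snap_tree=$(git rev-parse 'snap^{tree}')
echo "rebuilt:  $rebuilt_tree"
echo "original: $snap_tree"
```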

This algebra has some issues though. Given a series of patches P1, P2, ..., Pn, plus a base B, we can make any number of snapshots. But some Pi can cancel out, or partially cancel out, some Pj. So we have to be very careful if we choose to re-order some patches in a set of patches, or drop or augment any patch.

Also, this all falls apart at merges, which have more than one parent. A merge commit M isn't some base plus a patch or patch-series: it's a base plus two patch-series plus one final "combine" operation, and in general, the "combine" operation is not representable as a simple patch. That's true even if the two patch-series both contain just a single patch: it's only guaranteed that the combine operation is null if one of the patch series is completely empty.

The git rebase command is Git's concession to this duality, as far as Git goes with it. Specifically, each commit-to-be-copied is converted to a patch, plus a little bit of extra information. That extra information consists of enough data to identify the specific base version of each file to be patched.

The simplest expression of this in Git is git cherry-pick. (An alternative is that obtained by git format-patch, which encodes the necessary extra information differently.) Given an ordinary (non-merge, single-parent) commit, git cherry-pick compares the snapshot in the commit to the snapshot in its parent. The result is the patch that git show or git format-patch would show: it's what's required to take the previous or parent snapshot (or "preimage") to the subsequent or child snapshot (or "postimage").
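One way to watch the "same delta, different snapshot" idea in action is git patch-id, which computes a hash over a diff's content. In this sketch (scratch repository; the branch name other and all file names are invented) commit B is cherry-picked onto a branch that never had it. The copy gets a different commit ID and a different snapshot, while its patch-id should match B's when the pick applies cleanly.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "A"
git tag A

echo feature > feature.txt
git add feature.txt
git commit -q -m "B"
git tag B

# A second line of history that starts at A and does not contain B.
git checkout -q -b other A
echo unrelated > other.txt
git add other.txt
git commit -q -m "X"

git cherry-pick B >/dev/null     # copy B's delta on top of X

orig_patch=$(git show B | git patch-id --stable | cut -d' ' -f1)
copy_patch=$(git show HEAD | git patch-id --stable | cut -d' ' -f1)
echo "patch-id of B:        $orig_patch"
echo "patch-id of the copy: $copy_patch"
```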

We then want to apply this same delta to some other existing snapshot, usually as stored in the current commit. Sometimes this is trivially easy, but sometimes it is hard. As it turns out, we want exactly the same extra information that we want during git merge operations. That is, the ideal way to apply this delta is to use a diff from that same parent commit to the current commit: this will tell us if some changes need to be moved up or down within some file due to differences in our own setup, or perhaps even moved to a completely different file name due to a rename operation, or whatever.

The git rebase command therefore literally uses git cherry-pick—or more precisely, in many versions of Git up until 2.13, git rebase -i was a shell script that ran git cherry-pick. Since then, the thing Git calls the sequencer (which implemented cherry-pick and revert) has been augmented to implement rebase, so now they're all one thing. In Git 2.23 or so—the release notes for Git make this a little ambiguous—even the non-interactive rebase was converted to use this method; older versions had a git-format-patch and git-am based back-end for non-interactive rebase. The old "am"-based rebase uses "index:" lines in git format-patch output to locate merge base versions of files, rather than using the commit directly, and this still happens for emailed patches.

Either way, though, Git uses the duality trick to convert snapshots to deltas whenever required. The merge operation uses the deltas, plus some shared base version, to "add incoming changes" to some existing set of changes. This uses our patch algebra, with its minor problem of internal cancellations, to get us as close as possible to what we want. (Sometimes the patch algebra cancellations are helpful and sometimes they're harmful, so overall it's a wash.)

In the end, git rebase, regardless of its specific implementation, is just a tool to get what we want. We have to decide what we want and whether this is a good tool. Sometimes there are better tools already.
