Git rebase without having most recent changes in workspace

Time:03-31

Today I was struggling with the following scenario (note: all branches were pushed into the remote repository):

  • Branch X was forked from branch Z and is A commits ahead. I ran git pull and had the most recent changes in my workspace.
  • Branch Y was forked from branch Z and is B commits ahead. I ran only git merge and missed git pull by mistake.

While I was trying to rebase X onto Y using GitHub Desktop, I got the following message: "This will update X by applying its C commits on top of Y". However, A != C, and I spent some time googling to understand where the number C was coming from.

Later on, I realized that I was missing the git merge in branch Y. After executing it, the GitHub Desktop rebase tool gave the same message, this time with C == A.

I am not sure where the number C comes from, nor why C == A after the git merge command. Any hints?

CodePudding user response:

It's hard to say for certain without exact concrete details why you got the particular results you got. But there's a general rule you can use here: git rebase is about copying (some) commits to new-and-improved (or supposedly-improved) commits. That is, you already have some existing commits, but there's something you don't like about those commits. That could include any combination of the following (or anything else you might find objectionable about your commits):

  • one of the commit messages has a typo, and/or
  • one of the commits' changes has a bug, and/or
  • all the commits are good in terms of messages and/or changes, but they start from a commit you don't want them to start from: you want them to start from some other commit.

To understand how this works, let's start with a quick review of the basics of commits and branch names. Feel free to skip this section if you're already familiar with this part.

Review of basics

Every commit:

  • is numbered, with a big ugly hash ID that appears to be random (but isn't) that is unique to that particular commit;
  • is read-only: the hash ID is actually a cryptographic checksum of the contents, so you can't change a commit, you can only take it out and use it to make a new one (a "copy" with at least one change made to it), which will get a different hash ID once made;
  • contains two parts: a full snapshot of every file, plus some metadata.

The snapshot holds each file in Git's internal, compressed, read-only and de-duplicated form, so that if the contents of any one file in any one commit exactly match the contents of any other file in any commit (including this same one), there's just one copy of those contents. That makes it free to repeatedly commit the same files over and over, as there's really only one copy of each file.
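The de-duplication is easy to observe. This is a minimal sketch, assuming a throwaway repository in a temporary directory; the file names a.txt and b.txt are made up for illustration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Two files with identical contents, committed together.
printf 'same contents\n' > a.txt
printf 'same contents\n' > b.txt
git add a.txt b.txt
git commit -q -m "two identical files"

# Each path in the snapshot names a blob by hash ID.
# Identical contents hash to the same ID, so there is only one stored object.
blob_a=$(git rev-parse HEAD:a.txt)
blob_b=$(git rev-parse HEAD:b.txt)
echo "a.txt -> $blob_a"
echo "b.txt -> $blob_b"
```

Both lines should print the same hash ID, which is what "just one copy of those contents" means in practice.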

The metadata of each commit holds things like the name and email address of the person who made the commit, some date-and-time-stamps, and so on. Included in this metadata is a list of previous commit hash IDs. Usually this list is just one entry long, and the one entry in this list is the commit's parent (singular). With ordinary commits, this single parent hash ID results in a backwards-looking chain of commits, which we can draw.

Suppose the latest commit (on some branch) has a hash ID we'll call H for short. Commit H contains a snapshot and metadata, and the metadata for H includes the hash ID of some earlier commit, which we'll call G for short. Commit H therefore points to earlier commit G:

G <-H

But G is a commit, so it has metadata, which points to some earlier commit F, which also is a commit, so it has metadata, which ... well:

... <-F <-G <-H

This chain extends on forever, backwards, or rather, extends backwards until we hit the very first commit ever, which—being first—can't point backwards and just doesn't:

A--B--...--G--H

(assuming just eight commits in the entire repository).
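To see the backwards-pointing chain for yourself, something like the following should work. It is a sketch only: the three commits F, G, and H stand in for the drawing above, and the commit helper is a made-up convenience, not a Git command.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one commit per call, each adding its own file.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
commit F; commit G; commit H

tip=$(git rev-parse HEAD)        # hash ID of H
parent=$(git rev-parse HEAD^)    # the parent hash ID stored in H's metadata (G)
git cat-file -p "$tip"           # raw commit: tree, parent, author, committer, message

# Walking backwards from H, one parent at a time.
chain=$(git log --format=%s | tr '\n' ' ')

# The very first commit has no parent at all.
root=$(git rev-list --max-parents=0 HEAD)
root_parents=$(git log -1 --format=%P "$root")
echo "chain: $chain"
```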

To quickly find that last commit hash ID, Git uses a branch name. Your branch's name, whatever that name is—let's call it main for now—contains the actual raw hash ID of commit H. So the branch name points to H, at the moment:

...--G--H   <-- main

If you have more than one branch name, each name points to one particular commit. That commit is the last commit of this branch, whatever this name is. So, given:

...--G--H   <-- develop, main

we know that commit H is the last commit of both branches. All the commits are on both branches.

Once we check out (or git switch to) one of these two branches, we're "on" that particular branch. Git remembers which branch we are "on" by attaching the special name HEAD to one branch name:

...--G--H   <-- develop, main (HEAD)

Here we're on branch main, as git status would say. We're using commit H, but we're using it via the name main. If we run:

git switch develop

we get:

...--G--H   <-- develop (HEAD), main

We are still using commit H, but now we're using it via the name develop.
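Here is a small sketch of HEAD moving between two names for the same commit, assuming a scratch repository with a single commit H. The commit helper is made up for illustration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one commit per call, each adding its own file.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
commit H

git branch develop                           # second name for the same commit
on_before=$(git symbolic-ref --short HEAD)   # the branch HEAD is attached to
git switch -q develop
on_after=$(git symbolic-ref --short HEAD)

echo "was on $on_before, now on $on_after"
echo "main    -> $(git rev-parse main)"
echo "develop -> $(git rev-parse develop)"
```

The two hash IDs printed at the end should be identical: only the attachment of HEAD changed.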

The setup that causes us to want to rebase

Without worrying about all the details about how we make new commits, let's now make two new commits "on" develop. The first one, which we'll call commit I, will point back to existing commit H, and Git will update the current branch name so that develop now points to I instead of H:

          I   <-- develop (HEAD)
         /
...--G--H   <-- main

The second new commit J will point back to what was current commit I when we made J, and Git will update develop to point to J:

          I--J   <-- develop (HEAD)
         /
...--G--H   <-- main

Now, for whatever reason and with whatever process, we cause our own Git to add a new commit K to branch main. Perhaps we run git switch main and then git pull (which brings over some new commit K and adds it on) and then we git switch develop again, but in any case we now have:

          I--J   <-- develop (HEAD)
         /
...--G--H--K   <-- main

We now decide we like everything about commits I and J, in terms of the changes they make to commit H and then I and in terms of the log messages we put in them. But we don't like the fact that they spring from commit H. We'd prefer to have them spring from commit K. That is, we want our picture to look like this:

          I--J   [abandoned]
         /
...--G--H--K   <-- main
            \
             I'-J'  <-- develop (HEAD)

Commit I' is a new and improved variant of I: it has the same changes to K that I has when we compare I with H, and it has the same log message (and author and committer and so on) as I. But it necessarily has a different hash ID, which makes it I' instead of I. Then commit J' makes the same change to I' that J makes to I, and has the same log message and so on as original commit J. But commit J' has a different hash ID, because it's a different commit, with parent I', and commit I' points back to commit K. That's exactly what we want!

Since we've abandoned the original I-J sequence, and we find commits by having Git start from the branch name and work backwards, we now see only our copied commits. It is as if commits I and J were somehow magically changed. They weren't: they're actually still there, in the repository, and we can see them if we can just find the hash ID of J somehow.1

That's the motivation for a rebase. Now let's take a look at the mechanism.


1Git's reflogs make this easy, but you don't normally see the reflog contents, so you don't normally see the old semi-abandoned commits. Eventually each reflog entry that remembers an otherwise-abandoned commit expires, though, and then Git may eventually discard the commit for real. In a normal everyday repository, this takes at least a month by default.
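As a sketch of how the reflog finds the abandoned commits, assuming the same H, I, J, K layout as the drawings (the commit helper is made up for illustration):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one commit per call, each adding its own file.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
commit H
git switch -q -c develop; commit I; commit J
git switch -q main; commit K
git switch -q develop

old_tip=$(git rev-parse develop)           # original commit J
git rebase -q main
new_tip=$(git rev-parse develop)           # the copy, J'

# The branch's reflog remembers where the name pointed before the rebase.
remembered=$(git rev-parse 'develop@{1}')
git reflog show develop
git cat-file -e "$old_tip" && echo "old J is still in the repository"
```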


How git rebase works on a granular level

To actually do a rebase, Git needs to:

  1. List out the raw hash IDs of the commits to copy.
  2. Choose a place to put the copies, and check out that commit (as a "detached HEAD").
  3. Copy each to-be-copied commit, one by one, using git cherry-pick or something equivalent.
  4. Move the branch name that we were on when we started the whole thing.

(There's an optional step 0, "switch to some other branch", which affects step 4 too, and it has what I consider a bug so awful that you should never use step 0: it leaves you "on" the branch it switches to. That is, if you run this kind of rebase, it doesn't matter what branch you're on when you run git rebase torek-does not-recommend-this. Instead, Git switches to not-recommend-this and then runs git rebase and you end up on branch not-recommend-this. That's too confusing, so don't do that. Run your own git switch or git checkout command yourself as "step 0". But if you personally don't find it confusing, feel free to use it.)

Let me briefly touch on git cherry-pick. I noted above that each commit is a snapshot. It's not a set of changes! And yet, ordinary (non-merge) commits get shown as changes. (Try it: run git show to see your current commit, shown as changes since its parent, or git log -p to see each commit shown as changes. Note that git log -p doesn't bother to show merges as changes: that's too hard.)

Git will show you changes by simply extracting, to two temporary areas (in memory really), the two commits. That is, if we're on commit J:

          I--J   <-- develop (HEAD)
         /
...--G--H--K   <-- main

and we run git show, Git extracts the snapshots for commits I and J. For all the files that are the same in these two snapshots, Git does nothing at all (and this goes very fast due to the internal de-duplication: Git sees that the file README.txt, say, shares one underlying copy in both I and J, and doesn't even bother extracting it). For the files that are different, though, Git takes the two extracted copies and compares them, line by line, playing a sort of game of Spot the Difference. Git then shows you what changed in that file.

Both commits hold a snapshot, but you see a "diff", as if commit J held changes-since-commit-I. It doesn't: what you see is a sort of optical illusion or mirage. Git does this because humans find this view more useful than the true view-as-snapshot.
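A quick way to convince yourself that the diff is computed from two full snapshots is shown below. This is a sketch with made-up file names, with all three commits placed on one branch for brevity.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one commit per call, each adding its own file.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
commit H; commit I; commit J

# Each commit lists every file, not just the ones that changed.
in_I=$(git ls-tree --name-only HEAD~1 | tr '\n' ' ')
in_J=$(git ls-tree --name-only HEAD | tr '\n' ' ')

# The "changes" are the result of comparing the two snapshots.
changed=$(git diff --name-only HEAD~1 HEAD)

echo "snapshot I: $in_I"
echo "snapshot J: $in_J"
echo "I-vs-J:     $changed"
```

Running git show on J should present the same I-vs-J comparison, preceded by the commit's metadata.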

What git cherry-pick does is to use Git's merge machinery to copy some view-as-a-diff from some commit-pair, such as H-vs-I, to some snapshot in some other commit, such as K. For space reasons, we'll skip all the details except to say that in terms of git merge, this is a pretend merge with commit H as the merge base and commit I as the --theirs commit, with commit K as the --ours commit. This explains why "ours" and "theirs" seem reversed during rebase. (They sort of are, and sort of are not, and it's complicated overall.)
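Copying a single commit this way might look like the following sketch. It assumes the H, I, J, K layout from the drawings, with each commit adding one made-up file so that nothing conflicts.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one commit per call, each adding its own file.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
commit H
git switch -q -c develop; commit I; commit J
git switch -q main; commit K

orig=$(git rev-parse develop~1)        # commit I
git switch -q --detach main            # K becomes the "ours" side
git cherry-pick "$orig" > /dev/null    # apply the H-vs-I changes on top of K
copy=$(git rev-parse HEAD)             # the new commit, I'

# Same author and message, different parent, hence a different hash ID.
meta_orig=$(git log -1 --format='%an|%ae|%at|%s' "$orig")
meta_copy=$(git log -1 --format='%an|%ae|%at|%s' "$copy")
echo "I : $orig  ($meta_orig)"
echo "I': $copy  ($meta_copy)"
```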

Anyway, let's go back to our diagram:

          I--J   <-- develop (HEAD)
         /
...--G--H--K   <-- main

The commits we want Git to copy are I and J, and the place we want Git to put the copies is "after K".

The way we specify both of these things to git rebase is to run:

git rebase main

while we're "on" branch develop. Git runs an internal equivalent of:

git log main..develop

to find the hash IDs for commits I and J (they come out backwards if you do this, so Git actually uses git rev-list --reverse --topo-order and a bunch of other magic to fix this and to get other special tricks done). Now Git has the list of hash IDs, which it saves somewhere, in a file (because git rebase may need to quit and then start up again later).
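You can preview this list, and count it, before running any rebase. The sketch below assumes the same H, I, J, K layout; the commit helper is made up for illustration. I'd guess that a count like this is what a GUI reports as the number of commits it will apply, but that part is an assumption.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one commit per call, each adding its own file.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
commit H
git switch -q -c develop; commit I; commit J
git switch -q main; commit K
git switch -q develop

# Commits reachable from develop but not from main, newest first.
git log --oneline main..develop

# The same set, oldest first, which is the order a rebase copies them in.
order=$(git log --reverse --format=%s main..develop | tr '\n' ' ')
count=$(git rev-list --count main..develop)
echo "would copy $count commits: $order"
```

Note that the count depends on where the upstream name points in your own repository: if that name is a stale local branch, the list is computed against the stale commit.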

Having listed out the commits to copy, Git then does the internal equivalent of:

git switch --detach main

which gets us this as our picture:

          I--J   <-- develop
         /
...--G--H--K   <-- main, HEAD

The special name HEAD is no longer attached to any branch. Instead, it points directly to the commit we have checked out.

Git now runs git cherry-pick hash-of-I, or something more or less equivalent. This copies the H-vs-I changes into our working area and into Git's index and uses the updated files to run an internal git commit. This internal commit re-uses the author and log message information from commit I (via the saved hash ID, again), which makes new commit I':

          I--J   <-- develop
         /
...--G--H--K   <-- main
            \
             I'  <-- HEAD

Once that's done, Git runs git cherry-pick hash-of-J, which copies J to J':

          I--J   <-- develop
         /
...--G--H--K   <-- main
            \
             I'-J'  <-- HEAD

All the copying is now done, and Git just needs to yank the name develop off commit J and make it point to J' instead. To do that, Git uses the internal equivalent of git branch -f develop HEAD, resulting in:

          I--J   [abandoned]
         /
...--G--H--K   <-- main
            \
             I'-J'  <-- develop, HEAD

and then does an internal git switch develop (probably during the internal git branch -f step: they can be combined as git switch -C) to re-attach HEAD, giving:

          I--J   [abandoned]
         /
...--G--H--K   <-- main
            \
             I'-J'  <-- develop (HEAD)

That's the rebase we asked for; Git has done it.
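The four steps can be imitated by hand. This is only a sketch of the idea, not what git rebase literally executes: it skips the saved state, conflict handling, and the other special tricks, and it assumes the same H, I, J, K layout with a made-up commit helper.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one commit per call, each adding its own file.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
commit H
git switch -q -c develop; commit I; commit J
git switch -q main; commit K
git switch -q develop
old_tip=$(git rev-parse develop)

# Step 1: list the commits to copy, oldest first.
todo=$(git rev-list --reverse main..develop)
# Step 2: check out the target commit as a detached HEAD.
git switch -q --detach main
# Step 3: copy each commit, one by one.
for c in $todo; do
  git cherry-pick "$c" > /dev/null
done
# Step 4: move the branch name to the last copy and re-attach HEAD.
git switch -q -C develop

git log --oneline develop
```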

While I was trying to rebase X onto Y using GitHub Desktop ...

GitHub Desktop isn't command-line Git and may do its own things; only someone familiar with GitHub Desktop can say exactly what it will do. But if it doesn't automatically reach out to GitHub first, it winds up doing what regular Git does here:

git switch X
git rebase --onto Y <upstream>

For the git rebase main case, I did not use the --onto flag. We saw that in that case, Git did:

git log main..develop

Suppose we had this to start with, though:

            J   <-- feature2 (HEAD)
           /
          I   <-- feature1
         /
...--G--H--K   <-- main

We've decided that commit J is independent of commit I, and we'd like to copy J to a new and improved J' that comes after K.

If we run:

git rebase main

we'll copy commits I and J. That's too many. We only want to copy J. How do we tell Git to run:

git log feature1..feature2

so that it only finds J, and yet then run:

git switch --detach main

so that J gets copied after K? The answer is that we use --onto:

git rebase --onto main feature1

This separates out the "what to copy / not to copy" part—in this case, feature1..feature2—from the "where to put the copies" part.

The --onto name (or commit hash ID) specifies where to put the copies. That frees up the other name to be different: feature1, in our case.

Now Git will list out just the one commit, copy the one commit, and yank the branch name around:

            J   [abandoned]
           /
          I   <-- feature1
         /
...--G--H--K   <-- main
            \
             J'  <-- feature2 (HEAD)

and we get just what we want.
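Here is the --onto case as a runnable sketch, assuming the feature1 / feature2 layout drawn above. Each commit adds its own made-up file, so J really is independent of I.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one commit per call, each adding its own file.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
commit H
git switch -q -c feature1; commit I
git switch -q -c feature2; commit J
git switch -q main; commit K
git switch -q feature2

plain=$(git rev-list --count main..feature2)       # what "git rebase main" would copy
wanted=$(git rev-list --count feature1..feature2)  # what we actually want copied

# Copy feature1..feature2 (just J) and put the copy after main (K).
git rebase -q --onto main feature1

copied=$(git rev-list --count main..feature2)
echo "plain rebase would copy $plain, --onto copied $copied"
```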

A few other tricky things to be aware of

When you have merge commits in your own repository graph, like this:

          I--J
         /    \
...--G--H      M   <-- develop (HEAD)
         \    /
          K--L   <-- main

and you choose to run git rebase main at this point, your Git will discard the merge commit entirely. That is:

git log main..develop

will show commits I, J, and M, but commit M is a two-parent commit, i.e., a merge commit. The git cherry-pick command cannot copy a merge commit,2 and so git rebase does not even try. The rebase command simply leaves out the merge. The result is usually what you want anyway:

          I--J
         /    \
...--G--H      M   [abandoned]
         \    /
          K--L   <-- main
              \
               I'-J'  <-- develop (HEAD)

which, when viewed without seeing the abandoned merge and top line, looks like:

...--G--H--K--L   <-- main
               \
                I'-J'   <-- develop (HEAD)

(verify for yourself that these are the same drawings, provided you never look at the three commits we've abandoned!).
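The dropped merge shows up as a change in the commit count. The sketch below assumes the merge layout drawn above and a made-up commit helper.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one commit per call, each adding its own file.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
commit H
git switch -q -c develop; commit I; commit J
git switch -q main; commit K; commit L
git switch -q develop
git merge -q -m "M" main                  # merge commit M on develop

before=$(git rev-list --count main..develop)            # I, J, and M
git rebase -q main
after=$(git rev-list --count main..develop)             # I' and J'
merges_after=$(git rev-list --count --merges main..develop)

echo "before: $before commits, after: $after commits, merges left: $merges_after"
```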

Besides not copying any merge commit, git rebase will also omit other commits:

  • any commit that seems to have already been copied into the "upstream" is omitted (Git uses git patch-id to decide on these), and
  • depending on how you run the git rebase command, Git may make use of git merge-base --fork-point to choose which commit to use as the first commit to copy, rather than using the result of git log upstream..HEAD.

The fork-point stuff gets complicated; see, e.g., Git rebase - commit select in fork-point mode.
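The first of these omissions is one way the number of copied commits can differ from the number you counted. This sketch assumes commit I has already been cherry-picked into main; the commit helper is made up for illustration, and whether this is what happened in your repository is only a guess.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one commit per call, each adding its own file.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
commit H
git switch -q -c develop; commit I; commit J
git switch -q main; commit K
git cherry-pick develop~1 > /dev/null    # a copy of I now lives on main
git switch -q develop

# Counting by graph reachability alone finds I and J.
by_graph=$(git rev-list --count main..develop)
# Counting while ignoring patch-equivalent commits finds only J.
by_patch=$(git rev-list --count --right-only --cherry-pick main...develop)

git rebase -q main
copied=$(git rev-list --count main..develop)
echo "by graph: $by_graph, by patch-id: $by_patch, actually copied: $copied"
```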


2It can, however, fake it, using the -m option. The -m option tells git cherry-pick to pretend that the merge commit has just a single parent (whose "parent number" you specify) and Git then uses the parent as the pseudo-merge-base for the cherry-pick operation. The git rebase command never uses this mode, though, not even with --rebase-merges.
