Git Pull results in new merge

Time:08-26

I'm working on a new machine and for some reason now when I pull this creates a new merge commit for the changes I am pulling down. This has never happened to me before and I'm doing nothing different to before.

In SourceTree for example if I pull with 'Commit merged changes immediately' deselected this results in uncommitted changes which are the changes that were already merged onto our develop branch in the PR.


If I do the pull via Terminal I get this message:

Merge branch 'develop' of bitbucket.org:redacted/ into develop
# Please enter a commit message to explain why this merge is necessary,
# especially if it merges an updated upstream into a topic branch.
#
# Lines starting with '#' will be ignored, and an empty message aborts
# the commit.

So this tells me it's not a SourceTree issue but I have no idea what's going on.

Any help would be appreciated but please ELI5 as I'm no expert on Git.

CodePudding user response:

Let's start with this: git pull literally means run git fetch, then run git merge or some other second command of my choice. You should therefore expect git pull to make merge commits. The surprise should be that you didn't get merge commits before! I personally believe that Git newbies should not be told to use git pull. It's not that running the two commands that git pull combines for you is any easier, but rather, that when you do use the two commands, you get a better sense of what's going on.

What's going on

The first thing you should have been told about Git (but probably weren't) is that Git is not about files, and not about branches: Git is all about commits. Files come into the picture because commits would be useless without them, so each commit holds files. Branches—or more precisely, branch names—come into the picture because the true names of individual commits are horrible, and humans really don't want to (and should not) use them directly, most of the time. But Git is all about the commits, so we need to know what a commit is, and what it does for us.

Each commit:

  • Is numbered, with a huge and random-looking number, expressed in hexadecimal. For instance, 9bf691b78cf906751e65d65ba0c6ffdcd9a5a12c is a commit in a Git repository for Git. Git calls this a hash ID, or more formally an object ID or OID.

    The number here is unique: no other commit, in any Git repository anywhere in the universe, is ever allowed to use that number again. That number was first used for that commit, and if you have a commit with that number in your repository, it's required that it be that commit. Git somehow manages to do this across all Git repositories that exist anywhere in the universe, without having any of the repositories talk to any other repository.

    This trick is literally mathematically impossible to keep up forever, and someday Git will fail, but by making the hash ID huge, we can hope that the day will be so many billions of years in the future that we won't care because we're all long since dead.

  • Is read-only. This is required to make the numbering scheme work.

  • Holds two things:

    • Every commit holds (indirectly) a full snapshot of every file. These files are stored in a special, compressed, read-only, Git-only, de-duplicated format. This makes every commit act like an archive (a tar or WinRAR or zip file) of all files. The de-duplication trick means that most commits, which mostly re-use most of the files from previous commits, are tiny. Only all-new content for some file requires any space at all. Well, that plus the ...

    • Metadata: every commit holds some metadata, or information about the commit itself. This includes things like the name and email address of the person who made the commit. It also includes other stuff, e.g., some date-and-time stamps.

Git stores all these commits in a big database. This is a simple key-value store, where the hash ID of the commit is the key. So Git needs the hash ID to retrieve the actual commit.

Git adds something to this metadata for Git's own purposes: every commit stores, in this metadata, a list of previous commit hash IDs. This list is usually just one entry long, which makes the commit an ordinary commit. One other kind of commit is a root commit: one with no previous hash ID, i.e., an empty list of parents. The very first commit in any Git repository is a root commit (and often the only one). The third kind of commit is a merge commit, where the list has two or more parent hash IDs. But most commits are ordinary, single-parent commits.
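If it helps to see this as code, here is a toy Python model of the idea. It is only a sketch: the names make_commit, objects and kind are made up for illustration, and the text that gets hashed is not Git's real object format, so these IDs will not match anything a real Git produces. What it does show is that the ID is computed from the whole commit, metadata and parent list included.

```python
import hashlib

objects = {}  # toy object database: hash ID -> commit record


def make_commit(snapshot, parents, author, timestamp):
    """Store a commit and return its hash ID (toy format, not Git's)."""
    record = {
        "snapshot": dict(snapshot),   # full copy of every file
        "parents": list(parents),     # hash IDs of earlier commits
        "author": author,
        "timestamp": timestamp,
    }
    # The ID is a hash of the commit's entire content, metadata included.
    text = repr(sorted(record.items()))
    oid = hashlib.sha1(text.encode()).hexdigest()
    objects[oid] = record
    return oid


def kind(oid):
    """Classify a commit by how many parents it records."""
    n = len(objects[oid]["parents"])
    return "root" if n == 0 else "ordinary" if n == 1 else "merge"


me = "me <me@example.com>"   # hypothetical author
a = make_commit({"f1.ext": "v1"}, [], me, 1000)
b = make_commit({"f1.ext": "v2"}, [a], me, 1060)
c = make_commit({"f1.ext": "v3"}, [a], me, 1061)
m = make_commit({"f1.ext": "v4"}, [b, c], me, 1120)

print(kind(a), kind(b), kind(m))   # root ordinary merge

# Same files and same parent, but a different time stamp: a different hash ID.
later = make_commit({"f1.ext": "v2"}, [a], me, 1061)
print(b != later)                  # True
```

Because the parent hash IDs are part of what gets hashed, a commit can only ever point at commits that already exist.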

When a commit stores the hash ID of some earlier commit, we say that the later commit points to the earlier commit. We can draw that as an arrow sticking out of a commit. Let's pick some ordinary commit's hash ID and just call it H for "hash" and draw that:

            <-H

This arrow points to the previous commit, which is also an ordinary commit and therefore has some hash ID and snapshot and parent. Let's call this one G:

        <-G <-H

Like H, G points to yet another, still-earlier commit:

... <-F <-G <-H

If H is the last commit in this chain—i.e., the commit we made most recently, just now—then the chain ends at commit H. It keeps going backwards from there, to commit G, then F, then E, and so on, until it finally reaches the very first commit—presumably A—which doesn't point backwards any further:

A--B--C--D--E--F--G--H

where we get a little lazy about drawing the arrows. (Humans sometimes wish they went forwards, instead of backwards, and we can be sloppy about drawing the commits because they're literally unchangeable and if we have the right ones, we know the arrows really only go backwards. It's obvious, after the fact, that they have to: we don't know what some future commit's hash ID will be—because of the time stamps, it literally depends on the exact second we wind up making the commit—but we know what some past commit's hash ID is.)

Branch names

The problem with this particular scheme is that to use Git this way, we'd literally have to memorize the hash ID for commit H. Git needs a hash ID to find a commit, in its big database full of commit objects and other supporting objects. It only really needs the last one, because H contains G's hash ID, so as long as we can tell Git how to find H, Git can use H to find G. Git can then use G to find F, and so on, all the way back to the beginning of the repository. But we need to write down, or otherwise memorize, H's hash ID.

But hold on: we have a computer. Let's add a second key-value database, next to the big one full of commits and other objects, where the keys are names, such as branch names, tag names, and other kinds of names. We'll have each name store exactly one hash ID. That's all we need, so we'll make it a crappy kind of database (well, that's what Git does anyway; this has become a problem and someday Git will have a better database here), where a branch name stores one hash ID, for the last commit.

Let's draw that in:

...--G--H   <-- main

We now have one branch name, main, pointing to commit H. We'll keep this arrow as an arrow, for reasons that should become obvious in a moment.
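Here is a minimal Python sketch of the two databases, assuming single-letter stand-ins for the real hash IDs and a made-up helper called history. Real Git stores far more than this; the point is only that one name plus the backwards links is enough to find everything.

```python
# Toy objects database: each commit records only its parent (None for the root).
parent_of = {"A": None, "B": "A", "C": "B", "D": "C",
             "E": "D", "F": "E", "G": "F", "H": "G"}

# Toy names database: each name stores exactly one hash ID.
names = {"main": "H"}


def history(name):
    """Follow the backwards-pointing arrows from a name's commit to the root."""
    commit = names[name]
    chain = []
    while commit is not None:
        chain.append(commit)
        commit = parent_of[commit]
    return chain


print(history("main"))
# ['H', 'G', 'F', 'E', 'D', 'C', 'B', 'A']
```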

Now, to make Git more useful, we'll add more branch names. Let's create a branch named dev or develop or whatever, or feature, or whatever we like, maybe just br1. This name must point to some commit: that's how branch names are defined, in Git. Which commit should we pick? Probably the latest one, H:

...--G--H   <-- br1, main

We now have a minor issue: which branch name are we using? For the moment it doesn't seem to matter, but let's fix this issue by attaching the special name HEAD, written in all uppercase, to exactly one branch name:

...--G--H   <-- br1, main (HEAD)

This means we are "on" branch main, while both names hold H's hash ID.

Exercise: Which branch are the commits on, main, or br1? Warning: this is a trick question!

Git's answer is: these commits are on both branches. All the commits up through H are on two branches.

Let's add a third branch name, br2:

...--G--H   <-- br1, br2, main (HEAD)

All three names select commit H, and all the commits are now on three branches. If you're starting to think that branch names are kind of silly and the idea of being "on a branch" doesn't seem to mean anything, well, you're exactly right.

The point of a branch name is to be able to quickly find one specific commit. Git calls this commit the tip commit of that branch. So H is the tip commit of all three branches: right now, all three names find commit H. We're about to change this though!

Let's now switch to branch br1, with git switch br1 or git checkout br1. The result is this:

...--G--H   <-- br1 (HEAD), br2, main

Nothing much has changed yet: we're still using commit H. We're just using it through the name br1.

You already know how to make a new commit, so let's do that right now: we'll change some files around and git add and run git commit and get a new commit. This new commit will get some new, unpredictable, random-looking hash ID, but we'll just call it I, using the next letter.

Because we were using commit H, Git will make new commit I point backwards to commit H. Then, having made the new commit and obtained its hash ID in the process, Git will stuff the new commit's hash ID into the current branch name. The result looks like this:

          I   <-- br1 (HEAD)
         /
...--G--H   <-- br2, main

Commit I is on branch br1. Commit H is ... well, it's still on all three branches.
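The same steps as a toy Python sketch, assuming letters for hash IDs and made-up helpers named switch, commit and branches_containing. A commit counts as "on" a branch when you can reach it by starting at that branch's tip and walking backwards.

```python
parents = {"G": [], "H": ["G"]}     # G is drawn as a root to keep this short
branches = {"main": "H", "br1": "H", "br2": "H"}
head = "main"                       # HEAD is attached to one branch name


def switch(name):
    global head
    head = name


def commit(new_id):
    """New commit points back at the current tip; the current name then moves."""
    parents[new_id] = [branches[head]]
    branches[head] = new_id


def ancestors(tip):
    """Every commit reachable from tip by following parent links, tip included."""
    seen, todo = set(), [tip]
    while todo:
        c = todo.pop()
        if c not in seen:
            seen.add(c)
            todo.extend(parents[c])
    return seen


def branches_containing(commit_id):
    return sorted(b for b, tip in branches.items() if commit_id in ancestors(tip))


switch("br1")
commit("I")
print(branches)                    # {'main': 'H', 'br1': 'I', 'br2': 'H'}
print(branches_containing("I"))    # ['br1']
print(branches_containing("H"))    # ['br1', 'br2', 'main']
```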

Let's make another new commit just for the heck of it:

          I--J   <-- br1 (HEAD)
         /
...--G--H   <-- br2, main

Name br1 now points to J, making J the tip commit of branch br1. J points back to I, I points back to H, and so on; and names br2 and main point to H, so that H is the tip commit of those two branches.

Now let's use git switch br2 to get "on" branch br2:

          I--J   <-- br1
         /
...--G--H   <-- br2 (HEAD), main

When we do this, we find that Git takes away the commit-J files (this is safe as they're all permanently archived in commit J) and replaces them with the commit-H files. This is one of the ways that Git makes commits and branch names useful: commit H is now the most recent br2 commit, so those are the files we have now.

When we make another commit in the usual way, our new commit gets a new hash ID, but we'll just call it K. New commit K points back to the commit we're using as we make it, i.e., to commit H:

          I--J   <-- br1
         /
...--G--H   <-- main
         \
          K   <-- br2 (HEAD)

We make yet another commit and get:

          I--J   <-- br1
         /
...--G--H   <-- main
         \
          K--L   <-- br2 (HEAD)

We have some major branching stuff going on now!

Merging

We can now run git merge. Let's copy the whole repository somewhere so that we can erase our work and start over, so that we can see how git merge is ... weird. But we'll start by merging br1 and br2.

Now, the goal of git merge is simple. The goal of a merge is to combine work. We did some work on br1, and we did some (probably different) work on br2. We want to combine this work. For concreteness, let's say we changed file f1.ext in I, f2.ext in J, f3.ext in K, and f1.ext (again, but differently this time) in L.

Every commit holds a snapshot of all files—an archive of the files as they were at the time we made the commit. So we could compare what's in commit J to what's in commit L to see what's different. If we do, we'll see that files f1 through f3 are different. But this doesn't really help. The difference between f3 in J and f3 in L is in fact what we changed, which is great, but the difference between f2 in J and f2 in L would back out what we did in J.

We could reverse the diff, comparing L to J. Now we'd get the right difference for f2, but for f3, the change would back out what we did in K! And in both cases, the difference in f1 doesn't work out right.

What we need is to somehow express what we changed in br1 as a diff from some point where all the files in both branches were the same. Is there such a point? Well, yes, of course. If you think about this for a moment, and look back at our drawings of commits above, you'll see that commits A through H are on both branches, and all the files in those commits are obviously the same on both branches because there's just the one commit A on both branches, and just the one commit B on both branches, and so on.

What we want to do—or more precisely, want Git to do—now becomes clear, or at least clear-er: we'll pick one shared commit, where the files are by definition the same in both branches, and have Git run git diff starting with this shared commit. We'd like the best such shared commit, and whether that's obvious or not, it turns out to be the rightmost (latest) such commit, i.e., commit H, just before the two branches forked.

So, we'll have Git run:

git diff --find-renames <hash-of-H> <hash-of-J>   # what changed on br1

The output from this diff will show our changes to files f1 and f2.

Then we'll separately have Git run:

git diff --find-renames <hash-of-H> <hash-of-L>   # what changed on br2

The output from this diff will show our changes to files f1 and f3.

Now we'll just have Git combine the changes.

That's it! It's pretty simple, no? Git just takes the line-by-line changes and combines them. As long as we changed different lines in file f1, Git can do this. (More precisely, the changes must not overlap, nor abut, unless they overlap exactly and do exactly the same thing.)

If all goes well, Git takes the combined changes and applies them to the snapshot from commit H. That combines our work: we keep our changes, and add theirs, whichever "side" of this we think of as ours and whichever one we think of as theirs.
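A rough Python sketch of "combine the changes", under a big simplifying assumption: every file keeps the same number of lines and each edit replaces a line in place. Real Git works on diff hunks, copes with inserted and deleted lines, and also treats abutting changes as conflicts, none of which this toy does. The names merge_lines and merge_snapshots are made up for the example.

```python
def merge_lines(base, ours, theirs):
    """Three-way merge of one file, assuming only in-place line edits."""
    merged = []
    for b, o, t in zip(base, ours, theirs):
        if o == t:            # both sides agree (same change, or no change)
            merged.append(o)
        elif o == b:          # only their side changed this line
            merged.append(t)
        elif t == b:          # only our side changed this line
            merged.append(o)
        else:                 # both sides changed it, differently
            raise ValueError(f"conflict: {o!r} vs {t!r}")
    return merged


def merge_snapshots(base, ours, theirs):
    """Apply merge_lines to every file; assumes all three hold the same files."""
    return {name: merge_lines(base[name], ours[name], theirs[name])
            for name in base}


H = {"f1.ext": ["a", "b", "c"], "f2.ext": ["x"], "f3.ext": ["y"]}
J = {"f1.ext": ["A", "b", "c"], "f2.ext": ["X"], "f3.ext": ["y"]}   # br1: f1, f2
L = {"f1.ext": ["a", "b", "C"], "f2.ext": ["x"], "f3.ext": ["Y"]}   # br2: f1, f3

print(merge_snapshots(H, J, L))
# {'f1.ext': ['A', 'b', 'C'], 'f2.ext': ['X'], 'f3.ext': ['Y']}
```

Note that swapping J and L gives the same result here: as long as nothing conflicts, it does not matter which side you call "ours".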

The term Git uses for commit H is that it is the merge base commit. The git merge-base command lets you see which commit Git chose, but for many merges, you don't need to bother. Git does this all on its own after all.

The tricky cases come up when Git can't combine the changes on its own. We'll skip over these hard cases here, to keep this answer shorter (well, shorter than it would be if we covered them).

So, let's do a git merge now. We start by switching to whichever branch we want the merge commit to be "on" when we're done: the final commit is made like a regular git commit, with Git updating the branch name. If we want the merge commit to be "on" br1, we need to use git switch or git checkout to get there:

git switch br1

gives us:

          I--J   <-- br1 (HEAD)
         /
...--G--H   <-- main
         \
          K--L   <-- br2

The name main is about to be in the way, so I'm not going to draw it, but remember it's still there, pointing to H. We'll now run:

git merge br2

The argument br2 tells Git to find commit L, as pointed-to by name br2. As with most things in Git, this merge isn't really about branches but rather about the commits, and Git needs to locate commit L.

Git already knows we're "on" commit J, because HEAD is attached to br1 and br1 points to J. Git now uses the backwards arrows from commits J and L to work backwards to find the best shared commit. (Note that this would find H as the shared commit even if we had more, or fewer, commits on one branch than on the other. The actual algorithm for finding H is called the Lowest Common Ancestor algorithm.) So Git finds commit H as the merge base.
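As a sketch of the idea (not Git's actual implementation, which is considerably more careful about speed), you can find the merge base by collecting everything reachable from each tip, intersecting the two sets, and keeping only the shared commits that are not behind some other shared commit. The helper names here are made up, and letters stand in for hash IDs.

```python
parents = {
    "G": [], "H": ["G"],         # G is drawn as a root to keep this short
    "I": ["H"], "J": ["I"],      # br1
    "K": ["H"], "L": ["K"],      # br2
}


def ancestors(tip):
    """Every commit reachable from tip by following parent links, tip included."""
    seen, todo = set(), [tip]
    while todo:
        c = todo.pop()
        if c not in seen:
            seen.add(c)
            todo.extend(parents[c])
    return seen


def merge_bases(ours, theirs):
    """Shared commits that are not an ancestor of any other shared commit."""
    shared = ancestors(ours) & ancestors(theirs)
    return sorted(c for c in shared
                  if not any(c != d and c in ancestors(d) for d in shared))


print(merge_bases("J", "L"))   # ['H']
print(merge_bases("H", "J"))   # ['H']
```

The function returns a list because some tangled histories have more than one "best" shared commit; in the simple fork drawn here there is exactly one.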

Git now runs the two git diffs and combines the diffs and applies the combined results to H's snapshot, to get a new snapshot. This new snapshot goes into a new commit, which we'll call M (for Merge, and also the next letter available, which is why I made two commits on each branch):

          I--J
         /    \
...--G--H      M   <-- br1 (HEAD)
         \    /
          K--L   <-- br2

The only thing special about commit M is that it has these two backwards arrows—or more precisely, that commit M contains, in its metadata, both hash IDs, for J and L. (The first entry in this list of two hash IDs matters somewhat, and it's always the commit that the branch name used to point to. But for now you probably don't have to care about this.)

This is a true merge. The result of a true merge is that the new commit's snapshot holds the merge result, as made by the merge code (or merge strategy), and the new commit points backwards to both of the previous tip commits. The other branch name—br2 in this case—doesn't change, and in fact you don't have to use a branch name at all, if you're willing to type in (or cut and paste) raw commit hash IDs, with the obvious risk of typing it in wrong.

Moreover, since the name br1 now lets Git find both commits J and L, by stepping backwards through both links from M, we're now safe to delete the name br2 if we want:

          I--J
         /    \
...--G--H      M   <-- br1 (HEAD)
         \    /
          K--L

Git needs commit L's hash ID to find commit L, but Git can get it from M, which Git can find from the name br1.

Non-merge "fast forward" operations

Let's go back to our saved repository (or otherwise get rid of M):

          I--J   <-- br1
         /
...--G--H   <-- main
         \
          K--L   <-- br2 (HEAD)

This time, instead of merging br1 and br2 with each other, let's run:

git switch main
git merge br1

so that we start with this as git merge fires up to find the merge base:

          I--J   <-- br1
         /
...--G--H   <-- main (HEAD)
         \
          K--L   <-- br2

Let's work backwards from both commits H and J, to find the best shared commit. If we look at J, it's only on br1, so that's no good. Let's move back one to I. It's still only on br1. Let's move back one more to H. Oh, look! Commit H is on both branches, so it's obviously the merge base.

Now let's run our two git diff commands:

git diff --find-renames <hash-of-H> <hash-of-H>   # what we changed on main

Wait a minute... this git diff output is empty! Of course it's empty, commit H is identical to commit H. It's always going to be empty if the merge base is the current commit.

The other diff won't be empty:

git diff --find-renames <hash-of-H> <hash-of-J>   # what they changed on br1

and the result of combining these two sets of differences will be the second set of differences. If we go ahead and combine them and make a new commit M, we get this:

          I--J   <-- br1
         /    \
...--G--H------M   <-- main (HEAD)
         \
          K--L   <-- br2

But the snapshot in M will exactly match the snapshot in J. Git says to itself: Gee, this kind of merge is stupid, why don't I cheat? And then Git cheats and does this instead:

          I--J   <-- br1, main (HEAD)
         /
...--G--H 
         \
          K--L   <-- br2

That is, git merge doesn't bother to do any actual merging, because the snapshot it will get is simply the snapshot in commit J. Instead, Git drags the name main "fast forward" to point to commit J directly.

This kind of "fast forward" operation—presumably named after the "fast forward" button on an old style tape recording device, which we now see as an icon on music and video players—is not really a merge at all. In fact, Git can do it for operations that don't involve running git merge (specifically, git push and git fetch operations). But when Git does it because you ran git merge, Git calls this a fast-forward merge.

Not all merges can be fast-forward merges. Now that main and br1 both point to commit J, a git merge br2 has to do a real merge.

There's one more case we should cover here, when git merge says already up to date and does nothing. Given:

          I--J   <-- br1 (HEAD)
         /
...--G--H   <-- main
         \
          K--L   <-- br2

we'd get this message—it's not exactly an error, just information—if we ran git merge main. Git would use its usual algorithm to find the merge base, which is commit H. But that merge base commit is behind the current commit J. Git can't move br1 forward to H: moving br1 to H would be backwards, and would "lose" commits I-J (without a name to find J's hash ID, commit J gets "lost", and J is how we find I). So you just get this message.

In summary

  • Merging combines work.
  • Some git merge operations have nothing to combine. These either do nothing ("already up to date") or use the fast-forward trick to avoid doing anything hard, if you let them.

Adding --no-ff to your git merge command tells git merge: No cheating! Make a real merge even if you could cheat! Adding --ff-only tells git merge: No doing any hard work! Merge only if you can cheat! We won't worry about why you might want either of these just yet, just note that the flags exist. If you don't use them, the default is cheat if possible, otherwise do the hard work.
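The three outcomes, and the effect of the two flags, fit in a few lines of toy Python. This is a sketch only: merge and is_ancestor are made-up helpers, the ff parameter with its values "allow", "never" and "only" is my own stand-in for the default, --no-ff and --ff-only, and "M" stands in for a real hash ID (so this toy can make just one merge commit).

```python
parents = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
}
branches = {"main": "H", "br1": "J", "br2": "L"}


def is_ancestor(old, new):
    """True if old is reachable from new by following parent links."""
    todo, seen = [new], set()
    while todo:
        c = todo.pop()
        if c == old:
            return True
        if c not in seen:
            seen.add(c)
            todo.extend(parents[c])
    return False


def merge(current, other, ff="allow"):
    """Decide what merging `other` into `current` does, and do it."""
    ours, theirs = branches[current], branches[other]
    if is_ancestor(theirs, ours):
        return "already up to date"
    if is_ancestor(ours, theirs) and ff != "never":
        branches[current] = theirs          # just slide the name forward
        return "fast-forward"
    if ff == "only":
        return "refused: cannot fast-forward"
    parents["M"] = [ours, theirs]           # first parent is the old tip
    branches[current] = "M"
    return "true merge"


print(merge("br1", "main"))             # already up to date
print(merge("main", "br1"))             # fast-forward
print(merge("main", "br2", ff="only"))  # refused: cannot fast-forward
print(merge("main", "br2"))             # true merge
print(branches)   # {'main': 'M', 'br1': 'J', 'br2': 'L'}
```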

Multiple Git repositories: git clone and git fetch

Let's look now at how git fetch works, starting with git clone. We have some repository somewhere, maybe on Bitbucket or GitHub or whatever, and we'd like to do some work. Git requires that we make a new repository, which we do with git clone. This:

  1. makes a new, empty directory for the repository: this is where our working tree will go, i.e., where we'll have files we can actually work on/with;
  2. runs git init in this new empty directory to create a .git subdirectory that contains the two databases;
  3. adds a remote named origin using git remote add origin url, which saves the URL for future use;
  4. runs any additional configuration steps (usually none: this depends on options you give to git clone);
  5. runs git fetch origin, which we'll explain in a moment; and
  6. runs an initial checkout to create one branch name and get you started with some files from the selected commit.

The git fetch step—step 5 here—copies all the commits (or all the useful, find-able ones anyway) from the other Git repository, whose URL is saved under the name origin. These commits, and all the necessary supporting objects, go into your new, initially-empty repository, in the objects database. It does not copy their branch names though. Your branch names, which will go into your names database here, are yours, not theirs. Instead, it takes each of their branch names, like main and develop for instance, and renames them.

The renamed names get origin/ shoved in front. Git calls these remote-tracking branch names, but I just call them remote-tracking names because they're not actually branch names. They are your Git's method of remembering the other repository's branch names. If they have a main, you now have an origin/main, that selects the same commit as their main. If they have a develop, you now have an origin/develop.

When the fetch step is done, you have one remote-tracking name for each of their branch names. You have no branch names of your own yet! But now we get to step 6. Your Git will create one new branch name. Which one? Well, that's up to you: you supply the name with your -b option, when you run git clone. If you didn't supply a -b option, your Git software asks their (origin's) Git software which branch name they recommend. Whatever name that is—usually main or master—that's the name your Git creates. The commit you get as the tip commit for this branch is the same commit that their name selected.

This is a pretty long way around to get your own main pointing to commit H, just like their main, which is now your origin/main, but that's what git clone does. If they had their main at commit H, well, you now have all their commits, and one branch of your own, also at commit H:

...--G--H   <-- main (HEAD), origin/main
         \
          I--J   <-- origin/develop

Note that you don't have a develop, but you do have an origin/develop that is just as good at finding commit J.
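Here is the fetch-then-checkout part of a clone as a toy Python sketch. The dictionary layout and the fetch helper are assumptions made for illustration; the only thing it is meant to show is that fetch copies commits and records their branch names under origin/, without creating any branch names of your own.

```python
def fetch(local, remote, remote_name="origin"):
    """Copy missing commits, then record their branch tips under remote_name/."""
    for oid, parent_list in remote["commits"].items():
        local["commits"].setdefault(oid, parent_list)
    for branch, tip in remote["branches"].items():
        local["tracking"][f"{remote_name}/{branch}"] = tip


theirs = {
    "commits": {"G": [], "H": ["G"], "I": ["H"], "J": ["I"]},
    "branches": {"main": "H", "develop": "J"},
}
mine = {"commits": {}, "branches": {}, "tracking": {}}   # fresh, as after git init

fetch(mine, theirs)
print(mine["tracking"])   # {'origin/main': 'H', 'origin/develop': 'J'}
print(mine["branches"])   # {}

# Step 6 of the clone: create one branch at the same commit as their name.
mine["branches"]["main"] = mine["tracking"]["origin/main"]
print(mine["branches"])   # {'main': 'H'}
```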

What you can't do is:

git switch origin/develop   # error!

You can't do this because origin/develop is a remote-tracking name, not a branch name. If you want your own develop, you tell your Git to make one, and then switch to that:

...--G--H   <-- main (HEAD), origin/main
         \
          I--J   <-- develop, origin/develop

and then git switch develop removes the commit-H files and puts in the commit-J files and you have:

...--G--H   <-- main, origin/main
         \
          I--J   <-- develop (HEAD), origin/develop

You may now make your own commits. Let's say you do make one:

...--G--H   <-- main, origin/main
         \
          I--J   <-- origin/develop
              \
               K   <-- develop (HEAD)

You figure you're all set... but while you have been making commit K, maybe someone else was making new commits too!

You should now run git fetch origin (or just git fetch) to find out. Your Git software will call up their (origin's) Git software and connect to their repository and see if they have any new commits. If so, your Git will download the new commits and supporting objects—just the new stuff; re-used files in new commits don't need any downloading—and add that to your objects database. Then your own Git software will update your own remote-tracking names, based on their updated branch names:

...--G--H   <-- main, origin/main
         \
          I--J--L   <-- origin/develop
              \
               K   <-- develop (HEAD)

Oh look! They've added a new commit to their develop, so that your origin/develop points to new commit L. But hang on a moment. Let's draw just commits J through L, like this:

       K   <-- develop (HEAD)
      /
...--J
      \
       L   <-- origin/develop

This is a branch, sort of. It's the kind of branch we mean when we mean "some commits that diverge". Sometimes the word branch means branch name, sometimes it means some set of commits, and in this particular case it means some set of commits that diverge and maybe need merging.

The git pull command

Now that we understand fetch and merge, we can finally understand git pull, in its normal "fetch and then merge" mode. We have:

       K   <-- develop (HEAD)
      /
...--J   <-- origin/develop

and we run git pull, which means:

  1. run git fetch with appropriate arguments; then
  2. run git merge with appropriate arguments

The git fetch step fetches from origin and results in:

       K   <-- develop (HEAD)
      /
...--J
      \
       L   <-- origin/develop

The git merge step merges with commit L, now found via origin/develop, which requires a true merge. So that's what Git does:

       K
      / \
...--J   M   <-- develop (HEAD)
      \ /
       L   <-- origin/develop

You get exactly what you're seeing: Atlassian Sourcetree just draws the graph vertically, with newer commits towards the top, rather than horizontally, with newer commits towards the right.

Before, when you hadn't made a new commit, you had:

...--J   <-- develop (HEAD), origin/develop

and you ran git pull and if they had any new commits at all it looked like this:

...--J   <-- develop (HEAD)
      \
       K   <-- origin/develop

and git merge could and did cheat and gave you:

...--J
      \
       K   <-- develop (HEAD), origin/develop

with no new merge commit. We can draw all this on one line, without any branching-and-merging, and Atlassian Sourcetree does so.
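Both situations, the old one and the one you are seeing now, can be replayed in a toy Python model. As before this is a sketch under assumptions: pull, is_ancestor and clone_at_j are made-up helpers, letters stand in for hash IDs, J is treated as a root to keep it short, and "M" stands in for the new merge commit's ID.

```python
def is_ancestor(commits, old, new):
    """True if old is reachable from new by following parent links."""
    todo, seen = [new], set()
    while todo:
        c = todo.pop()
        if c == old:
            return True
        if c not in seen:
            seen.add(c)
            todo.extend(commits[c])
    return False


def pull(local, remote, branch):
    """Toy git pull: fetch, then merge origin/<branch> into the current branch."""
    # Step 1, fetch: copy new commits and update the remote-tracking name.
    for oid, parent_list in remote["commits"].items():
        local["commits"].setdefault(oid, parent_list)
    theirs = remote["branches"][branch]
    local["tracking"][f"origin/{branch}"] = theirs
    # Step 2, merge: nothing to do, a fast-forward, or a true merge.
    ours = local["branches"][branch]
    if is_ancestor(local["commits"], theirs, ours):
        return "already up to date"
    if is_ancestor(local["commits"], ours, theirs):
        local["branches"][branch] = theirs
        return "fast-forward"
    local["commits"]["M"] = [ours, theirs]
    local["branches"][branch] = "M"
    return "merge commit"


def clone_at_j():
    return {"commits": {"J": []}, "branches": {"develop": "J"},
            "tracking": {"origin/develop": "J"}}


# Before: no commits of our own, and they added one commit.
remote = {"commits": {"J": [], "K": ["J"]}, "branches": {"develop": "K"}}
print(pull(clone_at_j(), remote, "develop"))    # fast-forward

# Now: we committed K locally, and they added L.
local = clone_at_j()
local["commits"]["K"] = ["J"]
local["branches"]["develop"] = "K"
remote = {"commits": {"J": [], "L": ["J"]}, "branches": {"develop": "L"}}
print(pull(local, remote, "develop"))           # merge commit
print(local["commits"]["M"])                    # ['K', 'L']
```

The only difference between the two runs is whether your develop had a commit that their develop did not.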

What can you do about this?

You can use git rebase instead of git merge. There are a lot of caveats to this, and this answer is already very long, so I'm not going to get into any of the details. Rebase is a sharp tool that must be used carefully and correctly: it's fundamentally easier to use git merge correctly (though it too is kind of a sharp tool, but git rebase effectively does multiple repeated internal merge operations, so that alone makes it more dangerous).

You can also program Git (well, "configure" it, which is the same thing in some ways) to use git rebase as its second git pull command, instead of using git merge: that is the pull.rebase setting, e.g., git config --global pull.rebase true. Or, you can set it up so that it uses git merge --ff-only, which we mentioned earlier without giving a motivation: that is git config --global pull.ff only. The main motivation here is that if git pull can do a fast-forward, that's almost always what you want it to do. So configuring it to do that gets you a much more useful version of the git pull command. (It probably should have been set up this way originally, way back in 2005, but it wasn't.)

My own preference, developed back in the mid aughts after getting burned multiple times by the then-somewhat-buggy git pull, is to not run git pull at all. I run git fetch, see what git fetch fetched, and then choose my commands. I'm slowly warming to a git pull with ff-only set, at least for some projects, though.
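One way to think about "see what git fetch fetched" is to count which commits only you have and which only they have. The Python below is a toy sketch with made-up helper names, using letters for hash IDs; real Git can report the same two counts with git rev-list --left-right --count develop...origin/develop, and git status shows them as "ahead" and "behind".

```python
def ancestors(commits, tip):
    """All commits reachable from tip, tip included."""
    seen, todo = set(), [tip]
    while todo:
        c = todo.pop()
        if c not in seen:
            seen.add(c)
            todo.extend(commits[c])
    return seen


def ahead_behind(commits, mine, theirs):
    """Count commits only we have and commits only they have."""
    a, b = ancestors(commits, mine), ancestors(commits, theirs)
    return len(a - b), len(b - a)


def what_merge_would_do(commits, mine, theirs):
    ahead, behind = ahead_behind(commits, mine, theirs)
    if behind == 0:
        return "already up to date"
    if ahead == 0:
        return "fast-forward"
    return "true merge (or rebase) needed"


commits = {"J": [], "K": ["J"], "L": ["J"]}     # J drawn as a root for brevity
print(ahead_behind(commits, "K", "L"))          # (1, 1)
print(what_merge_would_do(commits, "K", "L"))   # true merge (or rebase) needed
print(what_merge_would_do(commits, "J", "L"))   # fast-forward
```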

CodePudding user response:

I suspect that your old machine had pull.rebase true and the new machine doesn't.

I suggest running git config --list --show-origin on your old machine and examining its config files for settings that you want to migrate to your new machine.
