What happens when we 'fetch upstream' a forked repository?

I am learning GitHub and came across the downstream/upstream concepts for forked repositories. I have read various docs and blog posts, but it is still not clear to me: what actually happens when we press 'fetch upstream'?

CodePudding user response：

git fetch upstream fetches ("downloads") all the changes from the remote repository named upstream and stores them locally under the upstream/ prefix. You can then refer to these local copies (e.g., check them out, set up tracking branches, cherry-pick commits, etc.) with this prefix. E.g., git checkout upstream/some-branch would check out the local copy of some-branch you just fetched from upstream.
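To make the prefix idea concrete, here is a minimal sketch that builds a throwaway local repository as a stand-in for upstream and then uses the fetched copy. The directory names and the Example identity are assumptions for illustration only; with a real fork, the remote would point at the original project's URL.

```shell
set -eu
tmp=$(mktemp -d)   # scratch directory; all names below are illustrative
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

# Stand-in for the upstream project, with a branch named some-branch.
git init -q "$tmp/original"
git -C "$tmp/original" symbolic-ref HEAD refs/heads/main
git -C "$tmp/original" commit -q --allow-empty -m "first commit"
git -C "$tmp/original" branch some-branch

git init -q "$tmp/ours"
cd "$tmp/ours"
git remote add upstream "$tmp/original"
git fetch -q upstream

git checkout -q upstream/some-branch     # detached HEAD at the fetched copy
detached=$(git rev-parse HEAD)
# Create our own some-branch that tracks the fetched copy.
git checkout -q -b some-branch --track upstream/some-branch
tracked=$(git rev-parse --abbrev-ref 'some-branch@{upstream}')
echo "some-branch tracks $tracked"
```

Checking out upstream/some-branch directly gives a detached HEAD, which is fine for looking around; the second checkout is the one to use if you intend to commit.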

CodePudding user response：

To understand this properly, you need to know the following about Git:

Git is all about commits. It's not about files (though commits contain files). It's not about branches (though branch names help us, and Git, find commits). It's really about the commits.
Commits are numbered. The numbers are huge, ugly, random-looking things expressed in hexadecimal; each commit gets a unique number, different from every other commit in every Git repository everywhere. If two different Git repositories have the same commit number in them, they have the same commit in them: the number is the commit, in a sense (though you have to have the commit itself: the number is just the key, in the key-value database, that Git uses to look up, i.e., find, the commit).
Besides branch names like main or master, dev, feature/tall, etc., Git has other names: tag names like v3.14, and things called remote-tracking names (Git actually calls these remote-tracking branch names, but I find that it makes more sense if you leave out the unnecessary word branch here). Each name gets to store one (1) hash ID. That's all we need, because commits also store hash IDs.
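As a small sketch of "each name stores one hash ID", the following creates a scratch repository with one commit, one branch name, and one tag name, then asks Git what each name holds. The repository location and the Example identity are assumptions made up for the demonstration.

```shell
set -eu
tmp=$(mktemp -d)   # scratch directory; all names below are illustrative
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

git init -q "$tmp/repo"
cd "$tmp/repo"
git symbolic-ref HEAD refs/heads/main    # call the first branch main on any Git version
git commit -q --allow-empty -m "first commit"
git tag v3.14                            # a tag name for the same commit

branch_id=$(git rev-parse main)          # the hash ID stored under the branch name
tag_id=$(git rev-parse v3.14)            # the hash ID stored under the tag name
echo "main  -> $branch_id"
echo "v3.14 -> $tag_id"
git show-ref                             # every name, with the hash ID it stores
```

Both names resolve to the same hash ID here because there is only one commit for them to find.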

When we clone a Git repository, we get all of the other repository's commits and none of their branches.¹ Instead of branch names, our Git takes the other repository's branch names and turns them into our remote-tracking names. If we call the other Git (the one we're cloning now) origin, which is the standard first remote name, their main turns into our origin/main, their dev turns into our origin/dev, and so on.
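A quick way to see this renaming is to clone a scratch repository that has two branches and list the names on our side. This is only a sketch: the theirs and ours directories and the Example identity are invented for the demonstration.

```shell
set -eu
tmp=$(mktemp -d)   # scratch directory; all names below are illustrative
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

# Stand-in for "their" repository, with branches main and dev.
git init -q "$tmp/theirs"
git -C "$tmp/theirs" symbolic-ref HEAD refs/heads/main
git -C "$tmp/theirs" commit -q --allow-empty -m "first commit"
git -C "$tmp/theirs" branch dev

git clone -q "$tmp/theirs" "$tmp/ours"

# Our own branch names live under refs/heads, remote-tracking names under refs/remotes.
local_names=$(git -C "$tmp/ours" for-each-ref --format='%(refname:lstrip=2)' refs/heads)
remote_names=$(git -C "$tmp/ours" for-each-ref --format='%(refname:lstrip=2)' refs/remotes)
echo "our branch names:"
echo "$local_names"
echo "our remote-tracking names:"
echo "$remote_names"
```

Their dev shows up only as origin/dev. The single branch name we do have, main, is one that git clone created for us as its last step, when it checked out a commit; it is ours, not a copy of theirs.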

What this means is that our branch names are ours. We don't have to use the same names as some other Git repository. We usually do, just for sanity, but we don't have to.

This also tells us, indirectly, what a "remote" is: a remote is a short name that stores a URL—the URL we're cloning from, for origin—and also provides a prefix for the remote-tracking names. The origin in origin/dev comes from the remote name origin.²
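The two things a remote stores can be read straight out of the repository's configuration. In this sketch the URL is a placeholder (nothing is contacted), and the repository itself is a scratch one created for the purpose.

```shell
set -eu
tmp=$(mktemp -d)   # scratch directory for an illustrative repository
git init -q "$tmp/ours"
cd "$tmp/ours"

# Placeholder URL: git remote add only records it, it does not connect.
git remote add origin https://example.com/their/repo.git

url=$(git config --get remote.origin.url)        # the URL saved under the short name
refspec=$(git config --get remote.origin.fetch)  # the rule that yields the origin/ prefix
echo "url:     $url"
echo "refspec: $refspec"
```

The second setting, by default +refs/heads/*:refs/remotes/origin/*, is the rule that turns their branch names into our origin/* names.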

When you run:

git fetch origin

your Git software, working in your repository, calls up some other Git software somewhere—at the URL stored under the name origin—and has it connect to some other repository using that URL. That other software (the "other Git", as it were) reads out their commits—specifically the hash IDs—and branch names and sends them to "our Git" (our software working in our repository). Our Git and their Git have a mini-conversation involving the hash IDs, so that our Git can see what commits they have, that we don't.

Our Git will then bring over any (new-to-us) commits they have, that we don't. That includes any commits we manually, carefully discarded from our Git repository because we found they were bad in some way:³ so in this respect, it's like having Git-sex with a Git that may be carrying some virus, and we'll just keep getting re-infected until they also ditch that bad commit. But mostly this is good since mostly we do want every commit they have, that we don't.
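One way to watch this happen, sketched with two scratch repositories: git ls-remote shows the names and hash IDs their Git reports, and comparing origin/main before and after the fetch shows our Git updating its memory of their branch. The directory names and the Example identity are assumptions for the demonstration.

```shell
set -eu
tmp=$(mktemp -d)   # scratch directory; all names below are illustrative
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

git init -q "$tmp/theirs"
git -C "$tmp/theirs" symbolic-ref HEAD refs/heads/main
git -C "$tmp/theirs" commit -q --allow-empty -m "first commit"
git clone -q "$tmp/theirs" "$tmp/ours"

# They gain a commit that we do not have yet.
git -C "$tmp/theirs" commit -q --allow-empty -m "second commit"
their_main=$(git -C "$tmp/theirs" rev-parse main)

cd "$tmp/ours"
git ls-remote origin                   # their names and hash IDs, as their Git reports them
before=$(git rev-parse origin/main)    # our memory of their main, now out of date
git fetch -q origin                    # get the commits we lack, then update origin/main
after=$(git rev-parse origin/main)
echo "origin/main before fetch: $before"
echo "origin/main after fetch:  $after"
```

Note that our own main does not move: the fetch only adds commits and updates origin/main.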

But: what about upstream? Well, there's a minor problem with this word, upstream, because Git uses this same word to mean something else.⁴ But in this case, upstream is the name GitHub in particular encourages people to use as the second remote in their Git repositories. We can have more than one remote!

Using git remote add upstream url, we create a second remote named upstream. After that:

git fetch upstream

uses the saved URL to call up some other Git, just like git fetch origin does. Whether the hosting site is GitHub or some other site, our Git and their Git have the same kind of conversation as before. Our Git will find out which commits they have that we don't, download those commits into our Git repository, and create or update remote-tracking names like upstream/main and upstream/dev. We'll get one upstream/* name for each branch name in the other Git at the URL stored under the name upstream.
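Here is the whole sequence as a sketch, using local scratch repositories in place of GitHub: original stands in for the project that was forked, fork.git for the fork, and ours for our clone of the fork. All of these names, and the Example identity, are made up for the demonstration; with a real fork, the argument to git remote add would be the original project's URL.

```shell
set -eu
tmp=$(mktemp -d)   # scratch directory; all names below are illustrative
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

# Stand-in for the original project, with branches main and dev.
git init -q "$tmp/original"
git -C "$tmp/original" symbolic-ref HEAD refs/heads/main
git -C "$tmp/original" commit -q --allow-empty -m "first commit"
git -C "$tmp/original" branch dev

# Stand-in for the fork on the hosting site, and our clone of that fork.
git clone -q --bare "$tmp/original" "$tmp/fork.git"
git clone -q "$tmp/fork.git" "$tmp/ours"

cd "$tmp/ours"
git remote add upstream "$tmp/original"   # the second remote
git fetch -q upstream
upstream_names=$(git for-each-ref --format='%(refname:lstrip=2)' refs/remotes/upstream)
echo "$upstream_names"
```

The listing should include upstream/main and upstream/dev, alongside the origin/* names the clone already had. Depending on the Git version, it may also include an upstream/HEAD entry.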

That's almost all there is to it. There is one particular point that trips people up here though. Suppose you git clone a repository, so that you now have origin/main and origin/feature/tall. But the origin repository is forked from some other repository, so you use git remote add to add your fork2 or upstream or whatever you want to call it, and then you run:

git fetch fork2

or whatever you called it. You now have fork2/main and fork2/feature/tall. So you have both origin/feature/tall and fork2/feature/tall.

You have not yet made your own feature/tall. You run:

git switch feature/tall

or:

git checkout feature/tall

expecting your Git to create your feature/tall from ... well, wait: are you expecting your new branch name, feature/tall, to spring from origin/feature/tall and use that as its upstream setting? Or are you expecting your new branch name, feature/tall, to spring from fork2/feature/tall and use that as its upstream? Or perhaps you need two feature/tall branches, one to go with origin/feature/tall and one to go with fork2/feature/tall.

You can't call both feature/tall. This means that if you do want two branch names, one for each remote-tracking name, you will be forced to break the usual "my name = my remote-tracking name, minus the remote" setup that you're used to. The bottom line is that as soon as you have two or more remotes, your Git life gets more complicated. There's no way around this: you must understand what remotes, and remote-tracking names, are and do.
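One way out, sketched below, is to stop relying on Git to pick a starting point and instead name it explicitly, giving the second branch a different name of our own. The name fork2-feature/tall is just an invented example, as are the scratch repositories and the Example identity.

```shell
set -eu
tmp=$(mktemp -d)   # scratch directory; all names below are illustrative
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

# Stand-ins: original is the other repository, fork.git is what we cloned as origin.
git init -q "$tmp/original"
git -C "$tmp/original" symbolic-ref HEAD refs/heads/main
git -C "$tmp/original" commit -q --allow-empty -m "first commit"
git -C "$tmp/original" branch feature/tall
git clone -q --bare "$tmp/original" "$tmp/fork.git"
git clone -q "$tmp/fork.git" "$tmp/ours"

cd "$tmp/ours"
git remote add fork2 "$tmp/original"
git fetch -q fork2

# Two branch names of our own, each with an explicitly chosen upstream.
git branch -q --track feature/tall origin/feature/tall
git branch -q --track fork2-feature/tall fork2/feature/tall

up1=$(git rev-parse --abbrev-ref 'feature/tall@{upstream}')
up2=$(git rev-parse --abbrev-ref 'fork2-feature/tall@{upstream}')
echo "feature/tall       -> $up1"
echo "fork2-feature/tall -> $up2"
```

After this, git switch feature/tall or git switch fork2-feature/tall simply selects an existing branch, so there is nothing left for Git to guess.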

¹You can modify this behavior somewhat at git clone time, and there are usually trash and/or dropped commits in repositories that get cleaned up by maintenance commands later, and that git clone normally doesn't copy. So this is just an approximation, useful for understanding things.

²As usual with Git, the process by which git fetch origin results in their dev becoming origin/dev is not straightforward at all. You can do all kinds of crazy things with this. For sanity, though, it's not wise to do anything weird and wild here in any normal user clone: just let their dev become your origin/dev.

³Perhaps, for instance, we carefully discarded an accidental commit that added a terabyte database that was clogging up the disk. Oops, here it is again!

⁴In particular, Git allows each branch name to store a single upstream name. Usually we'll set the upstream of branch br1 to origin/br1: the remote-tracking name at origin that corresponds to their branch name br1. That way our branch name br1 can easily refer to our origin/br1, which is our copy—our Git's memory—of their branch name br1.

This is not at all the same as a remote named upstream. If GitHub encouraged people to use, as the second remote name, fork2 or similar, that might help.
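To see the difference, this sketch sets the upstream of a branch in a repository that has no remote named upstream at all. The scratch repositories and the Example identity are assumptions for the demonstration; br1 follows the footnote's naming.

```shell
set -eu
tmp=$(mktemp -d)   # scratch directory; all names below are illustrative
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

# Stand-in for their repository, whose only branch is br1.
git init -q "$tmp/theirs"
git -C "$tmp/theirs" symbolic-ref HEAD refs/heads/br1
git -C "$tmp/theirs" commit -q --allow-empty -m "first commit"

git init -q "$tmp/ours"
cd "$tmp/ours"
git remote add origin "$tmp/theirs"
git fetch -q origin

git branch -q --no-track br1 origin/br1          # our br1, with no upstream set yet
git branch -q --set-upstream-to=origin/br1 br1   # store origin/br1 as the upstream of br1

upstream_of_br1=$(git rev-parse --abbrev-ref 'br1@{upstream}')
remotes=$(git remote)
echo "upstream of br1: $upstream_of_br1"
echo "remotes:         $remotes"
```

The branch has an upstream (origin/br1) even though the only remote is origin, which is why the two uses of the word are best kept apart.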