I am learning GitHub and came across the downstream/upstream concepts for forked repositories. I went through various documentation and blog posts, but I still could not work out what actually happens when we press 'Fetch upstream'.
CodePudding user response:
`git fetch upstream` fetches ("downloads") all the changes from the remote repository `upstream` and stores them locally under the `upstream/` prefix. You can then refer to these local copies (e.g., check them out, set up tracking branches, cherry-pick commits, etc.) with this prefix. E.g., `git checkout upstream/some-branch` would check out the local copy of `some-branch` you just fetched from `upstream`.
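As a rough illustration of the above, the following sketch builds two throwaway repositories in a temporary directory and fetches from one into the other. The directory names `theirs` and `ours` and the `demo` identity are assumptions made up for the example; only `upstream` and `some-branch` come from the answer.

```shell
set -eu
# Scratch area; every path below is hypothetical.
tmp=$(mktemp -d)

# A stand-in for the remote repository, with a branch called some-branch.
git init -q "$tmp/theirs"
git -C "$tmp/theirs" symbolic-ref HEAD refs/heads/some-branch
git -C "$tmp/theirs" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "their work"

# Our repository, with that stand-in registered under the name "upstream".
git init -q "$tmp/ours"
git -C "$tmp/ours" remote add upstream "$tmp/theirs"

# Download their commits and record them locally as upstream/<branch>.
git -C "$tmp/ours" fetch -q upstream

# The fetched copy is addressable through the prefix.
git -C "$tmp/ours" branch -r
git -C "$tmp/ours" checkout -q upstream/some-branch   # detached HEAD at their commit
```

Checking out `upstream/some-branch` directly leaves you on a detached HEAD, since the name is a local copy of their branch rather than a branch of your own.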
CodePudding user response:
To understand this properly, you need to know the following about Git:
- Git is all about commits. It's not about files (though commits contain files). It's not about branches (though branch names help us, and Git, find commits). It's really about the commits.
- Commits are numbered. The numbers are huge, ugly, random-looking things expressed in hexadecimal; each commit gets a unique number, different from every other commit in every Git repository everywhere. If two different Git repositories have the same commit number in them, they have the same commit in them: the number is the commit, in a sense (though you have to have the commit itself: the number is just the key, in the key-value database, that Git uses to look up, i.e., find, the commit).
- Besides branch names like `main` or `master`, `dev`, `feature/tall`, etc., Git has other names: tag names like `v3.14`, and things called remote-tracking names (Git actually calls these remote-tracking branch names, but I find that it makes more sense if you leave out the unnecessary word branch here). Each name gets to store one (1) hash ID. That's all we need, because commits also store hash IDs.
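To make the "each name stores one hash ID" point concrete, here is a small sketch using a scratch repository. The repository path and the `demo` identity are assumptions for the example; `main` and `v3.14` are the names used above.

```shell
set -eu
tmp=$(mktemp -d)            # hypothetical scratch location
git init -q "$tmp/repo"
cd "$tmp/repo"
git symbolic-ref HEAD refs/heads/main
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first commit"
git tag v3.14               # a lightweight tag pointing at the current commit

# Each name resolves to exactly one hash ID.
branch_id=$(git rev-parse main)
tag_id=$(git rev-parse v3.14)
echo "main  -> $branch_id"
echo "v3.14 -> $tag_id"
```

Both names print the same long hexadecimal number here, because both were made to point at the same commit.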
When we clone a Git repository, we get all of the other repository's commits and none of their branches.[1] Instead of branch names, our Git takes the other repository's branch names and turns them into our remote-tracking names. If we call the other Git (the one we're cloning now) `origin`, which is the standard first remote name, their `main` turns into our `origin/main`, their `dev` turns into our `origin/dev`, and so on.
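A quick way to see this renaming is to clone a small stand-in repository and list the names on our side. The paths `theirs` and `ours` and the `demo` identity are assumptions for this sketch.

```shell
set -eu
tmp=$(mktemp -d)

# Stand-in for the repository being cloned, with branches main and dev.
git init -q "$tmp/theirs"
git -C "$tmp/theirs" symbolic-ref HEAD refs/heads/main
git -C "$tmp/theirs" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
git -C "$tmp/theirs" branch dev

git clone -q "$tmp/theirs" "$tmp/ours"

# Remote-tracking names: their main and dev show up as origin/main and origin/dev.
git -C "$tmp/ours" branch -r
# Our own branch names: only main, which clone creates as its last step.
git -C "$tmp/ours" branch
```

Note that there is no local `dev` after the clone: their `dev` exists on our side only as `origin/dev` until we create a branch of our own.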
What this means is that our branch names are ours. We don't have to use the same names as some other Git repository. We usually do, just for sanity, but we don't have to.
This also tells us, indirectly, what a "remote" is: a remote is a short name that stores a URL (the URL we're cloning from, for `origin`) and also provides a prefix for the remote-tracking names. The `origin` in `origin/dev` comes from the remote name `origin`.[2]
When you run:
git fetch origin
your Git software, working in your repository, calls up some other Git software somewhere (at the URL stored under the name `origin`) and has it connect to some other repository using that URL. That other software (the "other Git", as it were) reads out their commits (specifically the hash IDs) and branch names and sends them to "our Git" (our software working in our repository). Our Git and their Git have a mini-conversation involving the hash IDs, so that our Git can see which commits they have that we don't.
Our Git will then bring over any (new-to-us) commits they have that we don't. That includes any commits we manually, carefully discarded from our Git repository because we found they were bad in some way,[3] so in this respect it's like having Git-sex with a Git that may be carrying some virus: we'll just keep getting re-infected until they also ditch that bad commit. But mostly this is good, since mostly we do want every commit they have that we don't.
But: what about `upstream`? Well, there's a minor problem with this word, upstream, because Git uses this same word to mean something else.[4] But in this case, `upstream` is the name GitHub in particular encourages people to use as the second remote in their Git repositories. We can have more than one remote!
Using `git remote add upstream url`, we create a second remote named `upstream`. After that:
git fetch upstream
uses the saved URL to call up some other Git, just like `git fetch origin` does. Whether the hosting site is GitHub or some other site, our Git and their Git have the same kind of conversation as before. Our Git will find out which commits they have that we don't, download those commits into our Git repository, and create or update remote-tracking names like `upstream/main` and `upstream/dev`. We'll get one `upstream/*` name for each branch name in the other Git at the URL stored under the name `upstream`.
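The whole sequence can be tried locally without any hosting site. In this sketch, `original` stands in for the project that was forked and `fork.git` for the fork; those names, the local paths used in place of URLs, and the `demo` identity are all assumptions for illustration.

```shell
set -eu
tmp=$(mktemp -d)

# Stand-in for the original project, with branches main and dev.
git init -q "$tmp/original"
git -C "$tmp/original" symbolic-ref HEAD refs/heads/main
git -C "$tmp/original" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
git -C "$tmp/original" branch dev

# Stand-in for our fork on the hosting site, then our clone of that fork.
git clone -q --bare "$tmp/original" "$tmp/fork.git"
git clone -q "$tmp/fork.git" "$tmp/ours"

# New work lands in the original project after we forked.
git -C "$tmp/original" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "newer work"

cd "$tmp/ours"
git remote add upstream "$tmp/original"   # second remote, next to origin
git fetch -q upstream                     # creates upstream/main and upstream/dev
git remote -v
git branch -r
git log --oneline main..upstream/main     # commits upstream has that our main lacks
```

Nothing in our own `main` changes here; the fetch only adds commits to the repository and updates the `upstream/*` names. Merging or rebasing onto `upstream/main` is a separate step.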
That's almost all there is to it. There is one particular point that trips people up here though. Suppose you `git clone` a repository, so that you now have `origin/main` and `origin/feature/tall`. But the `origin` repository is forked from some other repository, so you use `git remote add` to add your `fork2` or `upstream` or whatever you want to call it, and then you run:
git fetch fork2
or whatever you called it. You now have `fork2/main` and `fork2/feature/tall`. So you have both `origin/feature/tall` and `fork2/feature/tall`.
You have not yet made your own `feature/tall`. You run:
git switch feature/tall
or:
git checkout feature/tall
expecting your Git to create your `feature/tall` from ... well, wait: are you expecting your new branch name, `feature/tall`, to spring from `origin/feature/tall` and use that as its upstream setting? Or are you expecting your new branch name, `feature/tall`, to spring from `fork2/feature/tall` and use that as its upstream? Or perhaps you need two `feature/tall` branches, one to go with `origin/feature/tall` and one to go with `fork2/feature/tall`.
You can't call both `feature/tall`. This means that if you do want two branch names, one for each remote-tracking name, you will be forced to break the usual "my name = my remote-tracking name, minus the remote" setup that you're used to. The bottom line is that as soon as you have two or more remotes, your Git life gets more complicated. There's no way around this: you must understand what remotes, and remote-tracking names, are and do.
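One way to handle the two-remote case is to name the starting point explicitly. The sketch below reproduces the situation with local stand-in repositories; the paths `a` and `b.git`, the `demo` identity, and the second local branch name `fork2-feature-tall` are hypothetical choices for the example. It also assumes the `checkout.defaultRemote` setting is not configured, since that setting tells Git which remote to prefer.

```shell
set -eu
tmp=$(mktemp -d)

# Two stand-in repositories that both have a branch named feature/tall.
git init -q "$tmp/a"
git -C "$tmp/a" symbolic-ref HEAD refs/heads/main
git -C "$tmp/a" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
git -C "$tmp/a" branch feature/tall
git clone -q --bare "$tmp/a" "$tmp/b.git"

git clone -q "$tmp/a" "$tmp/ours"         # a becomes origin
cd "$tmp/ours"
git remote add fork2 "$tmp/b.git"
git fetch -q fork2

# With two candidates, the short form gives Git nothing to choose by.
if git checkout -q feature/tall 2>/dev/null; then guessed=yes; else guessed=no; fi
echo "Git guessed a start point: $guessed"

# Naming the start point removes the ambiguity; the second branch needs
# a different local name (fork2-feature-tall is just an illustration).
git checkout -q -b feature/tall --track origin/feature/tall
git checkout -q -b fork2-feature-tall --track fork2/feature/tall
git branch -vv
```

The output of `git branch -vv` shows each local branch next to the remote-tracking name it uses as its upstream, which makes the broken "same name minus the remote" convention visible for the second branch.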
[1] You can modify this behavior somewhat at `git clone` time, and there are usually trash and/or dropped commits in repositories that get cleaned up by maintenance commands later; `git clone` normally doesn't copy those. So this is just an approximation, useful for understanding things.
[2] As usual with Git, the process by which `git fetch origin` results in their `dev` becoming `origin/dev` is not straightforward at all. You can do all kinds of crazy things with this. For sanity, though, it's not wise to do anything weird and wild here in any normal user clone: just let their `dev` become your `origin/dev`.
[3] Perhaps, for instance, we carefully discarded an accidental commit that added a terabyte database that was clogging up the disk. Oops, here it is again!
[4] In particular, Git allows each branch name to store a single upstream name. Usually we'll set the upstream of branch `br1` to `origin/br1`: the remote-tracking name at `origin` that corresponds to their branch name `br1`. That way our branch name `br1` can easily refer to our `origin/br1`, which is our copy (our Git's memory) of their branch name `br1`.
This is not at all the same as a remote named `upstream`. If GitHub encouraged people to use, as the second remote name, `fork2` or similar, that might help.
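To see the branch-level meaning of "upstream" in isolation, here is a sketch with a single remote and no remote named `upstream` at all. The paths and the `demo` identity are assumptions; `br1` and `origin/br1` are the names from the footnote.

```shell
set -eu
tmp=$(mktemp -d)

# Stand-in for the other repository, whose only branch is br1.
git init -q "$tmp/theirs"
git -C "$tmp/theirs" symbolic-ref HEAD refs/heads/br1
git -C "$tmp/theirs" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"

git clone -q "$tmp/theirs" "$tmp/ours"
cd "$tmp/ours"

# Clone already set this for br1; setting it again explicitly is harmless.
git branch --set-upstream-to=origin/br1 br1

# Ask Git which remote-tracking name is br1's upstream.
br1_upstream=$(git rev-parse --abbrev-ref 'br1@{upstream}')
echo "upstream of br1: $br1_upstream"

# The list of remotes contains origin only.
remotes=$(git remote)
echo "remotes: $remotes"
```

Here `br1` has an upstream (`origin/br1`) even though the only remote is `origin`, which shows that the two uses of the word are unrelated.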