Workflow when working on GitHub fork for PR requests

Time:07-08

Know there is a lot of Q/A on this, but I'm still very uncertain on how to proceed.


(Based on a real story!)

Say there is a public project named unicorns by danny. To make pull requests, one is to work from one's own fork of the project.

Basic setup

So one does a fork on GitHub's website to one's own profile.

Then locally get a clone of that one and set it up with the project to get updates:

$ git clone https://github.com/MyUser/unicorns.git
$ cd unicorns
$ git remote add danny git@github.com:danny/unicorns.git

To get an up to date local copy:

$ git checkout main
$ git pull danny main

Creating pull requests

Then one gets to work. Starting with creating a branch:

$ git checkout -b my_work_1

# Do some changes and commit locally
$ git commit -am "I changed this"

# Push the changes to one's copy on GitHub
$ git push -u origin my_work_1

Then proceed with creating a PR from the GitHub website.

Then do a second PR they wanted right away:

# Check out main so as not to include my_work_1 in this branch:
$ git checkout main

# Create new branch for second work
$ git checkout -b my_work_2

# Do some changes and commit locally
$ git commit -am "I changed this as well"

# Push the changes to one's copy on GitHub
$ git push -u origin my_work_2

Then proceed with creating a PR from the GitHub website.


Trouble starts

So far so good. (I hope, lol)

The PR's are accepted and merged into main of the project.

But then next day:

$ git checkout main
$ git pull danny main

Now it says my main branch is ahead by 40 commits. In my local tree I see something like:

  main-remotes/danny/unicorns Last thing done
  Some commit
  Some commit
: .. 35 more
  My commit work 2  (No 39)
  My commit work 1  (No 40)
|/ Branch my_work_2
|/ Branch my_work_1
  remotes/origin/main Some commit
  Some commit
:

There seems to be as many solutions as questions on this. I am wondering what is going on and how to proceed. Have read a lot of the Q/A's on the topic etc.

I have a myriad of questions but the gist of some:

  1. Did I do something wrong above?

  2. Are my two local branches my_work_1 and my_work_2 the reason for the message? Haven't those been merged (or smash merged as some said) into the main of the real repository?

  3. Do I have to delete those branches before doing a pull?

    • git branch -d my_work_1
    • git branch -d my_work_2
  4. What if I create a branch where I do some work that I want to push at a later date, but still want to push other changes? Do I have to tell git to ignore these somehow?

  5. Is it in general an OK workflow (once I understand how to handle the above)?

Suspect I have to update my fork on GitHub to the main of where it was forked from. Perhaps that is the issue. If so how? Simply push main?

CodePudding user response:

  1. Did I do something wrong above?

No.

  2. Are my two local branches my_work_1 and my_work_2 the reason for the message?

Which message? Do you mean "Do these explain the git log output?" The answer to that is both no and yes, or more precisely, yes-but-only-partly. See (much) more below.

Haven't those been merged (or smash merged as some said) into the main of the real repository?

Merged, squash-merged, or rebase-and-merged, yes. These three terms are the three ways GitHub offer the holder of the "upstream" repository of your fork, i.e., the original unicorns. GitHub give danny these three options. He has one big green button with a pulldown next to it; using the pulldown, he can select MERGE, REBASE AND MERGE, or SQUASH AND MERGE. Depending on which option he uses, you'll see different effects.

  3. Do I have to delete [my_work_1 and my_work_2] before doing a pull?

No. You can delete them at any time. Those two names simply give you an easy way to find your commits' hash IDs. When you stop wanting to find those hash IDs, delete the names.

  4. What if I create a branch where I do some work that I want to push at a later date, but still want to push other changes? Do I have to tell git to ignore these somehow?

You can do whatever you like here. The trick is just to know what you're seeing: see below.

  5. Is it in general an OK workflow (once I understand how to handle the above)?

Yes.

What you're seeing is (a representation of) reality

A Git repository—any Git repository—mostly contains commits. The commits are, by and large, the interesting thing. Git stores these commits in a big database that Git calls its object database or object store: this thing is a simple key-value database, where the keys are raw hash IDs. You'll see the commit hash IDs, or abbreviated versions of them, in git log output.

Besides commits, there are three other types of objects in the database, but we tend not to interact with them much, and hardly ever need their hash IDs. We do occasionally need to use those raw hash IDs to have Git pull out some particular commit of interest, though. That's because they are, in effect, the true names of the commits. Every commit has a unique hash ID, and that hash ID means that commit, and only that commit. The problem with these hash IDs is that they are big, ugly, and seemingly-random.

Besides the commits, then, the repository proper also contains a name-to-hash-ID lookup database: another simple key-value store, where the key is the name and the value is a hash ID. The names are branch names like main, tag names like v1.2, and remote-tracking names like origin/main or main-remotes/danny/unicorns. The values stored under these names are hash IDs, with each name storing exactly one hash ID: one is enough.

(I say "repository proper" here to distinguish these two databases plus the ancillary files that Git needs from your working tree files, which some people like to call "part of the repository", but which I like to say aren't in the repository, because, well, they aren't! Also, the names in this database have full-name spellings: for instance, main is really refs/heads/main, it's just been abbreviated for display to drop the refs/heads/ part, which is the part that makes it a branch name. The remote-tracking names all start with refs/remotes/, which is what makes them remote-tracking names. Tags, if you have any, start with refs/tags/, which ... well, you get the idea, I hope.)
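To see the full-name spellings for yourself, here is a small sketch. It assumes a throwaway repository in a temporary directory; the identity values, the commit message, and the repo directory name are placeholders of mine, not anything from the question.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
# Placeholder identity so that git commit works on a fresh machine
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/main   # make sure the first branch is called main
git commit -q --allow-empty -m "first commit"
git tag v1.2

# The full spelling behind the short names
full_branch=$(git rev-parse --symbolic-full-name main)
full_tag=$(git rev-parse --symbolic-full-name v1.2)
echo "$full_branch $full_tag"

# The whole names database: each full name with the one hash ID it stores
git for-each-ref --format='%(refname) %(objectname)'
```

In a clone with remotes configured, the same git for-each-ref listing would also show the refs/remotes/ names.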

Every commit has two parts: a full snapshot of every source file, and some metadata, or information about the commit itself. The git log command normally uses only the metadata to show you what has happened, while a command like git checkout or git switch needs the saved snapshot to populate a working tree.

One of the keys to making Git work is that the metadata for any one commit contains a list of previous commit hash IDs. This list is most often just one element long, giving us an ordinary commit: one that's neither a merge commit, nor the initial commit. We call these the parents (or usually, parent, singular) of the commit. This is how git log can show history.

History, in a Git repository, is nothing more or less than the set of commits we find in the repository. Each commit "points backwards" to a previous commit—or, for merge commits, to two or more previous commits—and that's why a name can just store one commit hash ID! We can draw this, using single uppercase letters to stand in for commit hash IDs, like this, with newer commits towards the right:

... <-F <-G <-H   <--branch

Here, H is the latest commit "on the branch". The name branch "points to" (contains the hash ID of) the commit whose hash we're just calling H. Commit H contains a snapshot and metadata, and its metadata points to (contains the hash ID of) earlier commit G. Commit G, being a commit, points to earlier commit F, which continues pointing backwards.

Your git log --all --decorate --oneline --graph output (or similar transcription) does the same thing as I just did, but draws the commits vertically, with newer commits towards the top. Here's another one—some snippets from an actual repository of mine:

* dcebed7 (HEAD -> main) reader, scanner: add whitespace as a token
* acf005a reader, scanner: handle more of the grammar
* 7409df3 file: provide Is() for file errors

The branch name finds the latest commit, and from there, Git works backwards.
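If you'd like to watch the backwards-pointing links directly, this sketch builds a two-commit throwaway repository (the commit messages G and H just mirror the drawing above; the identity values are placeholders) and prints the raw metadata of the latest commit.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/main
git commit -q --allow-empty -m "G"
git commit -q --allow-empty -m "H"

# The raw commit object: tree (the snapshot), parent, author, committer, message
git cat-file -p main

# The name main stores H's hash ID; H's metadata stores G's hash ID
h=$(git rev-parse main)
g=$(git rev-parse main^)
parent_in_h=$(git cat-file -p "$h" | sed -n 's/^parent //p')
echo "main      -> $h"
echo "H's parent -> $parent_in_h"
```

The very first commit, G here, has no parent line at all, which is how git log knows when to stop.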

Every commit hash ID is unique. That is, every time you make a new commit, the new commit gets a new, unique, never-before-used-in-any-Git-repository, never-can-be-used-again hash ID.[1]

No commit, once made, can ever be changed. In fact, that's true for all of Git's internal objects. The hash ID is not at all random. Instead, it is the output of some hashing function, preferably a cryptographically-strong one (currently mostly SHA-1, which is no longer quite so strong). If you copy a commit object out of the objects database, change even a single bit anywhere in that object, and put it back, you get a new and different hash ID for a new and different commit. The old commit remains in the database and one can still have Git pull it up by its hash ID. The new hash ID finds the new commit, the old one finds the old commit, and both commits now exist.

We do this sort of thing—copy a commit while changing something—from time to time, and that's what you're running into here.
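One everyday way to "copy a commit while changing something" is git commit --amend. That is my choice of illustration here, not the only such command. A minimal sketch in a throwaway repository, with placeholder file name, messages, and identity:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/main
echo hello > file.txt
git add file.txt
git commit -q -m "add fiel"            # oops, typo in the message
old=$(git rev-parse HEAD)

# "Fixing" the commit really makes a corrected copy and moves the name to it
git commit -q --amend -m "add file"
new=$(git rev-parse HEAD)

echo "old: $old"
echo "new: $new"
# The original is still in the object database, findable by its hash ID
git cat-file -t "$old"
git log -1 --format=%s "$old"
```

Only the log message changed, yet the hash ID is different, and the old commit can still be pulled up by its old hash ID.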


[1] This cannot work forever and someday Git will break, but the sheer size of the hash output (and its cryptographic strength) help to put that day off as long as possible: long enough, we hope, that nobody will care.


Clones, forks, and generally distributing repositories

When you clone a Git repository using git clone from the command line, you are:

  1. creating a new, empty Git repository: one with no commits, no branches, nothing inside it;
  2. having your Git software reach out to some other Git software: your Git software uses your Git repository and theirs uses theirs and I'll just call these "your Git" and "their Git";
  3. having their Git list out all their branch names and hence their most-recent commits' hash IDs; and
  4. having your Git use this information to get all their commits: the most recent, the parents, the grandparents, ad infinitum until it gets all the way back to the very first commit ever.

You now have a repository with all their commits, but no branches. That's okay! Your Git is going to find (their, now yours as well) commits not by their branch names, but rather by your remote-tracking names. Your Git now takes each of their branch names, such as main, and slaps the remote name origin in front. (Technically your Git takes the full name, refs/heads/main, and changes it to the full name refs/remotes/origin/main, but with Git normally displaying this with refs/heads/ and refs/remotes/ stripped off, it looks like your Git is adding origin/.)

You now have a remote-tracking name for every one of their branch names, and because remote-tracking names work just as well as branch names, you have a way to find all the commits, just like they do.[2]

Last, your git clone creates one (1) new branch name—one refs/heads/-style name—in your repository, to remember a single latest commit. Which name does your Git use? The one you specified with your -b option—or, if you didn't specify a -b option, the name the other Git software recommends (which mostly winds up being main these days, though you'll see master in a lot of older repositories, and a few oddballs do something of their own). The commit your name remembers will be the same commit their name remembers, so your main will identify the same commit as your origin/main, which is your Git's memory of their Git's main.

That's kind of a long way around, but that's how you get your first branch from git clone. Having created that branch name, your Git software now does a git switch to that branch, to check out all the files from the snapshot in question. This fills in your working tree and staging area (or index or cache), but we won't go into these details here.
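The clone steps can be tried locally without GitHub. In this sketch, "theirs" is a stand-in repository I create on disk with two branches (the names theirs, mine, and feature are made up for the demo), and "mine" is the clone.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A stand-in for "their" repository, with two branch names
git init -q theirs
git -C theirs symbolic-ref HEAD refs/heads/main
git -C theirs commit -q --allow-empty -m "first"
git -C theirs branch feature

git clone -q theirs mine
cd mine
echo "branch names:"
git branch          # just one, created by the last step of git clone
echo "remote-tracking names:"
git branch -r       # our Git's memory of each of their branch names
```

Their feature branch shows up only as origin/feature; no local feature branch exists until you ask for one.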

GitHub forks are clones, but with a few special features. When you use the GitHub FORK button, you're getting GitHub to make a clone on GitHub. They (GitHub) like it when you do this because they "cheat", using an internal Git thing called "alternates", to avoid actually copying any objects in the big all-Git-objects database. You do, however, get your own copy of the names database, and here we hit the first difference from a git clone-style clone:

  • When GitHub do a "fork", they copy the branch names straight through. So if unicorns has five branches when you mash the FORK button, you have five branches in your fork. This is true even if they immediately add and/or delete some branches right after you mash the button: your branches are an at-the-time-snapshot of theirs. From now on those names are in your repository on GitHub; it's up to you to update them.

    (This is also why there are no remote-tracking names on GitHub.)

  • Besides the change to the way the branch names are handled, GitHub link your fork to the original repository, so that you can make pull requests and the like.

That's pretty much all you need to know and care about here. When you git clone your GitHub fork to your laptop (or other computer, but I'll call it a "laptop" to distinguish it from a GitHub server computer), you'll generally want to git remote add the URL for the repository you forked. You can then git fetch from both repositories, which as we'll see in a moment is how you synchronize.


[2] If they had some remote-tracking names, you've "lost" those, but it turns out GitHub never bother with remote-tracking names in the first place.


Fetch and push

Now that we have two, or three or maybe a thousand or whatever, Git repositories that are all related by cloning, we have the problem of synchronizing our repositories. What if someone else has made new commits? If we want to get their new commits, we use git fetch. The fetch command takes a remote—those short names like origin, where we've stored a URL—and calls up the Git that responds at that URL. We're back to "our Git" and "their Git", just like we were during cloning:

  • our Git has them list out their branch (and other) names to get hash IDs;
  • our Git checks to see if we have the same hash IDs: if so, we have the same commits, if not, we're missing some commits;
  • our Git asks their Git for the hash IDs we don't have (and their Git is obligated to provide the parent hash IDs, which our Git can ask for, and this repeats);
  • and now we have all the commits they have, plus any of our own.

In fact, this is the same process that git clone used initially and it ends the same way: now that we know the hash IDs of their branches, we can create or update each of our remote-tracking names using those hash IDs (provided we downloaded those commits: you can tell git fetch to skip some of them, and then our corresponding remote-tracking branches won't update either).

In summary (and with caveats), git fetch gets any new commits they have that we don't, and updates our remote-tracking names. You give git fetch a remote, like origin, and it goes there and gets stuff from them. If you have only one remote—as many people do—you can stop there; if you have more than one, I recommend using git remote update to update from each, but you can use git fetch --all to fetch from all remotes. Just be careful with --all: see below.
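A small local sketch of that summary: after "they" gain a commit, git fetch moves our remote-tracking name but leaves our own branch name alone. The repositories theirs and mine are stand-ins I create on disk; nothing here talks to GitHub.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q theirs
git -C theirs symbolic-ref HEAD refs/heads/main
git -C theirs commit -q --allow-empty -m "first"
git clone -q theirs mine

# They gain a commit after we cloned
git -C theirs commit -q --allow-empty -m "second"

cd mine
before=$(git rev-parse main)
git fetch -q origin

# Our branch name has not moved; our memory of their branch has
git rev-parse main origin/main
behind=$(git rev-list --count main..origin/main)
echo "main is behind origin/main by $behind commit(s)"
```

Counting in the other direction, with origin/main..main, gives the "ahead by" number that git status reports.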

Suppose we've made new commits and we would like to give those new commits to them? Here, we use git push. This is as close as Git gets to the opposite of git fetch, but there are several key differences:

  • First, we tell our Git what to push, usually by branch name. Our Git looks up the commit hash ID from the branch name: that's the commit we need to send them, if they don't have it. We must also send them all the history behind that commit that they don't have.

  • Second, whenever we're pushing, we don't get to use a "remote-tracking name". Instead, we ask their Git to set one of their branch names. Usually we want to use the same name on both "sides", and if we use a branch name in our git push, that's the name we want on both sides.

So we run git push origin main to send new commits we have from the most recent one on our main, and then we ask them, politely, to set their main to remember the latest such commit.

If we are the only one sending commits to them, we can be pretty sure about when we're adding commits, but sometimes this doesn't work so well. This is an inherently sticky problem if we're not the only one sending them new commits! Still, fetch and push are as close as Git gets to opposites, here.

General observations about diverging branches

It's time to step back a bit and consider what happens even if we, on our own, decide to use multiple branches. Suppose we have a very simple repository with just one branch, whose last commit is H, like this:

...--G--H   <-- main (HEAD)

Because we're about to have more than one branch, we've added HEAD to our drawings to show which name we're using to find a commit. We now create another branch name, br1. As in all cases in Git, this name must select some commit. Let's have it select the most recent main commit:

...--G--H   <-- br1, main (HEAD)

Note that all the commits, everything up through H, are on both branches. Let's create a third name, br2, too:

...--G--H   <-- br1, br2, main (HEAD)

Now we'll run git switch br1 so that any new work we do will be "on branch br1" once we commit it. (Note that work we haven't committed is not in Git, because the working tree is not actually in Git.) We get this:

...--G--H   <-- br1 (HEAD), br2, main

We're still using commit H; we're just doing so via the name br1. So nothing else changes, and in fact Git doesn't even touch any of our working tree files.

We do some work and commit it, which makes a new commit, which gets a new, unique hash ID. We'll call this commit I and draw it in:

          I   <-- br1 (HEAD)
         /
...--G--H   <-- br2, main

The sneaky thing that Git has done here is that it has stored the new commit's hash ID in the name br1 (to which HEAD is attached). That is, the name br1 now finds commit I, instead of commit H! But commit I points backwards to commit H, because when we made I, H was the current commit. Now I is the current commit.

If we make a second commit, we get:

          I--J   <-- br1 (HEAD)
         /
...--G--H   <-- br2, main

and all is fine. We can now git switch br2: Git will rip out all the commit-J files from our working tree and replace them with commit-H files; the committed files are safely saved forever in commit J, and we now have:

          I--J   <-- br1
         /
...--G--H   <-- br2 (HEAD), main

We now make a new commit, as usual. The new commit gets a new, unique hash ID, but we'll just call it K; and K points back to H, because we're on commit H when we run git commit, so now we have:

          I--J   <-- br1
         /
...--G--H   <-- main
         \
          K   <-- br2 (HEAD)

If we repeat for a new commit L, we get:

          I--J   <-- br1
         /
...--G--H   <-- main
         \
          K--L   <-- br2 (HEAD)

An interesting thing about Git is that it will claim that commits up through H are on all three branches. In a way, it's better to think of commits as being "contained in" some set of branches. The set of branches that contain any given commit are those branches where, by starting from the commit selected by the branch name, we can find that commit as we work backwards.
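You can ask Git which branches contain a commit with git branch --contains. This sketch rebuilds the drawing above using empty commits whose messages are the letters from the drawing; the commit helper function is mine.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
commit() { git commit -q --allow-empty -m "$1"; }   # helper for this demo only

git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/main
commit G; commit H
git branch br1; git branch br2
git checkout -q br1; commit I; commit J
git checkout -q br2; commit K; commit L

echo "branches containing H:"
git branch --contains main      # all three names reach H
echo "branches containing J:"
git branch --contains br1       # only br1 reaches J

n_h=$(git branch --contains main | grep -c .)
n_j=$(git branch --contains br1 | grep -c .)
```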

Because branch names simply find the last commit in the branch, we can now, if we like, tell Git to move the name main forward to point to, say, commit J:

          I--J   <-- br1, main
         /
...--G--H
         \
          K--L   <-- br2 (HEAD)

(We don't have to be "on" the branch to move it, and in some ways it's easier to move a name around when we're not "on" it, so I left HEAD attached to br2 in the drawing. The set of commands we can use to move a branch name depends on whether we're "on" the branch, which is ... kind of an annoying thing about Git, really, but it is what it is.)

Once we've done this, though, note that moving the name main to point to commit L causes commits I-J to stop being on main:

          I--J   <-- br1
         /
...--G--H
         \
          K--L   <-- br2 (HEAD), main

We can either have commits I-J be on main, or have commits K-L be on main, at this point. We can't get both sets of commits onto main at this time.
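Here is a sketch of yanking the name main around, using git branch -f while HEAD stays attached to br2. The graph is rebuilt from empty commits named after the drawing; the commit helper and the identity values are assumptions of the demo.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
commit() { git commit -q --allow-empty -m "$1"; }   # helper for this demo only

git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/main
commit G; commit H
git branch br1; git branch br2
git checkout -q br1; commit I; commit J
git checkout -q br2; commit K; commit L            # HEAD stays on br2

git branch -f main br1                             # main now names J
git branch --contains br1                          # br1 and main
git branch -f main br2                             # main now names L
git branch --contains br1                          # only br1: I-J fell off main
not_on_main=$(git rev-list --count br1 ^main)      # commits br1 reaches that main can't

# Git refuses to force-move the branch we are "on"
if git branch -f br2 br1 2>/dev/null; then moved=yes; else moved=no; fi
echo "I-J missing from main: $not_on_main commits; moved current branch: $moved"
```

No commit was created or destroyed by any of this; only the hash ID stored under the name main changed.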

It's easy enough to get both sets of commits onto main by making a new commit M of type merge commit. A merge commit is a commit with two or more parents—usually exactly two—and if we do make such a commit, we can set things up like this:

          I--J
         /    \
...--G--H      M   <-- main (HEAD)
         \    /
          K--L

If and when we do create commit M, and make main point to it, we won't need the names br1 and/or br2 any more to find commits J and L. Git will be able to find them on its own, by stepping back one hop from M.

To create merge commit M, though, we must run git merge. The git push command can't create M for us. Why does this matter? Well, if we are the only person who ever creates commits, we can arrange things so that it doesn't matter. But what happens if we're pushing to some shared repository, where we don't control who pushes, and when?
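For a purely local repository, making M might look like the sketch below. One difference from the drawing: main starts at H, so the first git merge only fast-forwards main to J, and the second one creates M. The commit helper and the empty commits are conveniences of the demo.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
commit() { git commit -q --allow-empty -m "$1"; }   # helper for this demo only

git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/main
commit G; commit H
git branch br1; git branch br2
git checkout -q br1; commit I; commit J
git checkout -q br2; commit K; commit L
j=$(git rev-parse br1); l=$(git rev-parse br2)

git checkout -q main
git merge -q br1                    # main was at H: this just slides main forward to J
git merge -q --no-edit br2          # history diverges here, so Git makes merge commit M
git rev-list --parents -n 1 main    # M's hash ID followed by its two parents, J and L

git branch -q -d br1 br2            # allowed: both tips are reachable from main
git log --oneline --graph main
```

After the two names are deleted, git log still shows I-J and K-L, found by walking back from M.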

git push and "non-fast-forward"

Suppose that both Alice and Bob have clones of some centralized repository. Alice creates a new commit or two on her main and uses git push origin main; meanwhile Bob is creating a new commit or two on his main, and doesn't have Alice's commits.

At this point, the centralized repository has:

          I--J   <-- main
         /
...--G--H

where there's no obvious reason for the kink in the graph—but I've put it in because Bob is, or was, still back at H, where both Alice and Bob were just a short while ago. Bob makes his new commits and gets:

...--G--H   <-- origin/main
         \
          K--L   <-- main

in his repository. When he runs git push origin main, his Git calls up origin's Git and sends over commits K-L, which now look like this:

          I--J   <-- main
         /
...--G--H
         \
          K--L   <-- [bob asks, politely, to set "main" here]

What happens now is simple enough: they just refuse, telling Bob's Git that if they did that, they'd "lose" commits I-J. This shows up at Bob's end as a "non-fast-forward" error.
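The Alice-and-Bob collision can be reproduced on one machine. In this sketch central.git is a bare repository standing in for the shared one, and alice and bob are two clones; all three names are mine, and each person makes one commit rather than two to keep it short.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q --bare central.git
git -C central.git symbolic-ref HEAD refs/heads/main

# Alice seeds the central repository with commit H
git init -q alice
git -C alice symbolic-ref HEAD refs/heads/main
git -C alice commit -q --allow-empty -m "H"
git -C alice remote add origin "$tmp/central.git"
git -C alice push -q origin main

git clone -q central.git bob        # Bob starts at H too

git -C alice commit -q --allow-empty -m "I"
git -C alice push -q origin main    # accepted: this only adds to their main

git -C bob commit -q --allow-empty -m "K"
# Bob never fetched I, so setting their main to K would drop I
if git -C bob push -q origin main 2>"$tmp/push.err"; then
  result=accepted
else
  result=rejected
fi
echo "Bob's push was $result"
cat "$tmp/push.err"
```

The exact wording of the rejection varies between Git versions, so the sketch only checks the exit status.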

If Bob could push to a new branch (bob), that would be fine. Then it might be possible to do the merge on GitHub. I say might because some merges are easy (they have no conflicts) and some aren't. GitHub originally wouldn't do any conflicted merges at all, though they are gradually making GitHub more feature-ridden, er, feature-rich here.[3]

Some people, however, don't like merges. This is one place that git rebase comes in. Git's "squash merge", represented by the GitHub SQUASH AND MERGE button, also comes into play here. GitHub's REBASE AND MERGE button is strongly related, but it's ... well, let's just get on with rebase now.


[3] Seriously, there's nothing wrong with fancier tools. Just remember Scotty's Dictum, "The more they fancy up the plumbing, the easier it is to stop up the drain."


Rebasing

As I mentioned above, we will sometimes copy a commit in order to improve it. The simplest Git command for copying a commit is git cherry-pick, which is often used like this:

git switch somebranch       # switch to some branch
git cherry-pick a123456     # commit hash ID from `git log`

The cherry-pick operation copies the effect of the given commit. That is, commit a123456, in this case, has a snapshot, and has a (single) parent—we normally only copy ordinary single-parent commits—and if we have Git compare the parent's snapshot with a123456's snapshot, there's some set of changes that we (or whoever) made.

To achieve the cherry-pick operation, Git uses its internal merge machinery to make the same set of changes to our current commit, which in this case would be the most recent commit on somebranch. Hence, if we had a graph like this:

          o--P--C--o--o   <-- branch-xyz
         /
...--o--o
         \
          o--o--H   <-- somebranch (HEAD)

and commit C is the a123456 one whose hash ID we gave to git cherry-pick, Git will compare the snapshot in P (C's parent) to the snapshot in C, to see what changed in that commit.

In order to apply the same change, Git needs to compare the snapshot in P to that in H. That way, if commit H has the same code in it that P does, but it's been moved around within a file, or maybe even moved to a different file, Git can (usually) figure out where the code went. Then Git can apply the change to the correct snapshot-H file, at the correct line(s). This operation is, technically speaking, exactly the same thing Git does for a full-blown git merge: that happens to do exactly the right thing here. (Of course, because it is a merge, it can have merge conflicts, but as long as the code in snapshot H is sufficiently similar to that in snapshots P and C, that's not too likely. And if it does occur, we probably have to think about what might need to be altered in the P-vs-C change anyway.)

Having made the same change, git cherry-pick goes on to copy most of the metadata from the original commit: it saves the original commit's log message, and even keeps the original commit's author. It makes us the committer though, and then makes a new commit that's "just as good as" the original, but which adds on to our current branch:

          o--P--C--o--o   <-- branch-xyz
         /
...--o--o
         \
          o--o--H--C'  <-- somebranch (HEAD)

Looking at commits C and C', we'll mostly "see" the same thing, even if we include displaying a patch for the two commits. The hash IDs will differ, though, and of course commit C' is on our branch.
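A runnable sketch of C versus C', in a throwaway repository. The file names and messages are invented for the demo, and the graph is a cut-down version of the drawing: one commit on branch-xyz and one extra commit on main, which plays the role of somebranch.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/main
echo base > file.txt; git add file.txt; git commit -q -m "base"

git checkout -q -b branch-xyz
echo feature > feature.txt; git add feature.txt
git commit -q -m "C: add feature"
c=$(git rev-parse HEAD)

git checkout -q main
echo other > other.txt; git add other.txt; git commit -q -m "H"

git cherry-pick "$c" >/dev/null     # copy C's change on top of H
c_prime=$(git rev-parse HEAD)

echo "C : $c"
echo "C': $c_prime"
# Same change when each commit is compared with its own parent
git diff "$c^" "$c" > "$tmp/c.patch"
git diff "$c_prime^" "$c_prime" > "$tmp/c_prime.patch"
cmp "$tmp/c.patch" "$tmp/c_prime.patch" && echo "patches are identical"
```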

Now suppose we take the Alice-and-Bob situation and look at this as a case of just needing to copy the commits:

          I--J   <-- alice-main
         /
...--G--H
         \
          K--L   <-- bob

Suppose we copy K to a new-and-improved K' whose biggest change—maybe even only change, in some ways—is that it comes after J, and then copy L to a new-and-improved L' in the same way:

          I--J   <-- alice-main
         /    \
...--G--H      K'-L'  <-- bob-version-2 (HEAD)
         \
          K--L   <-- bob

We can now abandon the original K-L commits in favor of the new and improved K'-L' pair. To do that, we have Git forcibly yank the name bob over to point to L', and delete the temporary bob-version-2 name:

          I--J   <-- alice-main
         /    \
...--G--H      K'-L'  <-- bob (HEAD)
         \
          K--L   ???

We no longer have any name by which to find commit L. It will seem to be gone from our repository! It's still there, it's just that git log --all or git log --branches uses the branch names to get started, and there is no branch name that starts us looking at L any more.
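The copy-then-move-the-name sequence is what git rebase automates. Below is a sketch under a few assumptions of mine: main stands in for alice-main, each commit adds one small file named after its letter, and everything lives in a throwaway repository.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
commit_file() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/main
commit_file H
git checkout -q -b bob
commit_file K; commit_file L
old_l=$(git rev-parse bob)

git checkout -q main                 # main plays the role of alice-main
commit_file I; commit_file J

git checkout -q bob
git rebase -q main                   # copy K-L to K'-L' after J, then move the name bob
new_l=$(git rev-parse bob)

git log --oneline --graph bob
# The original L: no branch name finds it, but the object is still there
git cat-file -t "$old_l"
lost=$(git branch --contains "$old_l")
echo "branches containing the old L: '${lost}'"
```

The old hash ID saved in old_l is the only reason the script can still show the original L; git log on the branches alone would not find it.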

If we're not paying attention, it even looks like commits K-L have somehow changed. They haven't: there are instead new commits. But we can now git push origin bob:main (to throw some never-before-shown syntax at you
