I currently have some staged but uncommitted changes on the `master` branch in Git.
Instead of committing to the `master` branch, I want to:
- create a new branch, say `development`; then
- move the staged changes across to the new branch, and reset/clear the staged changes on `master`; then
- commit the staged changes on the new branch; then
- push the commit to the remote repo; then
- merge this commit from `development` into `master` on the remote, and keep the `development` branch; then
- refresh the local `master` from the remote `master`, without changing my existing uncommitted files locally.
May I ask how I should do that? I am still a Git beginner, so please explain in small steps so I can follow.
Note 1: My staged changes consist of more than 100 files, so it would be a pain to hand-pick and manually add them one by one to the new branch. I am trying to avoid this error-prone approach if possible.
Note 2: There are over 30 files whose changes I have not staged. I want to keep these changes locally even after I refresh from the remote `master`.
A BIG THANK YOU TO YOU
CodePudding user response:
TL;DR
Changes are not staged "on a branch". In fact, Git doesn't have changes at all: Git only has snapshots.
What does this mean? Well, it means the short answer to your question is:
- create the new branch name (`git branch`); then
- switch to the new branch name (`git switch` or `git checkout`); then
- commit (`git commit`).
You can combine the first two steps with `git switch -c development` or `git checkout -b development`. Both commands do the same thing: `git switch` was new in Git 2.23, as part of a project to split the overloaded `git checkout` command into two separate commands (`git switch` and `git restore`); the old `git checkout` remains in Git and probably will for the next 20 years, but it's a good idea to slowly migrate yourself to the new ones.
It's important to realize that this process, `git switch -c development` in particular, makes use of a shortcut. It won't work for certain other cases, but it will work for this case.
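As a rough sketch of the short answer, the script below replays the situation in a throwaway repository. The file names, the identity settings, and the commit messages are made up for illustration; only the "create and switch to `development`, then commit" sequence comes from the steps above. It falls back to `git checkout -b` in case the installed Git predates `git switch`.

```shell
#!/bin/sh
set -e

# Throwaway repository, so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is named master
git config user.name "Demo User"          # hypothetical identity, local to this repo
git config user.email "demo@example.com"

printf 'one\n' > staged.txt
printf 'one\n' > unstaged.txt
git add staged.txt unstaged.txt
git commit -q -m "initial commit"

# Recreate the question: one staged change and one unstaged change on master.
printf 'two\n' >> staged.txt
git add staged.txt
printf 'two\n' >> unstaged.txt

master_before=$(git rev-parse master)

# Create the new branch name and switch to it; no files are touched.
git switch -q -c development 2>/dev/null || git checkout -q -b development
git commit -q -m "commit the staged changes on development"

master_after=$(git rev-parse master)
current_branch=$(git symbolic-ref --short HEAD)
still_modified=$(git diff --name-only)    # files changed in the working tree but not staged

echo "current branch:         $current_branch"
echo "master moved:           $([ "$master_before" = "$master_after" ] && echo no || echo yes)"
echo "unstaged changes kept:  $still_modified"
```

The new commit lands on `development` only, `master` still names the old commit, and the unstaged edit is still sitting in the working tree.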
Longer
This really deserves a longer explanation though. Why does the above work? What you need to know starts with this:
Git is all about commits
Newbies to Git often think that Git is about files, which is natural enough: we store files in Git. Or, they might think Git is about branches, which is also natural enough: we're always "on" some branch, as in, `git status` says `On branch master` or whatever. Technically, you can also be on no branch, in what Git calls "detached HEAD mode", but except for some special cases, you don't normally want to work that way.
The thing is, neither of these views is right. Git is, in the end, all about commits. It's true that each commit stores files, and it's true that we form our commits into branches, which we also find with branches (or more precisely, with branch names, which we rather sloppily call branches even though we call other things branches too). But in the end, Git is about the commits.
(Note: If you're getting a sense of there being something wrong when we use branches to find branches, you're on the right track: there is something wrong with this notion. It's the fact that the word branch is badly defined. It's (ab)used for multiple things.)
A repository is, at its heart, a collection of commits, plus a few other things we'll get to in a moment. These commits are one of four kinds of Git objects, and Git stores all these objects in a big key-value database, the objects database, which uses hash IDs (or more formally object IDs or OIDs) as the keys. Git desperately needs these object IDs in order to find the commits in the database.
These hash IDs are big, ugly, and random-looking. For instance, `9bf691b78cf906751e65d65ba0c6ffdcd9a5a12c` is a particular commit in any clone of the Git repository for Git itself. Every commit gets a unique hash ID: all Git software everywhere in the universe agrees that that `9bf691...blahblah` thing means that commit, even if this particular repository never has had that commit and won't ever get it. Git makes up a new unique hash ID every time you make a commit.[1] This means that all you need to find the commit is the hash ID. But again, Git really needs that hash ID, so that it can look in its objects database. Either it has that object, so it has the commit, or it doesn't. If your Git repository lacks the commit, you'll need to get the commit from some repository that has it. We'll leave out the details, but this is what `git fetch` is about.
Anyway, given that the commit is so important, you'll need to know exactly what a commit is and does for you. So, besides the weird random-looking "number" (hash ID), here's what you need to know:
- Every commit is read-only. The numbering system requires this: the hash ID is computed from the commit's contents, so changing anything would change the ID.
- Every commit contains (indirectly) every file. More precisely, a commit has a full snapshot of every file (that it has), as a sort of archive: a tarball or zip file or WinRAR or whatever, of your source files. Git stores these very cleverly, including de-duplicating the contents, so that the contents get shared across, and even within, the commits. So even though every commit has every file, most commits are truly tiny. The first one ever, in a new repository, isn't, since that one has to store all the files for the first time, but after that, most commits mostly re-use most files, so those re-used files take no space.
- Besides the snapshot, each commit contains some metadata, or information about the commit itself. This includes the name and email address of the person who made the commit, for instance.
Except for your name and email address (which Git gets from your `user.name` and `user.email` settings), Git mostly builds all the metadata on its own. You just run `git commit` and Git makes a snapshot (we'll see "from where" in a moment) and adds on the metadata. One of the most important pieces of metadata in any one commit, for Git, is a list of previous commit hash IDs. Most commits have exactly one entry in this list: we call these "ordinary" commits.
This single previous commit hash ID, stored in an ordinary commit's metadata, makes the commit "point to" its parent. That is, the commit remembers which commit comes before this particular commit. If we like—and I do like—we can draw this, putting newer commits towards the right and older ones towards the left, like this:
... <-F <-G <-H
Here `H` stands in for the "h"ash ID of the last commit in the chain. Commit `H` has a full snapshot of all files, plus some metadata, and the metadata in commit `H` contains the hash ID of earlier commit `G`. So `H` points to `G`, as represented by the little arrow sticking backwards out of `H`.
This means that if we can get Git the hash ID of commit `H`, Git can use commit `H` to find earlier commit `G`. Of course, `G` is an ordinary commit too, so it has the one arrow sticking out of it, pointing backwards to its parent `F`. Git can now find commit `F`. As long as this, too, is an ordinary commit, it points backwards to yet another commit.
In this manner, Git can find every commit in a chain, as long as Git can find the last commit in the chain. All we have to do is memorize the hash ID of our last commits. Of course, memorizing `9bf691b78-ugh-glah-whatever` is horribly painful, so Git gives us a way to avoid that.
[1] We can prove mathematically that this idea (a unique hash ID for every commit, forever) is doomed to failure. The sheer size of the hash ID space, however, puts the date of failure far enough into the future that, we hope, we will never have to care.
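To make the "snapshot plus metadata plus parent pointer" picture concrete, here is a small sketch that builds a three-commit chain in a scratch repository and prints the raw contents of the last commit. The labels `F`, `G`, `H`, the file name, and the identity are placeholders chosen to match the drawing above, not anything from a real repository.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

# Three ordinary commits, standing in for F, G and H.
for label in F G H; do
    printf '%s\n' "$label" > file.txt
    git add file.txt
    git commit -q -m "commit $label"
done

# Raw commit object: a tree line (the snapshot), a parent line,
# author/committer metadata, then the message.
git cat-file -p HEAD

object_type=$(git cat-file -t HEAD)
parent_in_metadata=$(git cat-file -p HEAD | sed -n 's/^parent //p')
parent_via_name=$(git rev-parse HEAD~1)
chain_length=$(git rev-list --count HEAD)   # walks the parent links back to the start

echo "object type:      $object_type"
echo "parent (stored):  $parent_in_metadata"
echo "commits in chain: $chain_length"
```

The `parent` line stored inside the last commit is the same hash ID that `HEAD~1` resolves to, which is exactly the backwards-pointing arrow in the diagram.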
Branch names find the last commits
To avoid having to memorize the hash ID of commit `H`, we simply tell Git that we'd like to have a branch name, such as `master`. Git sticks the last hash ID into the name:
...--G--H <-- master
The name now points to the last commit, from which Git can find every earlier commit. That's all there is to it—well, almost all.
As I mentioned earlier, Git likes us to be "on" a branch. Being on some branch means that the special name `HEAD` is attached to the branch name: that's how Git knows which of our many branch names we're actually using.
Let's add a new branch name now, and make it also point to commit `H`, like this:
...--G--H <-- development, master (HEAD)
This means we're "on" `master`. Both names select commit `H`, and commit `H` is the commit we're using right now, but we're doing that using the name `master`.
If we run `git switch development` now, we get:
...--G--H <-- development (HEAD), master
We're still using commit `H`, but now we're doing so through the name `development`. This matters when we create a new commit. Because we're using commit `H`, our new commit will point backwards to `H`, but the hash ID of that new commit will get saved away in the current branch name, like this:
...--G--H <-- master
         \
          I <-- development (HEAD)
If we make another new commit now, this new commit `J` will point backwards to `I` (because we're on commit `I` now, through the name `development`) and Git will update the name `development` again:
...--G--H <-- master
         \
          I--J <-- development (HEAD)
This is how branches grow, in Git. Should we switch back to `master` now:
...--G--H <-- master (HEAD)
         \
          I--J <-- development
Git will take away the `development` files (that is, the commit-`J` files) and put back all the commit-`H` files instead.
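The drawings above can be reproduced in a scratch repository. This is only a sketch: the single file and its one-letter contents are invented so that it is easy to see which commit's files are in the working tree, and the fallback to `git checkout` is there only for Git versions older than 2.23.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

printf 'H\n' > file.txt
git add file.txt
git commit -q -m "commit H"
h=$(git rev-parse master)

# A second name for the same commit: ...--H <-- development, master (HEAD)
git branch development
dev_start=$(git rev-parse development)

# Attach HEAD to development, then grow that branch by two commits.
git switch -q development 2>/dev/null || git checkout -q development
printf 'I\n' > file.txt
git commit -q -a -m "commit I"
printf 'J\n' > file.txt
git commit -q -a -m "commit J"

master_tip=$(git rev-parse master)              # still H
dev_grandparent=$(git rev-parse development~2)  # J -> I -> H
content_on_development=$(cat file.txt)

# Switching back swaps the commit-J files for the commit-H files.
git switch -q master 2>/dev/null || git checkout -q master
content_on_master=$(cat file.txt)

echo "on development, file.txt held: $content_on_development"
echo "on master, file.txt holds:     $content_on_master"
```

Only the name `development` moved as commits were added; `master` kept pointing at `H` the whole time.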
Git's index and your working tree
I mentioned (briefly) that Git's commits are read-only, and the files are stored in the commit in a sort of archive fashion, with compression and de-duplication. What this means for us is that we literally can't read those files—only Git can read them—and literally nothing, not even Git itself, can overwrite them. That's great for archival—which is what the commits are doing, at least at the first level—but useless for getting any new work done.
To get work done, then, Git has to un-archive the files in a commit. When we switch around, with `git switch` or `git checkout`, Git will extract all the files from the commit we're moving to. First, of course, Git has to remove all the files from the commit we're moving away from. Then Git extracts all the files somewhere that you can work. That's your working tree or work-tree. You can now get work done!
Git's de-duplication trick comes into play here. Removing and replacing files is kind of slow, so Git checks, before switching from commit to commit, which files are duplicates. For those files, Git doesn't have to do anything at all, so it doesn't. And, if we are switching from, say, commit `H` to commit `H`, that means every file is a duplicate and therefore Git does not have to remove-and-replace any files.
That's why creating a new branch name, then switching to it, is safe here. No files need to be reworked; no files need to be touched at all. So Git touches no files, and all is well.
There's more to say about this, though. Often, Git does have to fill in the working tree. Consider, for instance, the case where you have just cloned a repository and Git is filling in your working tree for the first time ever. You might think: Ah, well, Git just extracts all the files. That's what other version control systems do, for instance. It would be sufficient. But that's not what Git does.
Instead, Git has a kind of tracking system for files, that Git calls by three names: the index, the staging area, and sometimes (mostly now in flags like `git rm --cached`) the cache. These three names all refer to the same thing.
When Git is extracting a commit, Git fills in its index and your working tree with the files from that commit. The copy (or "copy" perhaps) in the index is pre-de-duplicated, stored in Git's internal read-only format, but unlike the copy in a commit—which is frozen for as long as the commit itself continues to exist—the index copy can be replaced wholesale. Since the initial copy (or "copy") is a duplicate—of whatever is in the commit—it's automatically de-duplicated away to almost nothing. (The index entry itself still takes some space, to hold the file's name and a bunch of cache data.)
Again, this index is the "staging area": these are two names for the same thing. When you modify a working-tree copy of a file, nothing happens to the index copy—not yet! It just sits there, still holding the de-duplicated duplicate from the commit.
When you run `git add`, though, Git reads the working tree copy of the file, compresses it into the internal format that Git uses in commits, and checks for duplicates. If the compressed file would be a duplicate, Git throws out the compressed copy it just built and uses instead the existing compressed copy. Otherwise, it saves away the compressed copy it just made. In either case, Git now swaps that new-or-re-used copy/"copy" into the index.[2]
The result of all this is simple:
- before `git add`, the index held a copy of every file (pre-de-duplicated), ready to commit;
- after `git add`, the index still holds a copy of every file (pre-de-duplicated), ready to commit.
This means that at all times, the index holds a copy of every file, ready to commit. In effect, the index holds the snapshot for your proposed next commit.
When you run `git commit`, Git just packages up all the files that are in the index, in the form they have right then, to be used in the new snapshot. This will be the archive for the new commit. Git also gathers or generates the necessary metadata at this point, using the current commit hash ID as the parent, and writes all of this out to obtain the new commit's random-looking hash ID. Then `git commit` stores the new commit into the database and stuffs its ID into the current branch name.
It's all really relatively simple. So where do changes come in?
[2] Technically, the index entry holds the file's name, some cache data, and a blob hash ID for an object in the all-objects database. You don't really need to worry about this. You can think of the index as holding a full copy, if you prefer.
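For the curious, the index entry can be inspected directly. The sketch below (scratch repository, made-up file name and contents) uses `git ls-files -s` to print the blob hash ID stored in the index, and shows that editing the working-tree file leaves the index alone until `git add` swaps in the new blob.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

printf 'version 1\n' > file.txt
git add file.txt
git commit -q -m "first version"

# Helper: the blob hash ID recorded in the index for file.txt.
# `git ls-files -s` prints "<mode> <blob-id> <stage><TAB><name>".
index_blob() { git ls-files -s file.txt | cut -d' ' -f2; }

committed_blob=$(git rev-parse HEAD:file.txt)

# Edit the working-tree copy only; the index copy is untouched.
printf 'version 2\n' > file.txt
before_add=$(index_blob)
worktree_blob=$(git hash-object file.txt)   # what the blob ID would be for the new contents

# git add swaps the new blob into the index.
git add file.txt
after_add=$(index_blob)

echo "committed blob:     $committed_blob"
echo "index before add:   $before_add"
echo "index after add:    $after_add"
```

Before `git add`, the index entry matches the committed blob; afterwards it matches the hash of the working-tree contents, which is what the next `git commit` would snapshot.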
If a commit is a snapshot, why do we see changes?
If you run `git show` on an ordinary commit, Git will:
- spit out the metadata; then
- show a diff to show you what changed.
Git computes this diff at this time! Git uses the metadata in that ordinary commit to find the commit's parent. Git then extracts both snapshots (to a temporary area, in memory really) and compares the de-duplicated files. Since the duplicates are now trivial to spot, Git really only has to compare the differing files. For each such file, Git computes a set of changes that, if applied to the parent's copy of the file, would produce the child's copy of that file.
That's the diff you see: the one `git diff` just made by comparing a parent to a child. (The `git show` command invokes the same internal code as `git diff` for this. It's just that `git show` automatically locates the parent for you. If you want to use `git diff` this way, you have to pick out both commits. The upside of having to pick out both commits is that you can pick out any pair of commits.)
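One way to check this for yourself, sketched here in a scratch repository with invented file contents, is to compare the patch part of `git show` against an explicit parent-to-child `git diff`. The `sed` filter simply drops the metadata that `git show` prints before the first `diff --git` line.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

printf 'one\n' > file.txt
git add file.txt
git commit -q -m "parent commit"

printf 'two\n' >> file.txt
git commit -q -a -m "child commit"

# git show: metadata first, then a diff against the parent it found by itself.
patch_from_show=$(git show HEAD | sed -n '/^diff --git/,$p')

# git diff: we pick out both commits ourselves.
patch_from_diff=$(git diff HEAD~1 HEAD)

printf '%s\n' "$patch_from_diff"
```

Both variables end up holding the same patch text, computed on demand from the two snapshots.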
When you run `git status`, Git:
- prints stuff about the current branch first (`On branch master` for instance);
- runs a `git diff` to compare the current commit snapshot to the index snapshot; and
- runs one more `git diff`, this time to compare the index snapshot to your working tree.
That first comparison (what's changed between the current commit and the proposed next snapshot in the index / staging area) can quickly skip over the de-duplicated identical files, and compare just the different files. Since it's not going to emit an actual set of changes, it short-circuits that code (you can do this yourself with `git diff --cached --name-status`) and shows just that some file is changed.
Any file that shows up here as changed is listed under "Changes to be committed", that is, staged for commit. New files, or deleted files, show up here the same way. (Git also does rename detection here; we won't cover this properly.)
Having listed these "staged for commit" files, `git status` is done with the first diff. It now goes on to do a second `git diff --name-status`, this time comparing what's in its index to what's in your working tree.[3] For each file that's the same, once again, Git says nothing. But for files that are different, Git now mentions the file's name, and lists this file under "Changes not staged for commit".
There's a bit of weirdness here though. Suppose you delete a file from your working tree, without deleting it from Git's index, using your OS's "remove a file" command (whatever that is for your OS). Git will say that a deletion of this file is "not staged for commit". That makes sense, and matches the first kind of diff: if you use `git rm`, which removes both the index and the working tree copy, you see the deletion as "staged for commit" (and then the index and working tree lack-of-copy matches, so it's not mentioned again).
But suppose you have an all-new file in your working tree that you have not `git add`-ed. Git saves these file names for after the diff output. It then goes on to whine about these files as being untracked.
[3] Since the files in your working tree aren't compressed-and-de-duplicated (they're just ordinary files), Git has to work a lot harder here. We'll skip all these details too.
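The two comparisons behind `git status` can be run by hand. In this sketch (scratch repository, hypothetical file names), `git diff --cached --name-status` stands in for the commit-versus-index comparison, plain `git diff --name-status` for the index-versus-working-tree comparison, and `git ls-files --others --exclude-standard` lists the untracked files that `git status` reports last.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

printf 'a\n' > a.txt
printf 'b\n' > b.txt
git add a.txt b.txt
git commit -q -m "initial commit"

printf 'more\n' >> a.txt
git add a.txt                  # a.txt: index differs from the current commit
printf 'more\n' >> b.txt       # b.txt: working tree differs from the index
printf 'new\n' > new.txt       # new.txt: in the working tree, not in the index

staged=$(git diff --cached --name-status)                 # current commit vs index
unstaged=$(git diff --name-status)                        # index vs working tree
untracked=$(git ls-files --others --exclude-standard)     # working tree only

echo "staged:    $staged"
echo "unstaged:  $unstaged"
echo "untracked: $untracked"

# The same three groups, as git status itself presents them.
git status --short
```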
Untracked files and .gitignore
An untracked file is a file that is in your working tree, but not in Git's index. That's all there is to this, except for the fact that the index is something you (partially) control, with `git add` and `git rm`. It's crucial that you understand this, because of some weirdness with `.gitignore`.
The `.gitignore` file is rather misnamed. It doesn't ignore a file. Git commits whatever is in Git's index: you run `git commit`, and whatever is in Git's index goes into the new commit. What `.gitignore` does starts with this: it makes `git status` shut up.
When you run `git status`, it whines a lot about all your build files, for instance. But they are untracked on purpose, and we want them to stay that way. Complaining about them is counterproductive. So listing those in `.gitignore` tells Git: shut the ____ up.
The `git add` command also has some en-masse "add everything" modes: e.g., `git add .` or `git add --all`. These look for all the files that Git would complain about, and add them. Since listing the files in `.gitignore` makes Git stop complaining, it also makes Git stop adding those files, if they're currently untracked.
What `.gitignore` doesn't do is stop Git from committing the files when they're tracked. If some file is in Git's index, it will be committed. The "add all" or `git add -u` (update) modes will update the index copy from the working tree copy. So this should be called `.git-do-not-complain-about-these-files-when-they-are-untracked-and-do-not-add-them-with-an-en-masse-git-add-operation-either`, or something. But nobody wants to type in a file name like that, so `.gitignore` it is.
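Both halves of that behavior can be seen in a small experiment. The sketch below assumes a `*.log` pattern and two made-up files: `tracked.log`, which was committed before the pattern existed, and `build.log`, which was never added.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"

# tracked.log gets into the index before any ignore rule exists.
printf 'tracked\n' > tracked.log
git add tracked.log
git commit -q -m "add tracked.log"

printf '*.log\n' > .gitignore
git add .gitignore
git commit -q -m "ignore log files"

printf 'build output\n' > build.log      # untracked and matched by .gitignore
printf 'changed\n' >> tracked.log        # tracked, also matched by .gitignore

# Untracked files that status would complain about: build.log is not among them.
complained_about=$(git ls-files --others --exclude-standard)

# En-masse add: skips the ignored untracked file, still updates the tracked one.
git add .
staged=$(git diff --cached --name-only)
build_in_index=$(git ls-files build.log)

echo "status complains about: [$complained_about]"
echo "staged by 'git add .':  [$staged]"
echo "build.log in index:     [$build_in_index]"
```

The ignored, untracked `build.log` is neither reported nor added, while the change to the already-tracked `tracked.log` is staged and would be committed despite matching the pattern.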
Conclusion
Once you know what the index is and does and how Git swaps out index and working tree copies when you change commits, it becomes clear that as long as you don't change commits—such as, when you create a new branch that still selects the current commit, and then switch to that branch—it's quite safe to just make a new commit on a new branch by making the new branch first.
(Later, once you understand how branch names "point to" commits, it's easy to see how to make a commit, then create a branch name, then move the other branch name back one step, to achieve the same thing. But that's more work, so you might as well do the simpler and clearer thing.)