Home > Software engineering >  What does git checkout do?
What does git checkout do?

Time:11-04

I thought that the git checkout command updated files in the working tree only. In fact, the man page reads:

git-checkout - Switch branches or restore working tree files

However, I've just run this command:

git checkout 11cb5b6 -- hello.txt

and, in addition to updating my working tree copy, this command also updated my index. Before the command, git status was giving me a clean result:

nothing to commit, working tree clean

but immediately after the checkout, it reads:

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   hello.txt

i.e. the working tree file has been updated AND staged. What am I missing?

CodePudding user response:

when you do git checkout with specific branch or commit, the staging area is updated as well as the working directory. So, the version of hello.txt that is in head's parent and copy it to both the staging area and the working directory.

If you do not specify a commit or branch then the contents of hello.txt will be copied from the staging area to the working directory. The staging area itself is not changed.

CodePudding user response:

The git checkout command is very complicated. It's so complicated that someone finally, as of Git 2.23, split it into two separate commands:

  • git switch, which does the "select other branch or commit to have checked out" operation; and
  • git restore, which does the "update some files in index and/or working tree" operation.

This still doesn't mention several additional modes of operation (git checkout -m for instance),1 but at least separates out all the "restore files" options, of which there are many.

You're using the "restore files" mode of git checkout and, as shrey deshwal noted, this operation will:

  • write to your working tree (always); and
  • write to your index / staging-area (sometimes).

When using git restore instead of git checkout, you control which of the index/staging-area and working-tree files are updated, using the -S (staging area) and -W (working tree) options. This is not possible with git checkout: git checkout always writes to the working tree, and also writes to the index/staging-area if you specify a commit or tree object as the source of the file to be written to the working tree.

If you have Git 2.23 or later, use git restore to do this: its operation is less confusing and more direct. You specify the --source for the file, or let it default to using the index:

git restore --source 11cb5b6 -- hello.txt

This writes to the working tree (only). Or, add -S and/or -W to write to the staging area (index) with -S, and/or working tree with -W:

git restore --source 11cb5b6 -SW -- hello.txt

This writes to both staging-area and working-tree (because both -S and -W are given).

By contrast, git checkout -- file makes the source less obvious (it's the same as it is in git restore, but less obvious) and gives you no choice of target(s) (always -W, but -S gets added if the source is a commit or tree). The git restore command also documents the --overlay vs --no-overlay modes properly. This option applies only to the "restore files" mode of git checkout (where it is now documented, but it's not clear that it only applies to this mode!).


1The -m option to git checkout:

  • re-creates a merge conflict, or
  • does a merge operation during a checkout, as if you ran a rather complicated series of Git commands that ends up with you on the target branch, having then merged against the uncommitted code in your working tree.

This second operation is somewhat dangerous: as the documentation now notes,

When switching branches with --merge, staged changes may be lost.

The first operation cheerfully destroys any merge work you started in the working tree copy of the file. So git checkout -m is always "dangerous" in the git restore way: it will wipe out uncommitted work without asking. I kind of wish that these had not been left in the git switch command, but they were.


Read only if the above still doesn't make sense:

If this stuff still doesn't come together, you're probably missing out on a key concept in Git: how the index / staging-area really works.

A Git repository is, to a large extent, just a big database of commits. What you do with this repository is add more commits. Each commit itself is:

  • Numbered. Each commit has a big, ugly, incomprehensible hash ID number such as e9e5ba39a78c8f5057262d49e261b42a8660d5b9 (often abbreviated, e.g., e9e5ba3). These appear random, though in fact they're entirely non-random.

  • Storage: each commit stores two things:

    • A commit has a full snapshot of every file. Commits don't store changes, so when Git shows you changes, it's really doing a git diff of two snapshots.

    • A commit also stores some metadata, or information about the commit itself. This includes things like the name and email address of the author of the commit (from user.name and user.email). It includes some date-and-time stamps. It includes a log message, which git log or git show will show before any diffs. And, crucially for Git's internal operation—though we won't cover any of the details here—each commit stores a list of previous commit hash IDs. Most commits just store one such hash ID, which is the "parent" of the commit. That's how Git finds the previous commit, so that it can show you what changed.

All of this stuff inside the commit is completely, totally, 100% read-only: no part of any commit can ever be changed once it's made.2 But this leaves us with a dilemma: if no part of any commit can ever change, how can we get any new work done?

Git's answer is the same as the answer is other version control systems: sure, the commits are read-only forever, but you don't do work with the committed files. We copy the files out of the commit, into a work area. You work on / with those files. That area is your working tree or work-tree. The copied-out files are ordinary read/write files, that your computer can do ordinary work on or with.

So far, none of this is particularly bizarre or incomprehensible. A commit is like an archive of files, like a tar or rar file made out of other files, but with special Gitty features like metadata and a weird random-looking number. We use git checkout or git switch to pick one: Git extracts the files, and now we can work on them.

But here's where Git gets weird. If you've used other version control systems, you are probably used to this idea: you work on the files and then you tell the VCS to commit them and it does. That would be simple, so Git doesn't do that.

When Git goes to build a new commit, Git does not use your files at all! Instead, Git uses a secret extra copy. Only it's not actually secret, and it's not usually a copy either. What it is, is hidden. This extra "copy" of each file is in what Git calls by three different names:

  • The index: a meaningless term. Meaningless is sometimes good, because then there's no preconceived notion to push out of the way for some weird technical reason. But it makes it a bit hard to remember.

  • The staging area: this is how you use the index, so it makes sense. But this obscures the technical details, which do matter. You need to be aware of them.

  • The cache: this is the worst name of all, because it's how Git itself sometimes uses the index, but not how you use it, and doesn't cover all the ways that Git uses the index. This term is mostly defunct, except that it appears in flags like git rm --cached or git diff --cached.

Sometimes the --cached flag has --staged as a synonym: git diff --staged does exactly the same thing as git diff --cached. Sometimes they don't: git rm --staged rejects the --staged entirely. Oddly, git restore has only --staged. Getting rid of --cached entirely might be a good direction; maybe Git will eventually do that. But in any case, you need to know all three names, as "the index" appears in various places. In particular, the index has a special role during conflicted merges, and it determines whether files are tracked or untracked. We won't go into this level of detail here; we'll only talk about the index as it pertains to making new commits.

When you run git commit, instead of Git reading from your working tree to find out which files you've changed, Git simply packages up the files that are in its index at this time.3 To make that work, the initial git checkout or git switch step first fills in Git's index.

What this means is that after you've checked out some commit to work on it, you have three "copies" of each file:

  • There is a read-only, Git-ified, compressed-and-de-duplicated special format version of the file in the commit.
  • There is another "copy" (see below for the reason for the quotes here) of that file in Git's index / staging-area.
  • Last, there's a usable copy (no quotes this time) of the file in your working tree. That's the one you can see and edit.

When Git stores a file permanently in a commit, it:

  • compresses the file, so that it takes less space;
  • sometimes, super-compresses the file (this happens late in the game);
  • always Git-ifies the file: it's not stored as a file (with a name and other OS attributes), but rather as an internal Git blob object (nothing but a hash ID name, no modes, etc.; the names and modes are stored in additional Git objects called tree objects).

In other words, these files are Git-ified and put into a database, rather than kept as regular files. Whenever Git does this, it automatically de-duplicates the content. So even though every commit stores every file, the repository doesn't bloat up out of control when you have millions of commits, as many of the commits have the same copy of some file, and those are all shared. Instead of a copy of the file, we get a "copy" of the file, inside the commit: hence the quotes.

The index stores pre-Git-ified, pre-compressed and pre-de-duplicated "copies" in this same way that commits do. The git add command therefore:

  1. reads the working tree version of the file;
  2. compresses it and otherwise Git-ifies it: this produces an internal hash ID for de-duplication purposes;
  3. decides whether the file is a duplicate, or not.

If the file is a duplicate, git add throws out the copy it just made: we don't need that one, we have one in the repository already. Git updates its index with the duplicate, and the file is now ready to be committed, all stored in Git's index.

If the file isn't a duplicate, git add takes the now-prepared compressed file and readies it to go into the database, sort of temporarily added.4 And now Git has a copy that is ready to be committed, stored in its index.

So:

  • we started with some file path/to/file.ext in the index, ready to be committed;
  • then we Git-ified the (real, OS-level) file file.ext in folder to in folder path with git add;
  • then git add updated the index "copy" if needed, and now we have path/to/file in the index, ready to be committed.

Hence, before git add, the index contained the proposed next commit. After git add, the index still contains the proposed next commit. What git add just did was update the proposed next commit.

If you add an all new file, the same sequence happens: git add reads and compresses the ordinary OS-level file, figures out whether the contents are a duplicate or not—maybe you've added world.txt which also contains hello world, which is a duplicate of existing hello.txt for instance—and git add has updated the proposed next commit so that new file world.txt is listed there too.

In all cases, Git always has the proposed next commit set up in its index.5 Running git commit means use the proposed next commit as it is now, which is why you can partially stage stuff: what you're doing is adding, to the index, copies of some files that do match the working tree copies, and copies of other files that don't match the working tree copies. Since the index holds its own copies (or "copies" due to pre-de-duplication), this means there are always three "active copies" of each file:

  • There is the HEAD (or current commit) copy. This one is Git-ified and can't be changed because it is in a commit.

  • There is the index copy. This one is Git-ified, but can be changed, because it's only a proposed commit, not a real one yet.

  • Finally, there's the working tree copy. This is the only one you can see and work on / with.

You use and modify the working tree copies. You use git add or git rm to create or remove the index copies, and then you use git commit to turn the proposed next commit into an actual commit.


2This means that git commit --amend is a lie. It doesn't amend the commit, it makes a new and improved replacement. The old commit still exists! This is true of git rebase as well. Things that, in Git, seem to change commits, don't really change them at all. You can tell by saving and comparing those hash IDs—but humans normally just sort of bleep right over them, which is a good thing: it lets us replace bad commits with better ones, without humans noticing that we did that.

3The git commit command offers a flag, -a, that means in effect: first, run git add -u, then run git commit. There are a bunch of subtleties about this, but the key item to note here is that this runs git add -u. The -u option will only update tracked files, so you can't use this for new files. Git therefore forces you to learn about git add.

4Git in fact just adds the object right away, and throws it away later if appropriate, but you can view this as "temporarily added" if you prefer.

5When you're in the middle of a conflicted merge, Git knows that you're in the middle of a conflicted merge because there are some index entries that have a higher "staging number" than staging-number-zero. In this mode, Git won't make a new commit from the index at all, so one can argue that in this mode, the index doesn't hold the proposed next commit.

  •  Tags:  
  • git
  • Related