Why is `commit` needed before `git checkout`?

Time:07-21

I thought git add was enough to record the modifications made before checking out another branch.

So why do the modifications have to be recorded in the local repository with a commit?

CodePudding user response:

It's because you can patch diffs in and out of your staged content in both directions, from history or from your work tree, at any time. Commit says "yep, this whole set is worth revisiting later", but before then it's still actively being edited, by you.

git add doesn't have to be treated as provisional; for people unused to thinking of their VCS as a tool that helps them iterate on drafts of the history they're eventually going to publish, it seems weird and unnecessary. Once you start thinking of your local history as a whole-tree, long-term extension of your editor's undo buffers, it makes much more sense. Git helps you do the work as well as work out how best to present it.

CodePudding user response:

I thought git add was enough to record the modifications made before checking out another branch.

This is not the case.

Git is, at its heart, all about commits. A commit is a numbered entity—it has a unique hash ID, that random-looking string of letters and digits such as 30cc8d0f147546d4dd77bf497f4dec51e7265bd8. Git uses the hash ID to find the commit in its "objects" database, which stores the commits and various supporting objects needed to make the commits actually work. Each commit stores two things:

  • a full snapshot of all of your source files (in a special, read-only, Git-only, compressed and de-duplicated format so that making hundreds of commits doesn't take much space at all—most of the files are de-duplicated), and

  • some metadata: information about the commit itself, such as the name and email address of the author of the commit, and some date-and-time-stamps and so on.
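To see those two parts for yourself, here is a minimal sketch that makes one commit in a scratch repository and asks Git to print the raw commit object. The file name, user name, and email address are made up for the illustration.

```shell
set -eu
# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'hello\n' > greeting.txt
git add greeting.txt
git commit -q -m "first commit"

# The object behind HEAD has type "commit".
obj_type=$(git cat-file -t HEAD)
# Its text names the snapshot (a "tree" line) and carries the metadata
# (author, committer, timestamps, and the log message).
commit_text=$(git cat-file -p HEAD)

echo "$obj_type"
echo "$commit_text"
```

The `tree` line is how the commit refers to its snapshot; the remaining lines are the metadata.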

Commits are mostly permanent and completely read-only. (You can get rid of a commit by making it so that you can't find it unless you've memorized its hash ID; after a lot of time, Git will eventually discard the commit entirely.) Because a commit is read-only, and in a format that only Git can read in the first place, you literally can't work on a commit directly! This means that for you to use a commit, Git has to extract it.

Most version control systems behave this way. That is, you say "I'd like to work on commit XXXXX" (for some hash ID—in Git, you will normally use a branch name here, but that translates into a hash ID internally)—and the version control system says OK, let me go find and extract the archive that is stored under that ID. In traditional version control systems, ten minutes later or so you have a version you can work on / with. You do your work and then you run their "commit" verb and wait another half an hour for them to make a commit.

With Git, you say "extract the commit" and milliseconds (or maybe a few seconds) later you have the commit and can start work. But you can't just run git commit when you're done working: instead, Git makes you run git add first, and then run git commit.[1] The important question is not "do I have to do this", but rather "why do I have to do this?" The answer lies in Git's index or staging area. Git requires that you be aware of what the index is and does for you.


[1] In some cases, you can run git commit -a, but I don't recommend this because it skips right over the key reason you're using git add in the first place, and leaves you stranded when you hit one of the situations where git commit -a doesn't work (there are several of these; some depend on whether you use "pre-commit hooks" and how good someone was at writing a pre-commit hook, for instance). It also means that you can't grasp why .gitignore sometimes doesn't work, what happens with git stash, and numerous other cases.


Git's index, aka staging area, aka cache

The thing that Git calls by three names—index, staging area, and (rarely these days) cache—has multiple purposes, but its most important one, the one you really have to know at all times, is this: the index or staging area holds your proposed next commit.

When you run git commit, Git builds a new commit in milliseconds or seconds, not the minutes or hours older version control systems tended to take. The reason Git can do this is that the index always holds the proposed next commit. When you run git commit, Git hardly has to do any work at all: it just takes all the files that are in the index right then and freezes them into a new commit.

What's in the index is hard to see, kind of on purpose. There is a Git command that can list its content—git ls-files—but it's not one users will normally use; it's not very useful. Here's a snippet of git ls-files --stage output, as run on the Git repository for Git:

[...]
100755 af5da45d2878e07ffe4586bfb8c1dc16134f9e95 0       Documentation/cmd-list.perl
[...]
100644 1fb1b2ea90c5953eb465d3b108c01f23fb442a32 0       commit.c
100644 21e4d25ce7878ab6463110b6b3104c7b99d159cb 0       commit.h
100644 c531372f3ff0f13839a2056568cbaa7fd37f98a8 0       common-main.c
100644 40dbfb170dabc5a43d1194bd2fb32fe1580ff980 0       compat/.gitattributes
100644 19fda3e877641c3c083618928c241351f1caaf91 0       compat/access.c
[...]
100644 8442bd436efeab81afc25db9d89da082638fcca4 0       xdiff/xtypes.h
100644 115b2b1640b4504d1b7eb1bc4dc1428b109f6380 0       xdiff/xutils.c
100644 fba7bae03c7855ca90aff3f238321581a91a6676 0       xdiff/xutils.h
100644 d594cba3fc9d82d94b9277e886f2bee265e552f6 0       zlib.c

For every file that would go into a new commit (if I made one right now), the index has one of the above entries. This entry gives the file's mode ("executable" or "not executable", which for silly historical reasons is spelled 100755 or 100644), the file's name, and the (hash ID of the) de-duplicated, ready-to-commit file content, among other information. The 0 in the last column before the file's name is the stage number, which in normal use is always zero.
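If you'd like to poke at one of these entries on a small scale, the sketch below stages a single file in a scratch repository and compares its index entry with the hash ID that git hash-object computes for the same content. The file name is made up.

```shell
set -eu
# Throwaway repository with one hypothetical file.
repo=$(mktemp -d) && cd "$repo"
git init -q

printf 'hello\n' > greeting.txt
git add greeting.txt

# One index entry: <mode> <hash ID> <stage number><TAB><path>
entry=$(git ls-files --stage)
# The hash ID Git computes for this file's content.
blob=$(git hash-object greeting.txt)

echo "$entry"
echo "$blob"
```

The hash ID in the middle of the entry is the same one git hash-object prints: the index refers to the file's content by its hash ID.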

So, if I were to run git commit right now, Git would use the files in the index to make a new snapshot. Well, actually, it would complain:

$ git commit
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean

because all the files in the index match all the files in the current commit! (Git can tell this quite fast—just a few milliseconds—because the contents are all pre-de-duplicated, so the internal hash IDs all match.)

What running git add on a file does is update the index / staging-area. That is, suppose I open up Makefile or xdiff/xutils.h in my editor and make changes to this file. I write these changes out. The changes I made go into the working tree, which holds the extracted files. The working tree does not hold the committed files: those are in a special Git-only de-duplicated format and my editor and compiler can't even read them. Nothing—not even Git itself—can change them. But the working tree files are just ordinary everyday files, so everything can read and change those, including me and my editor.

If I do change them, though, I must prepare an updated, de-duplicated copy to be committed. To do that, I must run git add. The git add command reads the working tree copy and compresses it down and checks for duplicates:

  • If there is a duplicate, git add records the existing copy's hash ID in Git's index / staging-area. The compressed copy git add made is no longer needed; it gets discarded.

  • If there is no duplicate, git add records the new copy's hash ID in Git's index / staging-area. The compressed copy git add made will be needed, but it's ready for committing now!

So in either case the index is once again ready for the next commit. The git add command has merely updated the proposed next commit by updating the index aka staging area.
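Here is a small sketch of both effects, assuming a scratch repository and two made-up files: identical content ends up under a single hash ID, and the index entry for a file changes only when git add runs again.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q

# Two hypothetical files with identical content.
printf 'same content\n' > a.txt
printf 'same content\n' > b.txt
git add a.txt b.txt
hash_a=$(git ls-files --stage a.txt | cut -d' ' -f2)
hash_b=$(git ls-files --stage b.txt | cut -d' ' -f2)

# Editing the working tree copy does not touch the index...
printf 'edited\n' > a.txt
hash_before=$(git ls-files --stage a.txt | cut -d' ' -f2)
# ...until git add copies the new content in.
git add a.txt
hash_after=$(git ls-files --stage a.txt | cut -d' ' -f2)

echo "a=$hash_a b=$hash_b"
echo "before add=$hash_before after add=$hash_after"
```

The first line of output shows the same hash ID twice (the duplicate), and the second shows the entry changing only after the second git add.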

Running git commit saves what's in the index

If I now run git commit, Git will make a new commit, with the new (i.e., updated) snapshot I prepared by running git add. This new snapshot will be saved forever—or rather, as long as the commit itself exists. Git will add this new commit, with its new snapshot and metadata, to the database of "all Git objects". (Any all-new files are now objects in the database as well, in their compressed format: they weren't duplicates. If they had been duplicates, they wouldn't be all-new and git add would have re-used the old saved content.)

(The new commit then becomes the latest commit on the branch, in the usual fashion; we'll skip all the details here.)

This is why it takes at least two commands to make a new commit: git add prepares things, updating the proposed next commit, and git commit turns the proposed commit into an actual commit. Until then, it's just a proposal.
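One way to check that the commit really is just the index, frozen: git write-tree is a low-level command that writes out the snapshot currently held in the index and prints its hash ID. In this sketch (scratch repository, made-up names), that ID is compared with the snapshot of the commit made right afterwards.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'v1\n' > file1
git add file1

# Hash ID of the snapshot the index is proposing right now.
index_tree=$(git write-tree)

git commit -q -m "first commit"

# Hash ID of the snapshot stored in the new commit.
commit_tree=$(git rev-parse 'HEAD^{tree}')

echo "$index_tree"
echo "$commit_tree"
```

The two IDs match: git commit stored exactly what the index held.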

Another way to look at this is that, at all times, there are three copies of each "active" file (assuming it's in the current commit):

  • There's a frozen-for-all-time, unchangeable copy of some file in the current commit.
  • For that particular file, there's a frozen-format (i.e., de-duplicated) copy in Git's index too.
  • And, for that particular file, there's an ordinary, usable, read/write copy in your working tree.

What git add does is copy the working tree version into the index.
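The three copies can be made visibly different. This is a sketch in a scratch repository with a made-up file: git show HEAD:file1 reads the committed copy, git show :file1 reads the index copy, and cat reads the working tree copy.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'committed\n' > file1
git add file1
git commit -q -m "initial"

printf 'staged\n' > file1
git add file1                  # index copy now says "staged"
printf 'working\n' > file1     # working tree copy edited again, not added

in_commit=$(git show HEAD:file1)   # copy in the current commit
in_index=$(git show :file1)        # copy in the index
in_worktree=$(cat file1)           # ordinary file

echo "$in_commit / $in_index / $in_worktree"
# prints: committed / staged / working
```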

This explains how git status works as well

When you run git status, Git says things like:

On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   file1

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   file2

What Git has done is to run two git diff operations:

  • First, Git compared what is in the current commit to what is in Git's index / staging area. Here, every file matched except for file1. File file1 is different. So Git calls that one out under "Changes to be committed", i.e., staged for commit. If I were to run git commit right now, the new commit would have a different copy of file1 than the copy that's in the current commit.

  • Then Git compared what is in the index to what is in my working tree. Here, every file matched except for file2. File file2 is different. So Git calls that one out as "not staged for commit". Running git commit won't store the updated file2 unless I run git add on it. Instead, it will store the unchanged file2 that's still there in Git's index.
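You can run the same two comparisons by hand. In this sketch (scratch repository, made-up contents for file1 and file2), git diff --cached compares the current commit to the index, and plain git diff compares the index to the working tree.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'one\n' > file1
printf 'two\n' > file2
git add file1 file2
git commit -q -m "initial"

printf 'one, edited\n' > file1
git add file1                      # file1: index differs from the commit
printf 'two, edited\n' > file2     # file2: working tree differs from the index

staged=$(git diff --cached --name-only)   # current commit vs. index
unstaged=$(git diff --name-only)          # index vs. working tree

echo "staged: $staged"
echo "not staged: $unstaged"
git status --short
```

The first comparison reports only file1 and the second only file2, which is the same split git status shows.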

Having listed these out, Git may go on to report untracked files. An untracked file is simply any file in your working tree that is not in Git's index. That's all there is to that, except for one extra feature.

.gitignore

Files that aren't in Git's index will, by definition, be omitted from the next commit. That's clear now: we know that Git makes the new commit from what's in Git's index. If the file isn't in the index, it won't be in the new commit.

Such a file is said to be untracked. Running git status reports on this file. But sometimes, an untracked file is untracked on purpose rather than by mistake. For instance, if we use Python, Python builds *.pyc files and/or *.pyo files (Python3 puts them in folders named __pycache__, which keeps down the clutter a bit, while Python2 drops them all over the working tree). We don't want to commit these files (it's a bad idea in general). But we don't want git status to whine about them either.

To make git status shut up about untracked files, we can list the files' names or patterns in one or more .gitignore files. These tell Git: When I run git status and these files are untracked, don't whine! This does not make the files untracked! It is the fact that the files aren't in Git's index that makes the files untracked. If a file is tracked, listing it in a .gitignore has no effect. This fact only makes sense when you understand what the index is doing.

Besides making git status shut up, listing file patterns in .gitignore allows us to use a short-cut: we can run git add . or git add * or git add --all, and Git will not add these files if they are untracked. That is, we can use en-masse "add everything", and it will only add untracked files as new files in the index if they would be complained-about.
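Both points can be tried in a scratch repository. The file names here are made up: notes.log is added before it is listed in .gitignore, so it stays tracked, while app.pyc is untracked and ignored, so the en-masse git add . skips it.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q

printf 'print("hi")\n' > app.py
printf 'compiled junk\n' > app.pyc
printf 'a log we track anyway\n' > notes.log

git add notes.log                        # tracked before .gitignore mentions it
printf '*.pyc\n*.log\n' > .gitignore

git add .                                # en-masse add

# Everything now in the index, one path per line.
tracked=$(git ls-files)
echo "$tracked"

# Listing *.log did not untrack notes.log: changes to it are still reported.
printf 'more log\n' >> notes.log
changed=$(git diff --name-only)
echo "$changed"
```

The index ends up with .gitignore, app.py, and notes.log, but not app.pyc, and the later edit to notes.log still shows up as a change.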

What this means is that .gitignore is the wrong name for this file: it should be called .git-do-not-complain-about-these-files-when-they-are-untracked-and-when-they-are-untracked-and-I-use-an-en-masse-git-add-command-do-not-add-them-to-the-index-either, or something like that. But do you really want to type in this file name? (No, you don't.) So .gitignore it is.
