How does Git know if a file is staged or committed?

The Git index file stores each file's hash and name, along with other metadata such as creation and modification dates. When we run git status, how does Git know whether a file that already exists in the index is staged or committed, given that it stores the file's hash for both staged and committed files?

So how does Git differentiate between staged and committed files?

CodePudding user response：

When tracked files in Git are changed and added to the staging area, the entries in your staging area together describe a 'tree' that captures the full state of the repository you are about to commit. This information is stored in the index file.

When you perform a commit, Git creates a commit object that contains a reference to the 'tree' object built from the staging area.

Git can tell which files are staged by comparing the index against the tree of the current commit: a file is staged if it does not appear in that tree (new file) or if its object hash differs from the one in that tree (updated file). Either way, the file hasn't been committed yet. Once you perform the commit, the file no longer shows as staged, because its index entry now matches the tree pointed at by the new commit object.

At the heart of it is the use of hashes to detect differences between files, coupled with the 'tree' mechanism.

You can read more about Git's internals in the official book, which has a chapter on internals.

CodePudding user response：

"When we run git status, how does Git know whether a file that already exists in the index is staged or committed, given that it stores the file's hash for both staged and committed files?"

This question fundamentally doesn't make sense. I think you're imagining Git is doing something that Git doesn't actually do.

As the documentation you linked notes, each index entry holds:

- the file's full name (path/to/file.ext, but with compression depending on the index format);
- the file's "mode" (100644 or 100755 for files, 120000 for symbolic links, and 160000 for gitlinks; these entries are never tree objects, so 040000 simply never appears¹);
- the staging number (0 to 3);
- the hash ID of the file's content; and
- the cache data.

That's what you see from git ls-files --stage and git ls-files --debug:

100644 2b9a0a4e22fa5256575b6b184a16fc93ee47869e 0       compat/inet_pton.c
100644 bc2f9382a17f8f63a783680770a46358964e8f81 0       compat/linux/procinfo.c
100644 56bcb4277f47c295993c853a84edfe45e6aa3911 0       compat/memmem.c
100644 2607de93af594034413e4512c4d7dd33fb133023 0       compat/mingw.c

for instance.
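To make the shape of an index entry concrete, here is a minimal Python sketch that parses text in the git ls-files --stage layout shown above. The IndexEntry type and parse_ls_files_stage function are names made up for this illustration, and the sketch assumes plain paths (Git quotes paths containing unusual characters, which this does not handle).

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    mode: str      # e.g. "100644"; kept as a string, it is an octal label
    hash_id: str   # hash ID of the staged content
    stage: int     # 0 normally, 1 to 3 while a file is conflicted
    path: str


def parse_ls_files_stage(output: str) -> List[IndexEntry]:
    """Parse lines of the form '<mode> <hash> <stage> <path>'."""
    entries = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # The first three fields are fixed; the rest of the line is the path.
        mode, hash_id, stage, path = line.split(None, 3)
        entries.append(IndexEntry(mode, hash_id, int(stage), path))
    return entries


sample = """\
100644 2b9a0a4e22fa5256575b6b184a16fc93ee47869e 0       compat/inet_pton.c
100644 56bcb4277f47c295993c853a84edfe45e6aa3911 0       compat/memmem.c
"""
entries = parse_ls_files_stage(sample)
for entry in entries:
    print(entry.stage, entry.hash_id[:7], entry.path)
```

Note that nothing in an entry says "staged" or "committed": there is only a path, a mode, a stage number and a hash ID.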

Git's index holds one² of these entries for every file that will be in the next commit. These are the "staged" files. That's all there is: the index holds the proposed next commit. The git add and git rm commands work by updating the proposed next commit, by changing what's in the index.

When you run git status, Git:

1. Reads the current commit. The current commit has a commit hash ID, which Git finds by reading HEAD, which either contains the hash ID directly, or contains the name of the current branch. If HEAD contains a branch name, Git reads the branch name's value to find the current commit hash ID.³
2. Reads the index.
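The HEAD lookup in the first step can be sketched roughly like this in Python. This is a simplification, not Git's actual code: read_head_commit is a hypothetical helper, and it assumes the branch value lives in a loose ref file under the Git directory, whereas real Git also consults packed refs and other ref storage.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_head_commit(git_dir: Path) -> Optional[str]:
    """Return the current commit hash ID, or None for an unborn branch."""
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return head  # detached HEAD: the file holds the hash ID directly
    # HEAD names a branch, e.g. "ref: refs/heads/main"; read its value.
    ref_file = git_dir / head[len("ref: "):]
    if not ref_file.exists():
        return None  # the branch name has no value yet (unborn branch)
    return ref_file.read_text().strip()


# Demo on a throwaway directory laid out like a Git directory.
demo = Path(tempfile.mkdtemp())
(demo / "HEAD").write_text("ref: refs/heads/main\n")
unborn = read_head_commit(demo)

(demo / "refs" / "heads").mkdir(parents=True)
(demo / "refs" / "heads" / "main").write_text("a" * 40 + "\n")
on_branch = read_head_commit(demo)
```

Here unborn is None because the branch file does not exist yet, and on_branch is the (fake) hash ID written into the branch file.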

The current commit's tree (the saved tree object that goes with the commit), plus the entries in the index, tell Git which files match the current commit: that is, which files are in both the index and the current commit with matching hash IDs. For these files, the proposed next commit and the current commit match, so Git won't bother saying anything about them. But they will still be in the next commit, if and when you make one.

For files that have different hash IDs, or files that no longer exist or are new, git status will tell you that these files are staged for commit (and/or are new or are staged to be deleted). So that's where the staged for commit output comes from: if the hash ID in an index entry, e.g., the one for compat/memmem.c, doesn't match the hash ID in the current commit's tree, then compat/memmem.c is "staged for commit".

Note that this is the same information that you'd get for a git diff --name-status that compares HEAD vs the index.⁴ So the staged for commit part is just the output of git diff, reformatted a bit.
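As a rough illustration of that first comparison, here is a Python sketch that treats both the current commit's tree and the index as simple path-to-hash-ID mappings. The function name and the shortened hash IDs are invented for the example, and the sketch ignores things a real git diff also handles, such as rename detection and mode changes.

```python
from typing import Dict, List, Tuple


def diff_name_status(old: Dict[str, str], new: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare two path -> hash ID mappings, name-status style."""
    result = []
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            result.append(("A", path))   # only on the new side: added
        elif path not in new:
            result.append(("D", path))   # only on the old side: deleted
        elif old[path] != new[path]:
            result.append(("M", path))   # on both sides, hash IDs differ
        # Matching hash IDs: nothing to report for this path.
    return result


# Hypothetical, shortened hash IDs for illustration only.
head_tree = {"README": "1111", "compat/memmem.c": "56bc", "old.c": "2222"}
index = {"README": "1111", "compat/memmem.c": "9999", "new.c": "3333"}

staged = diff_name_status(head_tree, index)
for status, path in staged:
    print(status, path)
```

README is in both mappings with the same hash ID, so it is not mentioned at all, even though it will be in the next commit. Passing an empty mapping as the old side mimics the unborn-branch case: every index entry comes out as added.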

Having figured out what to tell you is "staged for commit", git status now goes on to do a second git diff. This time, it diffs the index files vs the working-tree files.⁵ Once again, for every file that is in the index, but isn't in the working tree or has different content in the working tree, Git will say something about it, under the not staged for commit grouping. This particular diff is a little different for two reasons:

1. Git can't just compare hash IDs. Git needs to use the index hash ID to expand the file contents out, and compare that to the actual file contents. But Git can use the cache data in the index to rapidly skip over files that it can prove have not been changed. (For cases where it can't skip, git status here can be very slow. Git is full of special-case optimizations to take advantage of as many "can skip because _____ (fill in the blank)" cases as anyone's found so far, and it's constantly growing new ones, like the file system monitor stuff.)
2. When Git reads the working tree (laboriously if necessary; again, Git tries to cheat and skip work if possible here) and comes across a non-ignored file that isn't in the index, this file is grouped into a special category.

The special category of "file in working tree not in index and not marked ignore" logically should show up in git diff --name-status as status A, or "added", but in fact git diff just doesn't say anything at all here. To make git status more useful than plain git diff, git status collects these names up and then describes them as untracked files. The untracked files listing output will compress an entire directory's worth of untracked files as dir/ instead of listing every file, if it's allowed to do so.
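The second comparison, including the untracked category, can be sketched in Python as well. Everything here is a simplification under stated assumptions: classify_working_tree is a made-up name, the working tree is an in-memory mapping of path to bytes, ignore rules are a plain set of paths, and the repository is assumed to use SHA-1. Instead of expanding the stored content, the sketch hashes the working-tree content the way Git hashes a blob and compares hash IDs; either way the file contents have to be read in full, which is the expensive part the cache data lets Git skip. Content filters and line-ending conversion are ignored.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def blob_hash(content: bytes) -> str:
    """SHA-1 hash ID Git assigns to a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def classify_working_tree(
    index: Dict[str, str], work_tree: Dict[str, bytes], ignored: Set[str]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (not-staged changes, untracked paths)."""
    unstaged = []
    for path, hash_id in sorted(index.items()):
        if path not in work_tree:
            unstaged.append(("D", path))   # in the index, gone from disk
        elif blob_hash(work_tree[path]) != hash_id:
            unstaged.append(("M", path))   # content differs from the index
    # On disk, not in the index, not ignored: the special category.
    untracked = [p for p in sorted(work_tree) if p not in index and p not in ignored]
    return unstaged, untracked


index = {
    "a.txt": blob_hash(b"one\n"),
    "b.txt": blob_hash(b"two\n"),
    "c.txt": blob_hash(b"three\n"),
}
work_tree = {
    "a.txt": b"one\n",
    "b.txt": b"two, edited\n",
    "notes.txt": b"scratch\n",
    "build.log": b"...",
}
unstaged, untracked = classify_working_tree(index, work_tree, ignored={"build.log"})
print(unstaged)
print(untracked)
```

In this toy setup b.txt is reported as modified, c.txt as deleted, notes.txt as untracked, and build.log is left out because it is ignored.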

¹If you try to jam one in somehow, and the attempt isn't rejected outright, the mode mysteriously changes into 160000. This is arguably a bug: if the index could hold 040000 entries, we could commit empty directories. See also How do I add an empty directory to a Git repository?

²When the staging number is nonzero, Git's index holds up to three entries per file, but you aren't allowed to run git commit at this point. The nonzero-stage entries represent a "conflicted file". You must "resolve the conflict" and run git add, which will update the index so that the nonzero-stage entries are gone and there's only one stage-zero entry recording the resolved-conflict-file content. (Or, you can git rm the whole thing so that there's no entry at all, meaning the next commit won't have the file.)

³In the slightly peculiar case of an unborn or orphan branch—Git uses these terms more or less interchangeably—HEAD holds the name of a branch that doesn't exist. An attempt to read the value of that branch name produces an internal "does not exist" error, which Git handles by saying to itself: oh, an unborn branch! To make things easier for the rest of Git, Git immediately pretends that the current commit's tree is the empty tree, so that the remainder of the code doesn't need a special case: by using the empty tree as the current commit's tree, all the files appear to be new files.

⁴You can get this with git diff --name-status --cached or git diff --name-status --staged. The --cached and --staged options here are pure synonyms. (A lot of Git commands now accept either option, to help phase out the ancient name "cache" for the index / staging area. Some Git commands still only have --cached as an option though, e.g., git rm.)

⁵This diff is the one you get with a plain git diff, e.g., git diff --name-status, but note the special casing for "untracked" files here: this kind of git diff doesn't bother to mention untracked files at all, while git status does.

Review / TL;DR version

"When we run git status, how does Git know whether a file that already exists in the index is staged or committed ..."

Everything in the current commit is committed. The commit's contents are there, stored in the commit, permanently (well, mostly-permanently⁶) and quite read-only. Git needs the hash ID of the current commit to find that commit and thence to see what's in the current commit, but Git is easily able to find that hash ID by reading HEAD.

Meanwhile, whatever's in the staging area is staged. That includes files that haven't changed since git switch or git checkout initially filled the index from the current commit. So git switch somebranch has filled the index and all the files match,⁷ and if you haven't futzed with the index since then, git status doesn't say anything, even though every file is staged.

⁶A commit that literally can't be found, so that you aren't sure if it even exists, might eventually go away and stop existing. How and when are outside the scope of this answer. If you can see the commit with git log or git log --all, it's still there for sure, though.

⁷If you have staged and/or unstaged changes that git status reports and you run git checkout or git switch and it succeeds—it might fail—then the switch deliberately did not update some file(s). As a simple overview, the setup is kind of obvious: every commit stores de-duplicated files, via the hash ID trick, so if we're moving from commit a123456 to b789abc and most of the files are the same, the switching doesn't have to ream out those files in the first place. But because index and working tree can be modified or not-modified separately, this actually gets pretty hairy. See Checkout another branch when there are uncommitted changes on the current branch for the gory details.