Does git commit take a snapshot of the whole repository or just of the staging area?

Time:09-27

Does the command git commit take a snapshot of my whole project or just of the staging area/index?

I already know that git commits are nodes of a graph, that every commit stores a pointer to the previous commit in the chain, that a branch is just a pointer to a commit and that HEAD is a pointer to the current branch.

However, I have been searching the Internet for an answer to this question, and I seem to get different answers from different websites, so I can't get my head around this.

From the Pro Git book:

In order to begin tracking a new file, you use the command git add.

Now that your staging area is set up the way you want it, you can commit your changes. Remember that anything that is still unstaged — any files you have created or modified that you haven’t run git add on since you edited them — won’t go into this commit. They will stay as modified files on your disk.

From the Git documentation:

git-commit
Create a new commit containing the current contents of the index and the given log message describing the changes

As far as I can understand from those statements, git commit takes a snapshot of just the staging area.

However, looking for some answers in this sub I found:
This answer

In Git, all commits are immutable snapshots of your project (ignored files excluded) at a specific point in time. This means that each and every commit contains a unique representation of your entire project, not just the modified or added files (deltas), at the time of commit.

Everytime a new commit is created, a snapshot of your entire project is recorded and stored to the internal database following a DAG data structure.

And this answer

A "commit" in git is not a change (delta), but represents the entire state. But a commit contains more than just state: it also has a pointer to a parent commit (typically one, but can be any number), i.e. the previous commit in the repository's history.

After posting this question, I was able to find this answer too, which seems to give more details:

You can have a lot of changes in a lot of files and only include in a commit a few of these changes. That does not mean that those other files are not part of the commit: technically all unchanged files are part of that commit, they are just not part of the commit diff.

From those answers, it seems to me that a commit is a snapshot of the whole state of the project at a given time.

Does somebody have a good, yet simple, explanation?

CodePudding user response:

Discussion can get seemingly sloppy when everyone's so familiar with the subject they start presuming listeners understand. When listeners do understand, communications can get briefer and briefer as the shared assumptions start compounding.

So for instance when discussing Git commits, people very familiar with Git will understand in what contexts "your entire project" is referring only to its tracked state.
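A toy model may help reconcile the two readings. This is only a sketch, not Git's real data structures: the dictionaries `working_tree`, `index`, and `commits`, and the helper functions `add` and `commit`, are hypothetical stand-ins. It shows that a commit records every file currently in the index (so unchanged files are included), while edits that were never staged are left out.

```python
# Toy model only: plain dicts stand in for Git's working tree, index, and commits.
working_tree = {"a.txt": "v1", "b.txt": "v1"}
index = {}
commits = []


def add(path):
    """Copy the working-tree version of one file into the index (like `git add`)."""
    index[path] = working_tree[path]


def commit(message):
    """Record a full copy of the index (like `git commit`).

    The working tree is never consulted here.
    """
    snapshot = dict(index)
    commits.append({"message": message, "snapshot": snapshot})
    return snapshot


add("a.txt")
add("b.txt")
first = commit("initial")

# Edit both files, but stage only one of them.
working_tree["a.txt"] = "v2"
working_tree["b.txt"] = "v2"
add("a.txt")
second = commit("update a.txt")

print(second)  # {'a.txt': 'v2', 'b.txt': 'v1'}
```

The second snapshot still contains b.txt, in the version that was last staged. In that sense the commit is a snapshot of "the whole project", where "the project" means the tracked state held in the index.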

CodePudding user response:

With some minor quibbles, the index / staging-area (these are two terms for the same thing) is the snapshot, as jthill said indirectly, and chepner said directly in a comment. The index contains—indirectly—the full form of every file that will be in the new commit, in the format it will have in the commit, with the following exceptions:

  • The index entry for any given file includes a stage number. This number must be zero, or git write-tree—which is the internal step that turns the index into a Git internal tree object—will complain and fail.

    (Nonzero stage numbers imply that you're in the middle of a conflicted merge, as a result of git merge, git cherry-pick, git rebase, or similar. Your job as the programmer, in this case, is to resolve the conflicts and write the correct version of the file to the index as "stage zero", which you normally do using working tree files and git add, perhaps with the help of git mergetool and/or other programs.)

  • Each index entry stores the full path name of the file, e.g., path/to/file. Note that the index has this path name with forward slashes, even on Windows, where the OS itself generally refers to the working tree copy as path\to\file.

  • The index includes the file's mode in a Git-ized internal format that "accidentally on purpose" more or less matches the Linux internal st_mode field, minus some data.

  • The index includes a bunch of cache data to speed things up; this cache data is not included in the snapshot.
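The points in this list can be sketched as a small data model. This is an illustration under assumptions, not Git's on-disk index format: the `IndexEntry` class, its `cached_mtime` field, the `entries_for_snapshot` function, and the `blob-1`/`blob-2` IDs are all hypothetical names. The stage check only imitates the way git write-tree refuses an index with unmerged entries.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    path: str                  # full path, always with forward slashes
    mode: int                  # e.g. 0o100644 (regular file), 0o100755 (executable)
    blob_id: str               # hash ID of the blob holding the file content
    stage: int = 0             # 0 = normal; 1-3 appear only during a conflicted merge
    cached_mtime: float = 0.0  # stand-in for the cache data kept for speed


def entries_for_snapshot(index):
    """Return the (mode, path, blob_id) triples that would go into a snapshot.

    Fails if any entry has a nonzero stage, and drops the cache data.
    """
    conflicted = [e.path for e in index if e.stage != 0]
    if conflicted:
        raise ValueError(f"unmerged paths: {conflicted}")
    return [(e.mode, e.path, e.blob_id) for e in sorted(index, key=lambda e: e.path)]


index = [
    IndexEntry("path/to/file", 0o100755, "blob-2", cached_mtime=1700000000.0),
    IndexEntry("README", 0o100644, "blob-1"),
]

for mode, path, blob in entries_for_snapshot(index):
    print(f"{mode:o} {path} {blob}")
```

Only the mode, the path, and the blob ID survive into the snapshot; the stage number has to be zero and the cache field is discarded.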

The format of an index entry is generally compressed and rather fancy and there are multiple index version formats (they're numbered). See the Git documentation for more; note that this recently moved from an internal technical documentation directory. This isn't something normal users normally care about, but it's been officially documented so that other language libraries can work with Git index files.

Git's "tree" objects store file name components, e.g., when a file is named path/to/file, the path gets broken up at the slashes to form two tree objects path and to. The file entry then goes in the tree object for to, whose hash ID goes in the tree object for path; the hash ID for the top level tree object goes into the commit as the commit's (required, and required-to-be-singular) tree entity. That provides the commit's version of indirection, which is hierarchical, vs the index's indirection, which is flat: the index stores Git blob object hash IDs directly, along with full path names.
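The difference between the flat index and the hierarchical trees can be sketched as follows. This is a simplified illustration: nested dictionaries stand in for tree objects, and `build_trees` and the `blob-N` IDs are hypothetical names. Real tree objects also store a mode for each entry and refer to sub-trees by hash ID.

```python
def build_trees(flat_index):
    """Turn {"path/to/file": blob_id} into nested dicts, one dict per tree object."""
    root = {}
    for path, blob_id in flat_index.items():
        *dirs, name = path.split("/")
        tree = root
        for d in dirs:
            # Descend into (or create) the sub-tree for this path component.
            tree = tree.setdefault(d, {})
        tree[name] = blob_id
    return root


flat = {"README": "blob-1", "path/to/file": "blob-2", "path/other": "blob-3"}
nested = build_trees(flat)
print(nested)
# {'README': 'blob-1', 'path': {'to': {'file': 'blob-2'}, 'other': 'blob-3'}}
```

The outer dictionary plays the role of the top level tree, whose hash ID is what a commit would record.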

By storing the file contents as blob objects, Git gets instant de-duplication for duplicated file content. This also means that Git can instantly tell whether some file in commit C1 has the same content as some file in commit C2, because if those two files have the same content, they will use the same blob hash ID.
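A minimal sketch of why identical content shares one blob: a blob's hash ID depends only on the file's bytes, not on its name or location. The example assumes the SHA-1 object format, which is the long-standing default (repositories created with the SHA-256 format hash differently). The function name `blob_id` is illustrative.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute the hash ID Git gives a blob with this content (SHA-1 object format)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Two files with the same bytes map to the same blob, whatever they are called.
license_in_c1 = b"same text\n"
copy_in_c2 = b"same text\n"
print(blob_id(license_in_c1) == blob_id(copy_in_c2))  # True

# The empty file has a well-known ID.
print(blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Comparing two files across commits then reduces to comparing two hash IDs, without reading either file's content.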

Note that (again with some quibbles), checking out a commit consists of reading its top level tree into the index and your working tree. Making a new commit consists of writing the index to a series of tree objects, ending with a single top level tree, and then writing a commit object that refers to this tree (and then updating the current branch name, if in attached-HEAD mode). The quibbles here all have to do with uncommitted work: see "Checkout another branch when there are uncommitted changes on the current branch".
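The last two steps of making a commit can be sketched like this, again assuming the SHA-1 object format. The helper names `object_id` and `commit_body`, the `refs` dictionary, the branch name `main`, and the author string are hypothetical. The tree used here is the empty tree, so that no tree serialization is needed.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """Hash an object the way Git does: "<kind> <size>\\0" followed by the body."""
    return hashlib.sha1(b"%s %d\0" % (kind.encode(), len(body)) + body).hexdigest()


def commit_body(tree_id, parent_ids, author, message):
    """Build the text of a commit object: one tree line, then zero or more parents."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += [f"author {author}", f"committer {author}", "", message]
    return ("\n".join(lines) + "\n").encode()


# A toy branch table and an attached HEAD (assumed names).
refs = {"main": None}
head = "main"

empty_tree = object_id("tree", b"")  # the top level tree of an empty index
who = "A U Thor <author@example.com> 0 +0000"  # placeholder identity

parents = [refs[head]] if refs[head] else []
first = object_id("commit", commit_body(empty_tree, parents, who, "initial"))
refs[head] = first  # update the current branch name to the new commit

second = object_id("commit", commit_body(empty_tree, [refs[head]], who, "second"))
refs[head] = second

print(empty_tree)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Each commit names exactly one top level tree plus its parents; the snapshot itself lives in the tree and blob objects that the tree line points to.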
