Once you run git init on a directory, git knows about all files that exist in that directory and below. These are untracked files. I'm familiar enough (ish) with what happens when you stage those files and they become "tracked" files. But what I want to know is, how does Git find the untracked files? Does git run the tree command every few minutes or on fixed intervals? Does it run the tree command every time you type git status, or git add? Or every time you save a file? Is something cleverer going on?
So my questions in short:
How does Git find the untracked files?
How often does Git search for "untracked" files?
Answer:
This used to have a simple and easy answer, 5 or more years ago. Now, it's ... much more complicated. However, the behavior today is supposed to match that of yore, so if you don't care about the actual implementation—and you're not supposed to have to care—we can describe the old implementation.
Once you run git init on a directory, git knows about all files that exist in that directory and below.
That's not actually true: git init will create the .git directory / folder and populate it, but at this point Git hasn't looked in the working tree yet, other than to see if there's a .git there. That check is what lets git init in an existing repository "reinitialize" it rather than create a new repository. The reinitialization step usually does absolutely nothing, though technically it is defined as, more or less, copying in any newly added template files such as hooks.
It's true, however, that any existing files in the working tree are now untracked files:
These are untracked files. I'm familiar enough (ish) with what happens when you stage those files and they become "tracked" files. But what I want to know is, how does Git find the untracked files?
Logically, Git does this simply by reading, laboriously, one entry at a time, through the top level of the working tree:
DIR *dirp = opendir(worktree);
in C code, followed by a loop calling readdir.
Each entry that comes back has several items. If we assume the variable holding the struct dirent * pointer is named dp, we get at least two or three fields (which ones exist depends on the system):
dp->d_name
dp->d_namlen
dp->d_type
The d_type field can contain DT_UNKNOWN, meaning that the system did not provide a file type, or one of DT_REG, DT_DIR, or DT_LNK, meaning "regular file", "directory" (folder), or "symbolic link" respectively. (Other values are also possible on Unix-like systems, but Git won't store such entries.)
If the type is DT_UNKNOWN, Git will need to call lstat on the file's name, as constructed by combining the original worktree path with a slash and the C string in dp->d_name (it's NUL-terminated despite also having a d_namlen field). That lstat call should succeed, and return all the usual information from any stat system call. Note that lstat calls are very expensive (much more so than opening and reading a directory, in general) so we want to avoid them whenever possible, and once done, we want to keep that information as long as possible. So Git tends to do this; see more below.
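As a rough illustration of the d_type check with an lstat fallback, here is a minimal sketch. The names classify, count_subdirs, and enum kind are made up for this example and are not Git's own code; it also assumes a system whose struct dirent has a d_type field (Linux and the BSDs do).

```c
#define _DEFAULT_SOURCE   /* exposes d_type, DT_* and lstat on glibc */
#include <assert.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical classification for this sketch, not Git's own enum. */
enum kind { KIND_FILE, KIND_DIR, KIND_LINK, KIND_OTHER };

/* Classify one entry of `dir`, calling lstat only when d_type is unknown. */
enum kind classify(const char *dir, const struct dirent *dp)
{
    char path[PATH_MAX];
    struct stat st;
    int n;

    assert(dir != NULL && dp != NULL);
    switch (dp->d_type) {
    case DT_REG: return KIND_FILE;
    case DT_DIR: return KIND_DIR;
    case DT_LNK: return KIND_LINK;
    case DT_UNKNOWN: break;          /* no type supplied: fall back to lstat */
    default: return KIND_OTHER;      /* FIFOs, sockets, devices, ... */
    }

    /* Build "dir/name" and ask the file system directly. */
    n = snprintf(path, sizeof path, "%s/%s", dir, dp->d_name);
    if (n < 0 || (size_t)n >= sizeof path || lstat(path, &st) != 0)
        return KIND_OTHER;
    if (S_ISREG(st.st_mode)) return KIND_FILE;
    if (S_ISDIR(st.st_mode)) return KIND_DIR;
    if (S_ISLNK(st.st_mode)) return KIND_LINK;
    return KIND_OTHER;
}

/* Count the subdirectories directly inside `dir`; -1 if it cannot be opened. */
int count_subdirs(const char *dir)
{
    DIR *dirp = opendir(dir);
    struct dirent *dp;
    int count = 0;

    if (dirp == NULL)
        return -1;
    while ((dp = readdir(dirp)) != NULL) {
        if (strcmp(dp->d_name, ".") == 0 || strcmp(dp->d_name, "..") == 0)
            continue;
        if (classify(dir, dp) == KIND_DIR)
            count++;
    }
    closedir(dirp);
    return count;
}
```

Calling count_subdirs on a work tree path performs one opendir/readdir pass and touches lstat only for entries whose type the file system did not report.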
Git can now look in .gitignore (or a data structure built by reading .gitignore) to see if a directory (DT_DIR or S_ISDIR()) should be skipped. If so, this short-circuits all the recursion. If not, for each directory, Git must do all of this same work recursively, but see below.
For things that aren't directories and can be stored in a repository, Git can now test whether the entry is already in Git's index. (Directories are not stored as index entries of this sort, but the index has "extensions" that can store them; we'll see more about this below.) If a file found by this process is in Git's index, it is a tracked file. If a file found by this process is not in Git's index, it's an untracked file. That's pretty much all there is to tracked vs untracked. Note that a tracked file, whose information is stored in Git's index, has its lstat data recorded there as well, if we've done lstat on it.
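The logical scan described above can be sketched as a small recursive walk. This is an illustration only: struct toy_index, is_tracked, and count_untracked are hypothetical names, the "index" is just a sorted array of paths, .gitignore handling is left out, and for brevity it calls lstat on every entry, which is exactly the cost the real implementation works to avoid.

```c
#define _DEFAULT_SOURCE   /* exposes lstat (and mkdtemp) on glibc */
#include <assert.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical stand-in for Git's index: a sorted array of tracked paths. */
struct toy_index {
    const char **paths;   /* sorted with strcmp, relative to the work tree */
    size_t count;
};

static int cmp_path(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* A path is "tracked" exactly when it appears in the index. */
int is_tracked(const struct toy_index *idx, const char *rel)
{
    return bsearch(rel, idx->paths, idx->count,
                   sizeof *idx->paths, cmp_path) != NULL;
}

/*
 * Walk root/rel recursively and return the number of untracked files,
 * or -1 if the directory cannot be opened. Pass "" as rel for the top level.
 */
int count_untracked(const struct toy_index *idx, const char *root,
                    const char *rel)
{
    char full[PATH_MAX], child[PATH_MAX], path[PATH_MAX];
    struct dirent *dp;
    struct stat st;
    int total = 0;
    DIR *dirp;

    assert(idx != NULL && root != NULL && rel != NULL);
    snprintf(full, sizeof full, "%s%s%s", root, *rel ? "/" : "", rel);
    dirp = opendir(full);
    if (dirp == NULL)
        return -1;

    while ((dp = readdir(dirp)) != NULL) {
        const char *name = dp->d_name;
        if (!strcmp(name, ".") || !strcmp(name, "..") || !strcmp(name, ".git"))
            continue;                       /* never descend into .git */
        snprintf(child, sizeof child, "%s%s%s", rel, *rel ? "/" : "", name);
        snprintf(path, sizeof path, "%s/%s", root, child);
        if (lstat(path, &st) != 0)
            continue;                       /* vanished while we were scanning */
        if (S_ISDIR(st.st_mode)) {
            int n = count_untracked(idx, root, child);
            if (n > 0)
                total += n;
        } else if (S_ISREG(st.st_mode) || S_ISLNK(st.st_mode)) {
            if (!is_tracked(idx, child))
                total++;                    /* found on disk, absent from index */
        }
    }
    closedir(dirp);
    return total;
}
```

With an empty toy index every regular file and symbolic link counts as untracked, which matches the state of a work tree immediately after git init.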
Does git run the tree command every few minutes or on fixed intervals?
(By "tree command", I'll assume you mean the logical read-tree-recursively operation I just described.)
Neither. It scans the working tree, starting from the top level, on demand.
Does it run the tree command every time you type git status, or git add?
For git status, yes. For git add, only if you've used one of the "all" options, or asked git add to add an entity that is a directory, in which case it has to do the read-tree-recursively for the entity you named (provided it's not listed in a .gitignore).
Is something cleverer going on?
Yes. The description above is the logic, but the actual implementation tries to use shortcuts. In addition, there is now a thing called the "fsmonitor daemon".
Probably the most important shortcut depends on some trickery that works well on Unix-like file systems, but maybe not on others. This is the fact that each directory has a modification time. This modification time shows up as the st_ctime field (see footnote 1) of a struct stat, if you perform an lstat or stat system call on the directory. We also know what time we last wrote to Git's index, as we store that in a field in the index.
Suppose we already know that path/to/dir had, say, eight untracked files in it, and know that they were all ignored. Suppose further that we keep, somewhere, an entry for this path/to/dir entity that stores its st_ctime field. And suppose we can stat it now and see that its ctime is unchanged and predates the update to Git's index. Then we can be sure that it has not acquired any new untracked files, and therefore we could re-use the old information.
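The reuse test in that thought experiment might look roughly like this. It is a sketch under assumptions: struct dir_cache_entry and cache_entry_usable are invented for illustration and do not reflect the actual on-disk untracked-cache format, and timestamps are compared at one-second granularity only.

```c
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical cache record for one directory; not Git's on-disk format. */
struct dir_cache_entry {
    time_t ctime_sec;       /* st_ctime observed when the directory was scanned */
    int untracked_count;    /* what that scan found */
};

/*
 * Return 1 if the remembered scan of `path` may be reused: the directory's
 * ctime is unchanged and strictly older than the time the index was written.
 */
int cache_entry_usable(const char *path, const struct dir_cache_entry *entry,
                       time_t index_written)
{
    struct stat st;

    assert(path != NULL && entry != NULL);
    if (stat(path, &st) != 0)
        return 0;                        /* vanished or unreadable: rescan */
    if (st.st_ctime != entry->ctime_sec)
        return 0;                        /* an entry was added, removed or renamed */
    return st.st_ctime < index_written;  /* equal timestamps are ambiguous: rescan */
}
```

The strict comparison in the last line is a deliberate choice in this sketch: if the directory changed in the same second the index was written, the timestamps cannot prove which happened first, so the safe answer is to rescan. Each directory needs its own entry, because a change inside a subdirectory does not update the parent's timestamps.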
Git also uses this same kind of trick to avoid rebuilding blob hash IDs for all tracked files, where it's much more important. This is easier than the untracked cache, so the tracked files (those listed in normal index entries) have had this kind of magic applied since early Git in 2005.
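For tracked files, the same idea can be sketched as "compare the saved lstat data, and only re-read the file if something differs". The struct cached_stat below and the functions remember_stat and needs_rehash are hypothetical; the real index entry stores more fields than this, and this sketch ignores the same-second race that a real implementation has to handle.

```c
#define _DEFAULT_SOURCE   /* exposes lstat on glibc */
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical subset of the stat data kept beside a tracked path. */
struct cached_stat {
    off_t size;
    time_t mtime_sec;
    time_t ctime_sec;
    ino_t ino;
};

/* Record the current lstat data for `path`; returns 0 on success. */
int remember_stat(const char *path, struct cached_stat *out)
{
    struct stat st;

    assert(path != NULL && out != NULL);
    if (lstat(path, &st) != 0)
        return -1;
    out->size = st.st_size;
    out->mtime_sec = st.st_mtime;
    out->ctime_sec = st.st_ctime;
    out->ino = st.st_ino;
    return 0;
}

/* Return 1 if the file must be re-read and re-hashed, 0 if the cache matches. */
int needs_rehash(const char *path, const struct cached_stat *c)
{
    struct stat st;

    assert(path != NULL && c != NULL);
    if (lstat(path, &st) != 0)
        return 1;                        /* missing file: cached data is stale */
    return st.st_size != c->size ||
           st.st_mtime != c->mtime_sec ||
           st.st_ctime != c->ctime_sec ||
           st.st_ino != c->ino;
}
```

One lstat call per tracked file is far cheaper than opening the file, reading all of it, and computing a hash, which is why the shortcut matters most here.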
The untracked cache code went into Git 2.5, back in 2015, and it was kind of flaky for a long time, but it's pretty solid now. Unfortunately it depends on assumptions that aren't true on NTFS (so it's often disabled on Windows systems).
Git gained something called the fsmonitor in 2017, as part of Git 2.16, and it too has been flaky. There has been a lot of work on it recently to make it less flaky and to make it work on systems other than Windows: specifically, it now has macOS and Linux implementations. The fsmonitor code is very different from the above: it listens for file system operations reported by the OS, which may occur in directories that aren't even in the working tree, and on some systems, some events might be dropped if the event queue fills up.
The job for the fsmonitor is basically to make it faster to update Git's index. The details are still changing, so I won't attempt to describe it any further.
For technical information about what's in Git's index, see the Documentation/technical/index-format.txt file in the Git repository for Git.
Footnote 1: You'd expect this to be st_mtime, and it would be, except that st_mtime can be set back on purpose via utimes or similar system calls. However, there's a second field, st_ctime, that stores a generally non-decreasing value that can't be set back this way. So we use ctime, the inode-change-time, instead of mtime, the modification time. Backup software typically uses the same trick for incremental backups.
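The difference between the two timestamps is easy to observe. The sketch below uses the real utimes call; the helper names rewind_mtime and print_times are made up for this example.

```c
#define _DEFAULT_SOURCE   /* exposes utimes on glibc */
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>

/* Move the modification time of `path` back by `seconds`; 0 on success. */
int rewind_mtime(const char *path, time_t seconds)
{
    struct stat st;
    struct timeval times[2];

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    times[0].tv_sec = st.st_atime;            /* keep the access time */
    times[0].tv_usec = 0;
    times[1].tv_sec = st.st_mtime - seconds;  /* pretend it changed earlier */
    times[1].tv_usec = 0;
    return utimes(path, times);
}

/* Print both timestamps so the effect of rewind_mtime can be observed. */
int print_times(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    printf("%s: mtime=%lld ctime=%lld\n", path,
           (long long)st.st_mtime, (long long)st.st_ctime);
    return 0;
}
```

Calling print_times before and after rewind_mtime on a scratch file should show mtime jumping backwards while ctime stays the same or moves forward, since the utimes call itself counts as an inode change.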