How (and how often) does Git scan the working directory for "untracked" files?

Once you run git init on a directory, git knows about all files that exist in that directory and below. These are untracked files. I'm familiar enough (ish) with what happens when you stage those files and they become "tracked" files. But what I want to know is, how does Git find the untracked files? Does git run the tree command every few minutes or on fixed intervals? Does it run the tree command every time you type git status, or git add? Or every time you save a file? Is something cleverer going on?

So my questions in short:

How does Git find the untracked files?
How often does Git search for "untracked" files?

CodePudding user response:

This used to have a simple and easy answer, 5 or more years ago. Now, it's ... much more complicated. However, the behavior today is supposed to match that of yore, so if you don't care about the actual implementation—and you're not supposed to have to care—we can describe the old implementation.

Once you run git init on a directory, git knows about all files that exist in that directory and below.

That's not actually true: git init creates the .git directory / folder and populates it, but at this point Git hasn't looked in the working tree yet, other than to see whether there's a .git there already. (Running git init in an existing repository "reinitializes" it rather than creating a new repository. The reinitialization step usually does absolutely nothing, though technically it's defined as, more or less, copying in any newly added template files such as hooks.)

It's true, however, that any existing files in the working tree are now untracked files:

These are untracked files. I'm familiar enough (ish) with what happens when you stage those files and they become "tracked" files. But what I want to know is, how does Git find the untracked files?

Logically, Git does this simply by reading, laboriously, one entry at a time, through the top level of the working tree:

DIR *dirp = opendir(worktree);

in C code, followed by a loop calling readdir.

Each entry that comes back has several items. If we assume the variable holding the struct dirent * pointer is named dp, we have at least two or three fields (d_namlen exists on BSD-derived systems such as macOS, but Linux does not provide it):

dp->d_name
dp->d_namlen
dp->d_type

The d_type field can contain DT_UNKNOWN, meaning that the system did not provide a file type, or DT_REG or DT_DIR or DT_LNK as three possibilities meaning "regular file", "directory" (folder), or "symbolic link". (Other values are also possible on Unix-like systems, but Git won't store such entries.)

If the type is DT_UNKNOWN, Git will need to call lstat on the file's name, as constructed by combining the original worktree path with a slash and the C string in dp->d_name (it's NUL-terminated despite also having a d_namlen field). That lstat call should succeed, and return all the usual information from any stat system call. Note that lstat calls are very expensive (much more so than opening and reading a directory, in general) so we want to avoid them whenever possible, and once done, we want to keep that information as long as possible. So Git tends to do this; see more below.
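As a rough illustration of the loop described so far, here is a minimal sketch, assuming a POSIX system whose struct dirent has a d_type field. The names entry_kind and list_level are made up for this example and are not Git's own functions; Git's real directory-reading code is considerably more involved.

```c
#define _DEFAULT_SOURCE /* exposes d_type and the DT_* constants on glibc */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: classify one entry as 'f' (regular file),
 * 'd' (directory), 'l' (symbolic link), or '?' (anything else, or error). */
char entry_kind(const char *dir, const struct dirent *dp)
{
    char path[4096];
    struct stat st;
    int n;

    assert(dir != NULL && dp != NULL);
    switch (dp->d_type) {
    case DT_REG: return 'f';
    case DT_DIR: return 'd';
    case DT_LNK: return 'l';
    case DT_UNKNOWN:
        break;              /* the system gave no type: fall back to lstat */
    default:
        return '?';         /* sockets, devices, etc. */
    }
    n = snprintf(path, sizeof path, "%s/%s", dir, dp->d_name);
    if (n < 0 || (size_t)n >= sizeof path || lstat(path, &st) != 0)
        return '?';
    if (S_ISREG(st.st_mode)) return 'f';
    if (S_ISDIR(st.st_mode)) return 'd';
    if (S_ISLNK(st.st_mode)) return 'l';
    return '?';
}

/* Hypothetical helper: print one directory level, skipping ".", "..",
 * and ".git". Returns the number of entries printed, or -1 on error. */
int list_level(const char *dir)
{
    struct dirent *dp;
    int count = 0;
    DIR *dirp;

    assert(dir != NULL);
    dirp = opendir(dir);
    if (dirp == NULL)
        return -1;
    while ((dp = readdir(dirp)) != NULL) {
        if (strcmp(dp->d_name, ".") == 0 || strcmp(dp->d_name, "..") == 0 ||
            strcmp(dp->d_name, ".git") == 0)
            continue;
        printf("%c %s\n", entry_kind(dir, dp), dp->d_name);
        count++;
    }
    closedir(dirp);
    return count;
}
```

Calling list_level(".") from a main function prints one line per top-level entry. The lstat call only happens for entries where the system reported DT_UNKNOWN, which is the whole point of looking at d_type first.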

Git can now look in .gitignore (or a data structure built by reading .gitignore) to see if a directory (DT_DIR or S_ISDIR()) should be skipped. If so, this short-circuits all the recursion. If not, for each directory, Git must do all of this same work recursively, but see below.

For things that aren't directories and can be stored in a repository, Git can now test whether this entry is already in Git's index. (Directory entries are not stored as index entries of this sort, but the index has "extensions" that can store them; we'll see more about this below.) If a file found by this process is in Git's index, it is a tracked file. If a file found by this process is not in Git's index, it's an untracked file. That's pretty much all there is to tracked vs untracked. Note that a tracked file's index entry also holds its lstat data, transcribed there if we've done an lstat on it.
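To make the tracked-vs-untracked test concrete, here is a minimal sketch under some loud assumptions: the "index" is just a sorted array of path strings (tracked_paths, a stand-in for Git's real index), .gitignore handling is left out entirely, and every entry gets an lstat rather than using the d_type shortcut. The names scan_dir and scan_result are invented for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

struct scan_result {
    int tracked;
    int untracked;
};

/* bsearch comparator: key is a path string, elem points at an array slot. */
static int cmp_path(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Hypothetical recursive walk of root/rel. tracked_paths holds paths
 * relative to root and must be sorted in strcmp order. */
int scan_dir(const char *root, const char *rel,
             const char *const *tracked_paths, size_t n,
             struct scan_result *res)
{
    char child[2048], path[4096];
    struct dirent *dp;
    struct stat st;
    DIR *dirp;

    assert(root != NULL && rel != NULL && res != NULL);
    snprintf(path, sizeof path, "%s%s%s", root, *rel ? "/" : "", rel);
    dirp = opendir(path);
    if (dirp == NULL)
        return -1;
    while ((dp = readdir(dirp)) != NULL) {
        if (strcmp(dp->d_name, ".") == 0 || strcmp(dp->d_name, "..") == 0 ||
            strcmp(dp->d_name, ".git") == 0)
            continue;
        /* child is the path relative to root, as the mock index stores it */
        snprintf(child, sizeof child, "%s%s%s", rel, *rel ? "/" : "",
                 dp->d_name);
        snprintf(path, sizeof path, "%s/%s", root, child);
        if (lstat(path, &st) != 0)
            continue;
        if (S_ISDIR(st.st_mode)) {
            scan_dir(root, child, tracked_paths, n, res); /* recurse */
        } else if (S_ISREG(st.st_mode) || S_ISLNK(st.st_mode)) {
            if (bsearch(child, tracked_paths, n, sizeof *tracked_paths,
                        cmp_path) != NULL) {
                res->tracked++;
            } else {
                res->untracked++;
                printf("untracked: %s\n", child);
            }
        }
    }
    closedir(dirp);
    return 0;
}
```

A caller would zero a struct scan_result, pass "" as rel, and read the two counters afterwards. A real implementation would consult the ignore rules before recursing into each directory, which is where the short-circuit mentioned earlier comes in.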

Does git run the tree command every few minutes or on fixed intervals?

(By "tree command", I'll assume you mean the logical read-tree-recursively operation I just described.)

Neither. It scans the working tree, starting from the top level, on demand.

Does it run the tree command every time you type git status, or git add?

For git status, yes. For git add, only if you've used one of the "all" options (such as -A / --all), or asked git add to add an entity that is a directory, in which case it has to do the read-tree-recursively for the entity you named (provided it's not listed in a .gitignore).

Is something cleverer going on?

Yes. The description above is the logic, but the actual implementation tries to use shortcuts. In addition, there is now a thing called the "fsmonitor daemon".

Probably the most important shortcut depends on some trickery that works well on Unix-like file systems, but maybe not on others. This is the fact that each directory has a modification time. This modification time shows up as the st_ctime¹ field of a struct stat, if you perform an lstat or stat system call on the directory. We also know what time we last wrote to Git's index, as we store that in a field in the index.

Suppose we already know that path/to/dir had, say, eight untracked files in it, and know that they were all ignored. Suppose further that we keep—somewhere—an entry for this path/to/dir entity that stores its st_ctime field. And suppose we can stat it now and see that its ctime is unchanged and predates the update to Git's index. Then we can be sure that it has not acquired any new untracked files, and therefore we could re-use the old information.
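The check just described can be sketched like this. It is only an illustration of the idea, assuming whole-second timestamps and a made-up struct dir_cache_entry; Git's actual untracked cache records more stat data than this and lives in an index extension.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical cache record for one directory. */
struct dir_cache_entry {
    time_t ctime_sec; /* st_ctime seen when the directory was last scanned */
    int valid;        /* nonzero once ctime_sec has been filled in */
};

/* Remember the directory's current ctime. Returns 0 on success, -1 if the
 * directory cannot be stat'ed (the entry is then marked invalid). */
int dir_cache_record(struct dir_cache_entry *e, const char *dir)
{
    struct stat st;

    assert(e != NULL && dir != NULL);
    e->valid = 0;
    if (stat(dir, &st) != 0)
        return -1;
    e->ctime_sec = st.st_ctime;
    e->valid = 1;
    return 0;
}

/* Returns 1 if the old scan results for dir can be reused: the ctime is
 * unchanged and strictly predates the time the index was written.
 * Returns 0 if the directory must be read again. */
int dir_cache_usable(const struct dir_cache_entry *e, const char *dir,
                     time_t index_written)
{
    struct stat st;

    assert(e != NULL && dir != NULL);
    if (!e->valid || stat(dir, &st) != 0)
        return 0;
    return st.st_ctime == e->ctime_sec && st.st_ctime < index_written;
}
```

The strict "predates" comparison matters: if the directory changed in the same second the index was written, we can't tell which happened first, so the sketch plays it safe and rescans.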

Git also uses this same kind of trick to avoid rebuilding blob hash IDs for all tracked files, where it's much more important. This is easier than the untracked cache, so the tracked files (those listed in normal index entries) have had this kind of magic applied since early Git in 2005.
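For tracked files, the same idea looks roughly like the sketch below. The struct cached_stat and both function names are invented for illustration, and the field list is deliberately shortened: Git's index entries store more stat fields than this, and Git also has extra handling for files modified in the same second the index was written, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical subset of the stat data saved alongside a tracked file. */
struct cached_stat {
    time_t mtime_sec;
    time_t ctime_sec;
    off_t size;
    ino_t ino;
};

/* Copy the interesting fields out of a fresh lstat result. */
void cached_stat_fill(struct cached_stat *c, const struct stat *st)
{
    assert(c != NULL && st != NULL);
    c->mtime_sec = st->st_mtime;
    c->ctime_sec = st->st_ctime;
    c->size = st->st_size;
    c->ino = st->st_ino;
}

/* Returns 0 if every saved field still matches, meaning the stored blob
 * hash can be trusted; returns 1 if the file must be re-read and hashed. */
int needs_rehash(const struct cached_stat *c, const struct stat *st)
{
    assert(c != NULL && st != NULL);
    return c->mtime_sec != st->st_mtime ||
           c->ctime_sec != st->st_ctime ||
           c->size != st->st_size ||
           c->ino != st->st_ino;
}
```

One lstat per tracked file is still needed, but that is far cheaper than opening, reading, and hashing every file on each git status.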

The untracked cache code went into Git 2.5, back in 2015, and it was kind of flaky for a long time, but it's pretty solid now. Unfortunately it depends on assumptions that aren't true on NTFS (so it's often disabled on Windows systems).

Git gained something called the fsmonitor in 2017, as part of Git 2.16, and it too has been flaky. There has been a lot of work on it recently to make it less flaky and to make it work on systems other than Windows: specifically, it now has macOS and Linux implementations. The fsmonitor code is very different from the above: it listens for file system operations reported by the OS, which may occur in directories that aren't even in the working tree, and on some systems, some events might be dropped if the event queue fills up.

The job for the fsmonitor is basically to make it faster to update Git's index. The details are still changing, so I won't attempt to describe it any further.

For technical information about what's in Git's index, see the Documentation/technical/index-format.txt file in the Git repository for Git (newer Git versions have moved this to Documentation/gitformat-index.txt).

¹You'd expect this to be st_mtime, and it would be, except that st_mtime can be set back on purpose via utimes or similar system calls. However, there's a second field, st_ctime, that stores a generally non-decreasing value that can't be set back this way. So we use ctime, the inode-change-time, instead of mtime, the modification time. Backup software typically uses the same trick for incremental backups.