Git lists files as changed but there are no changes

Time:09-30

This is the umpteenth version of the extremely basic question "why the heck is Git telling me that files changed but diff shows no changes?". Similar questions have been posted here and here but none of those answers help.

My scenario is as follows:

I added a .gitattributes file to an existing Git repo with several already existing commits in it. The content of the .gitattributes file looks as follows:

* text=auto

*.bat text eol=crlf
*.cmd text eol=crlf
*.ps1 text eol=crlf

*.sh text eol=lf

*.csproj   text eol=crlf
*.filters  text eol=crlf
*.props    text eol=crlf
*.sqlproj  text eol=crlf
*.sln      text eol=crlf
*.vcxitems text eol=crlf
*.vcxproj  text eol=crlf

*.cs        text
*.config    text
*.jmx       text
*.json      text
*.sql       text
*.tt        text
*.ttinclude text
*.wxi       text
*.wxl       text
*.wxs       text
*.xaml      text
*.xml       text

*.bmp binary
*.gif binary
*.ico binary
*.jpg binary
*.pdf binary
*.png binary

After adding that file I executed the following commands:

git rm --cached -r .
git reset --hard

The result is that git status now shows most of the files in the Git repo as modified. However, I cannot see any changes in any of those files. The diff tool isn't showing any changes, neither in the text view nor in its hex view.

The repo has been created on a Windows machine and I'm currently using it on a Windows machine. The output of the command git config --list is as follows:

http.sslbackend=schannel
diff.astextplain.textconv=astextplain
credential.helper=manager-core
core.autocrlf=true
core.fscache=true
core.symlinks=false
core.editor="C:\\Program Files\\Notepad++\\notepad++.exe" -multiInst -notabbar -nosession -noPlugin
pull.rebase=false
credential.https://dev.azure.com.usehttppath=true
init.defaultbranch=master
user.name=My Name
user.email=(address hidden)
core.autocrlf=true
core.eol=crlf
diff.tool=bc
difftool.bc.path=C:/Program Files/Beyond Compare 4/bcomp.exe
difftool.bc.cmd="C:/Program Files/Beyond Compare 4/bcomp.exe" "$LOCAL" "$REMOTE"
difftool.bc.prompt=false
merge.tool=bc
mergetool.bc.path=C:/Program Files/Beyond Compare 4/bcomp.exe
mergetool.bc.cmd="C:/Program Files/Beyond Compare 4/bcomp.exe" "$LOCAL" "$REMOTE" "$BASE" "$MERGED"
mergetool.bc.keepbackup=false
mergetool.bc.trustexitcode=true
core.repositoryformatversion=0
core.filemode=false
core.bare=false
core.logallrefupdates=true
core.symlinks=false
core.ignorecase=true

So the magic switches core.autocrlf and core.eol are as they should be for Windows as far as I could decrypt from the documentation.

Does anyone have a clue what Git landmine I've stepped on here?

Answer:

There are multiple possibilities here, but the most common by far has to do with these CRLF line endings. It's complicated, and to really get it, we need some background first.

From a high level point of view, Git basically has two options:

  • Don't mess with line endings ever.
  • Do mess with line endings.

The first one is really simple, and is the default on all Unix-like systems. It's probably the default on Windows too, but I don't use Windows, so I'd have to defer to anyone else who says otherwise. In this setup, if you create a file and store, in that file, the byte-sequence:

h e l l o CTRL-M CTRL-J w o r l d CTRL-M CTRL-J

and then git add the file and run git commit, Git will store, in the repository, a new commit in which that file contains those 14 bytes. The blob hash ID will be:

$ printf 'blob 14\0hello\r\nworld\r\n' | shasum
23eb407b644b0e362fa224168ecd0adfa02b022a

This file has CRLF line endings. Extracting the commit will produce a file with CRLF line endings. The file in the repository is now read-only, frozen for all time; it has blob hash ID 23eb407b644b0e362fa224168ecd0adfa02b022a, as does every file in any Git repository anywhere in the universe, as long as that file contains exactly that text.
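If you want to check the hash ID claim yourself, here is a minimal sketch. It assumes a repository format using Git's default SHA-1 hashing and that either sha1sum or shasum is installed; the variable names are just for illustration.

```shell
# Sketch: a blob ID is the SHA-1 of "blob <size>\0" followed by the content.

# Pick whichever SHA-1 tool is installed; both print "<hash>  <name>".
if command -v sha1sum >/dev/null 2>&1; then sha1=sha1sum; else sha1=shasum; fi

# Let Git hash the 14 bytes. Without --path, no line-ending conversion applies.
by_git=$(printf 'hello\r\nworld\r\n' | git hash-object --stdin)

# Hash the same 14 bytes by hand, with the "blob 14\0" header in front.
by_hand=$(printf 'blob 14\0hello\r\nworld\r\n' | $sha1 | cut -d' ' -f1)

echo "git hash-object: $by_git"
echo "by hand:         $by_hand"
```

The two printed IDs should be identical, which is all the "same content, same hash ID" rule amounts to.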

Now suppose, having created this file (or not), we turn on the "do mess with line endings" options. We now get numerous sub-options, specifying just how Git will go about messing with line endings, when, and on which files. These include eol=crlf, eol=lf, text, binary, and so on:

*.bat text eol=crlf
*.sh text eol=lf
*.jpg binary

This fragment tells Git that if the file's name ends with .bat, Git should mess with line endings in one particular way; if it ends with .sh, Git should mess with line endings in another particular way; and if it ends with .jpg, Git should not mess with line endings.

We know that the binary specification means that for such files, Git doesn't mess with line endings. This is good since, for instance, .jpg files do not actually have lines in the first place, so that anything that resembles a line ending is just coincidence. When Git isn't messing with anything, it's all easy: Git is storing what's there and showing you what's stored.
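To see which of these settings Git actually applies to a given path, git check-attr can be asked directly. This is a small sketch in a throwaway repository; the file names run.bat, build.sh and photo.jpg are made up for the example and do not need to exist.

```shell
# Sketch: ask Git which attributes apply to some (hypothetical) path names.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

printf '%s\n' '*.bat text eol=crlf' '*.sh text eol=lf' '*.jpg binary' > .gitattributes

# Attributes are matched against path names, so the files need not exist.
bat_eol=$(git check-attr eol -- run.bat)
sh_eol=$(git check-attr eol -- build.sh)
jpg_text=$(git check-attr text -- photo.jpg)

printf '%s\n' "$bat_eol" "$sh_eol" "$jpg_text"
# run.bat: eol: crlf
# build.sh: eol: lf
# photo.jpg: text: unset
```

The last line shows what binary means for line endings: it is a macro that, among other things, unsets text.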

But that's no longer true for the other files. Since Git is now messing with their line endings, it becomes important to ask and answer more questions:

  • When exactly does Git mess with the line endings?
  • What exactly does Git do when it does this messing-about?

This is where things get complicated. The key to understanding things here is to know about Git's index. This thing—this "index"—is central in Git and you really do have to know about it to use Git properly, so let's take a tour of the index.

Git's index

Git's index is either so important or so poorly named (or both) that it actually has three names. It is also called the staging area, which refers to how you normally use it, and it is sometimes called the cache. This last name is pretty rare these days: you mostly see it in flags like git rm --cached. (Some commands, like git diff, have both --staged and --cached, with the same meaning. For some reason no one has gotten around to adding git rm --staged yet. I thought that would have happened by now, and I still think it will happen someday.)

The index does a bunch of things for Git, but here we really care about what it does for—and to—you. What it does for you is hold your proposed next commit. Git is, fundamentally, not about files, but rather about commits. Each commit holds files: in fact, each commit has a full snapshot of every file. (Each commit also has some metadata, such as the name and email address of the commit's author, but we'll skip that here.)

The thing about commits, though, is that they're purely read-only. You can make new ones, but you can never change any existing commit. The git commit --amend command, for instance, fakes it: it does not change the existing commit, it makes a new one and stops using the old one in favor of the new one instead. When you can't tell the difference—and sometimes you can't—this is just as good. When you can tell the difference—and sometimes you can—the cracks show through.

But if you can't change a commit (and you can't), and if, as is also true, the files inside a commit are in a special, compressed, de-duplicated, Git-only form that no programs other than Git itself can even read in the first place, how can you use the files that are inside a commit? The answer is simple enough: in order to use a commit, you have to have Git extract that commit first. We run git checkout or git switch to achieve this. Git extracts the files from the commit, placing usable versions of them in our working tree or work-tree, where we can see them and get our work done.

Git could stop here, with committed files—read-only inside the current commit, frozen for all time—and working files. Other version control systems do stop here. But Git doesn't. Instead, as it's extracting the commit, Git puts "copies" of each file into Git's index.

I put "copies" in quotes here because the files in Git's index are stored in the internal, compressed, de-duplicated format. Since they were just extracted from some commit, they take no space: they're de-duplicated away. They hold the same data in the index that they hold when they're inside the commit: this data is frozen for all time.

What's special about the index "copies" of files is that, unlike the committed copies, you can replace them. The git add command tells Git: compress and de-duplicate the working tree file. Git reads the working tree copy, compresses it, and checks to see if the compressed result is a duplicate of some existing file in any existing commit. (This is where that blob hash ID trick comes in: it's why any file consisting entirely of hello\r\nworld\r\n has hash ID 23eb407b644b0e362fa224168ecd0adfa02b022a.) If this is a duplicate, Git puts the duplicate's hash ID in the index. If it's not a duplicate, Git arranges to store a new blob in the object database,[1] and stores the new blob's hash ID in the index.

Either way, after this update-the-index step, the proposed next commit is now updated. The file you git add-ed is now staged, and git status will compare the staged hash ID to the current-commit hash ID and say staged for commit if these hash IDs don't match. (This means that git add-ing a file that's been turned back to match the committed copy takes away the staged for commit message, even though the file will in fact be in the next commit. It's just that the hash IDs now match!)
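The hash ID comparison can be watched directly. The following is a sketch in a scratch repository (the identity settings and the file main.py are placeholders); it uses git rev-parse to print the blob ID stored in HEAD and in the index.

```shell
# Sketch: "staged" just means the index hash ID differs from the HEAD hash ID.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"
git config core.autocrlf false

echo "version 1" > main.py
git add main.py
git commit -q -m "initial"

# Change the file and stage it: the index entry now differs from HEAD.
echo "version 2" > main.py
git add main.py
head_id=$(git rev-parse HEAD:main.py)
staged_id=$(git rev-parse :main.py)
status_staged=$(git status --short)
echo "$status_staged"                 # M  main.py

# Put the old content back and stage again: the hash IDs match once more.
echo "version 1" > main.py
git add main.py
restaged_id=$(git rev-parse :main.py)
status_restaged=$(git status --short)
echo "[$status_restaged]"             # [] (nothing staged any more)
```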

So, Git's index holds this proposed next commit. To make a new commit, you:

  • futz with the files in your working tree;
  • run git add on them to copy them back into Git's index; and
  • run git commit to package up whatever is in Git's index right then.

This is why you have to keep git adding a file each time you change it: Git doesn't automatically copy the working tree file back into the index. Git only copies it back when you say to do that.[2]

The end effect—and what you should take into the next section—is that, at all times, Git has three copies of each file:

  HEAD         index      work-tree
---------    ---------    ---------
README.md    README.md    README.md
img.jpg      img.jpg      img.jpg
main.py      main.py      main.py

for instance. The work-tree version is the one you can see, read, write, feed to a JPG viewer, run with the Python program, and so on. The other two are for Git: the HEAD version is the frozen-for-all-time copy from the current commit and the index version is the malleable-but-frozen-format copy, ready to go into the next commit.

  • The git checkout or git switch command switches to some commit, copying the files out of the commit to Git's index and then to your working tree.
  • The git restore command reads a file from somewhere—a commit or the index—and writes it to the index and/or your working tree based on the -S (write to staging) and -W (write to work-tree) options.
  • The git reset -- file command reads a file from the current (HEAD) commit and writes it to Git's index, leaving your working tree copy alone. (The -- here is a precaution, in case the name of the file is, say, master or dev or something that resembles a branch name).
  • The git add file command reads a file from your working tree and writes it to the index.
  • (Lots of alternatives are not listed here.)

So all these various commands are tricks for manipulating the index and/or working tree copy, in preparation for making the next commit (since Git is mostly about making new commits, while keeping all the old ones).
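Here is a sketch that makes the three copies visibly different and then moves one of them around. It assumes a Git new enough to have git restore (2.23 or later); the repository, identity and file contents are invented for the example.

```shell
# Sketch: the HEAD, index and work-tree copies of one file can all differ.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"
git config core.autocrlf false

echo "committed" > README.md
git add README.md
git commit -q -m "initial"

echo "staged" > README.md
git add README.md              # work-tree -> index
echo "working" > README.md     # only the work-tree copy changes

in_head=$(git show HEAD:README.md)
in_index=$(git show :README.md)
in_worktree=$(cat README.md)
echo "HEAD=$in_head index=$in_index work-tree=$in_worktree"
# HEAD=committed index=staged work-tree=working

# git restore --staged: copy HEAD -> index, leave the work-tree copy alone.
git restore --staged README.md
index_after_restore=$(git show :README.md)
worktree_after_restore=$(cat README.md)
echo "index=$index_after_restore work-tree=$worktree_after_restore"
# index=committed work-tree=working
```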


[1] Git actually stores the new compressed blob object immediately, even if it winds up being replaced before you make a new commit. This is okay (if perhaps sub-optimal in certain peculiar situations) because Git will run git gc for you now and then. Certain older Git versions had a bug where git gc didn't get run often enough, and this could actually be a problem, but that's been fixed for years now.

[2] Using git add -u tells Git to find modified working tree files, and add them, which automates the job. Using git commit -a is a lot like running git add -u && git commit: it runs a git add -u step before the commit. However, -a complicates things a bunch, and interacts badly with poorly-written pre-commit hooks, so it's kind of a bad idea. Try not to rely on it: use git add -u instead, in case you have one of these bad commit hooks. Or, learn to love the index, which lets you play clever tricks like git add -p, although this too interacts badly with poorly-written pre-commit hooks.


How and when Git messes with line endings

If:

  • Git is told to mess with line endings, and
  • a file is marked text, so that Git will mess with this file, or the text=auto setting is being used and Git guesses that this file is text

then:

  • Git will optionally mess with the file's bytes on the way from index to working tree (checkout or switch, restore, various kinds of reset, etc), and
  • Git will mess with the file's bytes on the way from working tree to index (add, mostly).

What messing-about will Git do? That depends on the eol= setting:

  • eol=crlf: On the way out, Git will change LF-only to CRLF. If a line reads hello\n in the index, Git will write hello\r\n to the working tree copy. On the way in, Git will change CRLF to LF-only. If a line reads hello\r\n in the working tree copy, Git will write hello\n to the index copy.

  • eol=lf: On the way out, Git will do nothing to the file. On the way in, Git will change CRLF to LF-only.

That's it—that's all Git will do! It won't ever change LF to CRLF on the way in, for instance. In that sense, we could say that Git "prefers" LF-only line endings. (If you want something fancier, you can write clean and smudge filters, which also operate on data "on the way in" and "on the way out" respectively, and here you can do whatever you like. But the built in stuff inside Git is limited to these few CRLF options.)
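The two eol= rules can be checked by counting bytes. In this sketch (a scratch repository with made-up files run.bat and build.sh), each file starts as the 14-byte CRLF version of hello/world; the LF-only version is 12 bytes. Git may print a warning about CRLF being replaced while adding, which is expected here.

```shell
# Sketch: watch the byte counts change "on the way in" and "on the way out".
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config core.autocrlf false

printf '%s\n' '*.bat text eol=crlf' '*.sh text eol=lf' > .gitattributes

# Both working tree files start out with CRLF endings (14 bytes each).
printf 'hello\r\nworld\r\n' > run.bat
printf 'hello\r\nworld\r\n' > build.sh
git add .gitattributes run.bat build.sh

# On the way in, CRLF became LF for both files: 12 bytes each in the index.
bat_index_size=$(git cat-file -s :run.bat)
sh_index_size=$(git cat-file -s :build.sh)

# Force a fresh trip out of the index by deleting and re-extracting the files.
rm run.bat build.sh
git checkout -- run.bat build.sh

bat_worktree_size=$(wc -c < run.bat | tr -d ' ')   # eol=crlf: LF -> CRLF, 14
sh_worktree_size=$(wc -c < build.sh | tr -d ' ')   # eol=lf: left as stored, 12

echo "index:     run.bat=$bat_index_size build.sh=$sh_index_size"
echo "work-tree: run.bat=$bat_worktree_size build.sh=$sh_worktree_size"
```

Note that build.sh went in with CRLF endings and came back out with LF-only endings: the round trip is not lossless once Git is messing with the file.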

There's one more tricky bit: Git tries hard to optimize not making copies, in or out, of the index and working tree. This attempt usually works right, but it fails (by not making copies when it should make copies) if and when you switch around whether and how Git should mess with line endings. The tricks you linked to, where you rm .git/index for instance, are mostly ways to get around this: they force Git to copy data even in cases where Git thinks it doesn't need to, because the changed status of a file (from -text to text, or eol=lf to eol=crlf, or whatever) means that Git really does have to copy.

This is all that you need to memorize. The remaining details can be worked out.

Consequences

Suppose you have a repository in which, in every commit that has text files, all committed copies have LF-only line endings. Since this is, in effect, Git's "preferred" format, the files are already all "OK". If you choose to have Git mess with files, all future commits will have LF-only line endings too, and the future commits will match the existing commits.

But suppose you have a repository in which some or all text files are committed with CRLF line endings. These commits are frozen for all time! You literally cannot change them. They will continue to have CRLF line endings. If you now begin choosing to have Git mess with files, future commits will gradually, or suddenly all at once, have some or all files with LF-only line endings, as stored in the repository.

Regardless of which of the above statements about the existing repository are true, your settings, should you set them, will affect how you see the files in your working tree, because to get into your working tree, Git has to extract the files from commits. But your file viewers might not show you what the ends of lines look like. That is, if your preferred file viewer displays a CRLF line and an LF-only line as identical, they'll look identical, even when they aren't.

The fact that the ends of lines "change" can make a change that Git considers a change. If the existing commits in the repository have CRLF line endings, and you start having Git mess with line endings, it's a good idea to do one "normalizing" commit. You will become the owner of every line of every file that is changed this way, but git blame, at least, has a way to "skip over" a specific commit (the --ignore-rev and --ignore-revs-file options in newer Git versions), if you need to figure out where some code came from. Since this "fix all files, but no real changes" commit doesn't do anything except normalize these lines, you can tell git blame to skip over it.
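A sketch of such a normalizing commit and of skipping it in git blame follows. It assumes Git 2.23 or later for --ignore-rev; the repository, identity, file name notes.txt and commit messages are invented, and the exact lines blame attributes after skipping depend on Git's line-matching heuristics.

```shell
# Sketch: one "normalize line endings" commit, then blame that looks past it.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"
git config core.autocrlf false

# First commit: the file is stored with CRLF endings, exactly as written.
printf 'hello\r\nworld\r\n' > notes.txt
git add notes.txt
git commit -q -m "original content, CRLF endings"
original=$(git rev-parse HEAD)

# Second commit: turn on conversion and re-add everything in normalized form.
echo '*.txt text' > .gitattributes
git add .gitattributes
git add --renormalize .
git commit -q -m "normalize line endings"
normalizing=$(git rev-parse HEAD)

# Plain blame pins every line on the normalizing commit.
plain_blame=$(git blame --porcelain HEAD -- notes.txt | head -n 1 | cut -d' ' -f1)

# With --ignore-rev, blame tries to attribute the lines to earlier commits.
git blame -s --ignore-rev "$normalizing" HEAD -- notes.txt
```

For a long-lived repository, listing such commits in a file and passing it with --ignore-revs-file saves retyping the hash.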

Note that Git (and git diff) do consider these lines different, unless you tell git diff to ignore certain white space changes:

  • --ignore-cr-at-eol: Ignore carriage-return at the end of line when doing a comparison.
  • -w, --ignore-all-space: Ignore whitespace when comparing lines.

(There are others; this is just a partial list.)
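As a quick illustration of the first option, this sketch commits an LF-only file, rewrites the working tree copy with CRLF endings, and counts the lines of diff output with and without --ignore-cr-at-eol (available since Git 2.16). The repository and the file notes.txt are placeholders.

```shell
# Sketch: a CR-only difference is a real difference unless diff is told otherwise.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"
git config core.autocrlf false

printf 'hello\nworld\n' > notes.txt
git add notes.txt
git commit -q -m "LF endings"

# Same text, but now with CRLF endings in the working tree.
printf 'hello\r\nworld\r\n' > notes.txt

plain_lines=$(git diff -- notes.txt | wc -l | tr -d ' ')
ignoring_cr_lines=$(git diff --ignore-cr-at-eol -- notes.txt | wc -l | tr -d ' ')

echo "lines of diff output, plain:              $plain_lines"
echo "lines of diff output, --ignore-cr-at-eol: $ignoring_cr_lines"
```

The plain diff reports every line as changed, while the second count should be zero, which is the "modified but I see no changes" situation in miniature.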

Other items that should be mentioned here

When Git commits a file, it stores both the file's data and its "mode". Git has two modes for files, which it calls 100644 and 100755 when it shows them, but for which git update-index has a --chmod option that it spells -x and +x respectively. This tells Git that on a Unix-like system or any other system that has an equivalent, the 100755 or +x file should be marked executable at checkout.

Most Windows file systems currently don't have an equivalent. In this case, Git tries to retain the chmod setting from the existing checkout. The rm .git/index trick defeats this "retain the old setting" trick. So it's possible to change the mode of files when fixing end-of-line issues. This is why it's better to use git add --renormalize after changing CRLF line endings settings, if your Git supports this.
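The following sketch shows git add --renormalize (Git 2.16 or later) rewriting the stored line endings while keeping the recorded mode. Setting core.filemode to false here is an assumption made to mimic a file system without an executable bit; the repository, identity and the script build.sh are invented.

```shell
# Sketch: renormalize changes the stored bytes but keeps the 100755 mode.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"
git config core.autocrlf false
git config core.filemode false    # mimic a system with no executable bit

# Commit a script with CRLF endings (9 bytes) and mark it executable in Git.
printf 'echo hi\r\n' > build.sh
git add build.sh
git update-index --chmod=+x build.sh
git commit -q -m "add script with CRLF endings"
mode_before=$(git ls-files -s build.sh | cut -d' ' -f1)

# Change the line-ending rules, then re-add tracked files in normalized form.
echo '*.sh text eol=lf' > .gitattributes
git add .gitattributes
git add --renormalize .

mode_after=$(git ls-files -s build.sh | cut -d' ' -f1)
size_after=$(git cat-file -s :build.sh)   # 8 bytes: the CR is gone

echo "mode: $mode_before -> $mode_after, staged size: $size_after"
git status --short
```

After this, git status lists build.sh as staged, and committing produces the one-off normalizing commit.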

The general idea that there are some changes, or features of files, that are invisible or hard to see is a little weird, but we have non-computing examples: for instance, in fine typesetting, we have the hyphen (-), the en-dash (–), and the em-dash (—). These may or may not display on your computer as different width dashes. We have other computer examples, such as the Whitespace programming language or the terrible mistake with makefile syntax (where tabs are significant). And, in spycraft—whether or not we use computers—we have steganography.
