.gitattributes settings are being ignored on commit?

I may have a confused understanding of .gitattributes; please tell me, or point me to a reference, if that is so. I thought part of this system was to tell Git how to translate line endings when moving files into and out of storage on the local system.

Somehow, one of my repositories has a recurring problem: classic Mac OS style "CR" line endings keep being committed and pushed for .xml files. My only .gitattributes file, at the repository root, does contain *.xml text, so I would expect Git to always store every line of an edited .xml file with LF endings, no matter what.

Some generating software that I use updates some of these XML files and writes CR line endings on the lines it modifies, and I can't do much about that. I would expect a commit guided by .gitattributes to correct this problem.

Yet checkouts on another Windows machine still come out with CR line endings?

Am I misguided or is there something going wrong?

Below is from the .gitattributes:

###########################################################
# Common settings that generally should always be used
# with your language specific settings
###########################################################
# package NOTHING for the source zip downloadable
* export-ignore
# Auto detect text files and perform LF normalization
* text=auto

#
# The above will handle all files NOT found below
#

# Documents
*.doc  diff=astextplain
*.DOC  diff=astextplain
*.docx  diff=astextplain
*.DOCX  diff=astextplain
*.dot  diff=astextplain
*.DOT  diff=astextplain
*.pdf  diff=astextplain
*.PDF  diff=astextplain
*.rtf  diff=astextplain
*.RTF  diff=astextplain
*.md  text
*.adoc  text
*.textile  text
*.mustache  text
*.csv  text
*.tab  text
*.tsv  text
*.sql  text
*.htm  text diff=html
*.html text diff=html
*.xml  text
*.xhtml text diff=html
*.json text
*.txt text

CodePudding user response：

"I may have a confused understanding of .gitattributes and please direct me to or tell me if that is true. I thought part of this system was to inform git on how to translate line endings when taking files in and out of storage on the local system."

Yes, sort of. It's the "sort of" that will get you here: that, plus the complicated (because of historical mistakes) rules for doing said translations.

To understand what goes on, you must understand multiple things simultaneously (this is often the case with Git, which is very self-referential: you can only understand Git once you understand Git...).

A Git commit is permanent—well, mostly permanent, but we won't cover how to get rid of one (it's hard in general)—and unchangeable (completely unchangeable; not even Git itself can alter a commit once it's made). Note that git commit --amend is a sort of lie, albeit a useful lie, which we again won't cover here, but it makes heavy use of that mostly permanent trick. If you imagine commits as bricks in a wall, the "topmost" (most recent) commit is often still a bit loose and can be whacked off the top of the pile. This gets a lot harder once you've stacked some more bricks on top.

Anyway, with that in mind, let's consider that each commit holds a full snapshot of every file, in this special, read-only, Git-only, compressed and de-duplicated format. The de-duplication means that the repository doesn't grow enormously fat even though every commit stores every file, because in general, we make a new commit by changing just a few files, or adding one or two new ones, or something, and mostly re-using the old files. The re-used contents are duplicates, and by being de-duplicated, take no space, even though they're at least logically stored in each commit. (They're just not physically stored there: this is enabled by the read-only-ness. Note how Git depends on Git here. The de-duplication trick only works because commits are unchangeable, and commits are unchangeable because this enables the hashing tricks that Git uses, and because this enables re-using file content, and so on.)
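To make the de-duplication idea concrete, here is a minimal Python sketch. The blob_id function follows the way Git names blob objects in an ordinary SHA-1 repository (hash of a "blob <size>" header, a NUL byte, then the content). ToyObjectStore and the two "commits" are hypothetical stand-ins for illustration only, not Git's real storage layer.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Object ID Git gives a blob with this content (SHA-1 repositories)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ToyObjectStore:
    """Hypothetical content-addressed store: identical content is kept once."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        oid = blob_id(content)
        # Storing the same bytes again is a no-op: the key already exists.
        self.objects.setdefault(oid, content)
        return oid


store = ToyObjectStore()

# Two "commits", each a full path -> blob ID snapshot; only b.xml changes.
commit1 = {"a.xml": store.put(b"<a/>\n"), "b.xml": store.put(b"<b/>\n")}
commit2 = {"a.xml": store.put(b"<a/>\n"), "b.xml": store.put(b"<b>new</b>\n")}

print(commit1["a.xml"] == commit2["a.xml"])  # True: the same blob is re-used
print(len(store.objects))                    # 3 blobs stored for 4 file entries
```

Because the ID is derived from the content, changing even one byte (a line ending, say) produces a different blob, which is why a stored blob can never be "fixed up" in place.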

Well, that's all fine, but there's an obvious problem. We'd like to use our commits to actually accomplish something, and we can't do that with unreadable read-only files. So the files stored in the commits, as snapshots, are utterly useless! Well, OK, not quite utterly useless: Git can read them. So we have Git read them out and expand them into ordinary everyday files.

Now, in a system that isn't Git, this expanding-out would happen with all the files in your commit and you'd now have a usable version of the file. If Git is doing line-ending fiddling, this would be where it happened: the usable version would have your OS's line endings, whatever those are.

When you go to make a new commit from these files, Git would have to re-compress and re-Git-ify every file. This is also where Git would do line-ending-fiddling, if it's being done at all; the Git-ized version of each file would have the Git LF-only line endings.

Git isn't like these other systems

When you check out a commit with Git, Git does have to—at least initially—expand out the internal form of each stored file. But (and this is a very big but) Git also stores a pre-compressed, pre-de-duplicated copy of the file at this time. Since Git just got this file out of a commit—which holds the compressed, de-duplicated copy—this "store a de-duplicated copy" is the same kind of really trivial thing that Git does with a commit that re-uses a file. It takes no space,¹ and almost no time at all, to make this extra not-really-a-copy-at-all "copy".

It's this internal-format version of the file, stored in Git's index, that Git is going to use to make the next commit. So as Git checks out some commit, if Git is doing line-ending-fiddling, Git fixes up the working tree copy of the file to have the line endings you asked for, presumably to make your OS happy. But Git literally does not touch the committed version of the file. Whatever line endings it has, it has: that copy is read-only and can never be changed, and that copy has been "copied" to Git's index.

Unless you modify the working tree copy of the file and explicitly run git add on it, Git is going to leave this index copy alone. When you run git commit, Git is going to make the new commit from whatever is in Git's index at that time. So, as soon as some "wrong" line ending gets into a commit, it continues to propagate into future commits, until you take explicit action to stop this.

That's the first key: once something has gone wrong, it stays wrong until and unless you fix it. You can fix it by accident, by modifying some file and running git add on it, or you can fix it on purpose. Since Git version 2.16.0, there is git add --renormalize to force Git to update index copies of files according to the current .gitattributes rules. This leads to a simple recipe: if you change the .gitattributes rules, run git add --renormalize . (the trailing dot is part of the command and means "the current directory"). It's generally a good idea to do this as a separate commit: change .gitattributes, add with renormalize, and make sure you like the result just before or after you commit it. Again, "just after" a commit the brick is loose and easily removed with git reset, no harm no foul.
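The propagation described above can be modeled in a few lines of Python. This is a toy, not Git: ToyRepo, its methods, and the normalize flag are all hypothetical names, and the only "clean" step modeled is CRLF to LF. It is meant only to show that a commit is built from the index, so a wrong blob rides along until something re-adds it.

```python
def clean(data: bytes) -> bytes:
    """CRLF -> LF: the only conversion this toy applies on 'add'."""
    return data.replace(b"\r\n", b"\n")


class ToyRepo:
    """Hypothetical model: commits are snapshots of the index, not the worktree."""

    def __init__(self):
        self.worktree = {}  # path -> bytes as an editor would see them
        self.index = {}     # path -> bytes staged for the next commit
        self.commits = []   # list of frozen snapshots

    def add(self, path: str, normalize: bool = True) -> None:
        data = self.worktree[path]
        self.index[path] = clean(data) if normalize else data

    def commit(self) -> None:
        self.commits.append(dict(self.index))

    def renormalize(self) -> None:
        # Rough analogue of 'git add --renormalize .': re-add every tracked file.
        for path in list(self.index):
            self.add(path, normalize=True)


repo = ToyRepo()
repo.worktree["doc.xml"] = b"<a/>\r\n"
repo.add("doc.xml", normalize=False)  # staged before any text rule existed
repo.commit()

repo.worktree["other.txt"] = b"hi\n"
repo.add("other.txt")                 # doc.xml is not re-added here
repo.commit()

repo.renormalize()
repo.commit()

for number, snapshot in enumerate(repo.commits, start=1):
    print(number, snapshot["doc.xml"])
```

In this model the first two commits both carry the CRLF version of doc.xml, and only the commit made after renormalize() holds the LF-only version.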

Now, there's nothing wrong with having recipes (memorized or printed out or whatever), but don't just rely on that. In particular we haven't yet covered the other, and more likely, issue.

¹The index entry that goes into Git's index / staging-area does take some space. The amount of space needed per file varies depending on the file name length, but tends to be about 100 bytes per file on average.

End of line translations

Git has exactly two kinds of end-of-line translations built in, and has had this ever since EOL-hacking became part of Git.² To keep things simple—or as simple as possible anyway—Git can only do these two things:

On the way out of the internal, compressed-and-de-duplicated format, Git can change \n-only (LF-only) to \r\n (CRLF) line endings.
On the way into the internal format, Git can change \r\n (CRLF) line endings to \n-only (LF-only) line endings.

(Note that Git doesn't promise anything useful in the case of, e.g., \r\r\n\n byte sequences: you might expect this to become \n\n but in fact it becomes \r\n\n, at least in some versions of Git, on the first pass through the encoder. After a round trip it will become \n\n. Don't try to use the line ending conversion as a general tool. Use a general tool for general work, and see footnote 2 again.)
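As a rough illustration of these two conversions, here is a Python sketch using plain byte replacement. The function names are made up, and real Git applies extra safety checks before converting, so treat this as a model of the rules as described here rather than of Git's exact behavior.

```python
def to_internal(data: bytes) -> bytes:
    """Working tree -> repository: each CRLF pair becomes a single LF."""
    return data.replace(b"\r\n", b"\n")


def to_worktree(data: bytes) -> bytes:
    """Repository -> working tree: each LF becomes CRLF (LF-only input assumed)."""
    return data.replace(b"\n", b"\r\n")


# The ordinary case: CRLF in the working tree, LF in the repository.
print(to_internal(b"one\r\ntwo\r\n"))  # b'one\ntwo\n'
print(to_worktree(b"one\ntwo\n"))      # b'one\r\ntwo\r\n'

# A lone CR is not part of a CRLF pair, so neither direction touches it.
print(to_internal(b"one\rtwo\r"))      # b'one\rtwo\r'

# The odd sequence from the note: one pass leaves a CR behind,
# a second pass through the encoder removes it.
first = to_internal(b"\r\r\n\n")
print(first)                           # b'\r\n\n'
print(to_internal(first))              # b'\n\n'
```

The third example is the one that matters for CR-only files: there is no conversion in either direction that turns a bare CR into anything else.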

The points at which Git does these operations—expand an internal blob object to usable text, and compress a working tree file down into an internal blob object—are where Git does the conversions, and these are mostly git checkout and similar for "expand" and git add for "compress". Note that there are a lot more forms of checkout (e.g., git restore, git reset, and the like) than there are of git add (the git add command is mostly it here), and recently, the git cat-file plumbing command has acquired the ability to do these text conversions as well, but they're very special-cased here. The thing to remember as a user is that:

git checkout <commit-or-branch>
git switch [--detach] <commit-or-branch>

do the expansion with all the files they're extracting, and:

git reset ... -- path
git restore ... -- path

do the same expansion but with just the specified files. They use the file's path name to look up the right rules in .gitattributes. So this is when you'll get LF-only to CRLF conversions.

While there are tricks like git add --patch, you mostly git add an entire file or use an en-masse "add everything" like git add . or git add -u, and it's easy to see that git add has the path name and can look up the right rules in .gitattributes here as well. So this is when you'll get CRLF to LF-only conversions.

For these conversions to happen, you must enable them.

²I'm pretty sure Linus was against this from the start and still is, in the same vein as I am: version control systems should store what you put in them, not do all kinds of crazy file format conversions; those should be separate tools. Alas, Git is acquiring crazy file format conversions, including working-tree-encoding stuff.

The complex rules in `.gitattributes`

We need to keep in mind that there are, in the end, one to three separate rules for each file, depending on how you want to count these:

Do we mess with the file at all? That is, is this a "text" file, or is this a "binary" file? We only mess with "text" files: binary files are pure (pure of heart?) and must be untouched.
If we do mess with the file, when do we mess with it? Do we do LF-to-CRLF on output? Do we do CRLF-to-LF on input? These, too, are separate rules.

The text directive says that the file is definitely text. The -text directive says that the file is definitely not text. The text=auto directive, if you use it, instructs Git to guess whether the file is text.³

There's a "macro directive" spelled binary that means -text -diff -merge. This includes -text, meaning not text, but also two more things. The -text alone suffices to ensure that the file won't be messed with. The other two things are just for diff-ing and merge-ing, separate from the "mess with / don't mess with" flag.
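Here is a small Python sketch of how such directives combine for one path. It assumes the documented behavior that a later matching line overrides an earlier one, attribute by attribute, and that binary expands to -text -diff -merge. The resolve and parse_attr functions are hypothetical, and fnmatch on the file's base name is only an approximation of Git's real pattern matching.

```python
import fnmatch
import posixpath

# The one built-in macro discussed above.
MACROS = {"binary": ["-text", "-diff", "-merge"]}


def parse_attr(token: str):
    """Turn 'text', '-text' or 'text=auto' into a (name, value) pair."""
    if token.startswith("-"):
        return token[1:], False       # unset
    if "=" in token:
        name, value = token.split("=", 1)
        return name, value            # set to a string value
    return token, True                # set


def resolve(path: str, lines):
    """Approximate the attributes that apply to path; later lines win."""
    attrs = {}
    base = posixpath.basename(path)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *tokens = line.split()
        if not fnmatch.fnmatchcase(base, pattern):
            continue
        for token in tokens:
            if token in MACROS:
                attrs[token] = True   # the macro name itself is also set
            for expanded in MACROS.get(token, [token]):
                name, value = parse_attr(expanded)
                attrs[name] = value
    return attrs


rules = ["* text=auto", "*.xml text", "*.png binary"]
print(resolve("config/site.xml", rules))  # {'text': True}
print(resolve("README", rules))           # {'text': 'auto'}
print(resolve("img/logo.png", rules))
```

To see what Git itself concludes for a real file, git check-attr -a followed by the path is the thing to ask.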

There are eol= directives. These set which of the two conversions Git will do, and (according to at least some Git documentation versions) imply text. (One thing that is not clear to me is how this acts if you also set an explicit text=auto for the same files.)

There are also the core.autocrlf and core.eol settings, which are Git configuration options and aren't in .gitattributes at all. These can turn on text=auto and fiddle with the eol= settings. And, there's crlf, -crlf, and crlf=input, which are from older versions of Git and mean text, -text, and eol=lf respectively.

Consult your own Git installation's gitattributes documentation because this part of Git has changed frequently, and had frequent small bug-ettes. It's very difficult to say exactly which combinations produce which results in which Git versions, if you combine multiple different options.

³The guessing is based on some front portion of the file's content. I'm not a big fan of such guessing: it's virtually guaranteed to get it wrong in at least some cases. The current "is binary" algorithm uses three tests:

Is there a null byte in the portion we checked? Yes = binary
Is there a lone CR (CR without LF) in the portion we checked? Yes = binary
Does the number of "non-printable" bytes exceed 1/128 of the number of "printable" bytes? Yes = binary

If none of the three tests say "binary", the file is "text". "Printable" bytes are those that are at least 32 (space) and less than 127 (ASCII DEL), codes 128 to 255, and backspace, horizontal tab, ESC, and formfeed. "Non-printable" bytes include NUL, DEL, and the codes in 1-31 that are not excepted specifically by other rules or by being CR or LF. A control-Z (DOS-style EOF) at the end is considered neither printable nor non-printable.
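The heuristic can be sketched in Python as follows. This is my reading of the logic in Git's convert.c, re-typed from memory, so treat it as an approximation: the function names and the dictionary layout are my own, and the third test is written as "printable bytes divided by 128 is less than non-printable bytes".

```python
def gather_stats(buf: bytes) -> dict:
    """Count the byte classes that the text/binary guess looks at."""
    stats = {"nul": 0, "lonecr": 0, "lonelf": 0, "crlf": 0,
             "printable": 0, "nonprintable": 0}
    i, n = 0, len(buf)
    while i < n:
        c = buf[i]
        if c == 0x0D:                              # CR
            if i + 1 < n and buf[i + 1] == 0x0A:
                stats["crlf"] += 1
                i += 2                             # consume the LF as well
            else:
                stats["lonecr"] += 1
                i += 1
            continue
        if c == 0x0A:                              # LF not preceded by CR
            stats["lonelf"] += 1
        elif c == 127:                             # DEL
            stats["nonprintable"] += 1
        elif c < 32:
            if c in (0x08, 0x09, 0x1B, 0x0C):      # BS, HT, ESC, FF
                stats["printable"] += 1
            else:
                if c == 0:
                    stats["nul"] += 1
                stats["nonprintable"] += 1
        else:                                      # 32..126 and 128..255
            stats["printable"] += 1
        i += 1
    if n >= 1 and buf[-1] == 0x1A:                 # trailing control-Z
        stats["nonprintable"] -= 1
    return stats


def looks_binary(buf: bytes) -> bool:
    """Apply the three tests; True means 'treat as binary'."""
    stats = gather_stats(buf)
    if stats["nul"] or stats["lonecr"]:
        return True
    return (stats["printable"] >> 7) < stats["nonprintable"]


print(looks_binary(b"<a>\r\n<b>\r\n"))  # False: CRLF endings look like text
print(looks_binary(b"<a>\n<b>\n"))      # False: LF endings look like text
print(looks_binary(b"<a>\r<b>\r"))      # True: lone CRs make the guess "binary"
```

Note the last case: under text=auto alone, a file with bare CR line endings is guessed to be binary and left untouched.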
