Confused about how to handle line endings in git repo

I have some basic knowledge of core.autocrlf and the text attribute in .gitattributes, but I am still stuck on my scenario.

Scenario

In all my editors I use CR LF, so when I modify a file it is saved with CR LF endings. However, I must regularly update files in my repo from an external source (I mean copying them over), and those files use LF line endings.

What I've tried

I thought core.autocrlf=true will work for me, but unfortunately here is the trap:

My issue is that every copied-over file always shows as changed, whether or not its content really changed, because the line endings go from CR LF to LF.

If I set core.autocrlf=input, there are no false changes after copying files over. But if I edit a file (even just adding a space), save it, then undo the edit and save again, my editor rewrites it with CR LF endings, so Git again shows a change even though there is no real content change...

Question

What workflow would let me maintain this repo and always keep commits clean, meaning only real content changes with clear diff information?

CodePudding user response：

git config core.autocrlf false remains the sensible option.

I usually add a .gitattributes, with, for instance:

*.bat   text eol=crlf

That way, the files are converted consistently: normalized to LF when staged with git add, and written with CRLF endings on checkout.

Since the .gitattributes file is part of the code base, it will be used by any user cloning the repository.

CodePudding user response：

If you ever turn on CRLF modification, Git will end up with a "preferred" internal encoding in which saved files have newline-only (LF-only) line endings, over time.

If you don't want this to happen, and want to keep all your files with CRLF line endings both inside commits and when extracted from commits, don't ever turn on any CRLF modification. Whenever you get a file from somewhere else, modify it to have CRLF line endings before you ever let Git see it.
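If you go the "convert it yourself before Git sees it" route, the conversion is easy to script. Here is a minimal sketch in Python; the names to_crlf and convert_file_to_crlf are my own, not anything Git provides, and it assumes you only point it at files you know are text.

```python
def to_crlf(data: bytes) -> bytes:
    """Return data with every line ending written as CRLF."""
    # Collapse existing CRLF pairs to LF first so they are not doubled,
    # then expand every LF to CRLF. Lone CR bytes are left untouched.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_file_to_crlf(path: str) -> bool:
    """Rewrite the file at path with CRLF endings; return True if it changed."""
    with open(path, "rb") as f:
        original = f.read()
    converted = to_crlf(original)
    if converted != original:
        with open(path, "wb") as f:
            f.write(converted)
    return converted != original
```

Running convert_file_to_crlf on each freshly copied file before git add should leave files that were already CRLF untouched, since to_crlf is idempotent.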

Otherwise, you'll want to make sure that Git itself stores all files with LF-only line endings, and arranges for text files—but not any other files—to have CRLF line endings when extracted from the repository. To understand this at a high level, remember the following items:

Every commit stores a full snapshot of every file. However, these files are not in the same format that your computer uses. Instead, they're compressed, Git-ified, and de-duplicated. The de-duplication in particular is useful because many commits have mostly the same files as a previous commit, and this means Git only stores one copy of the file internally.
This means each commit acts like an archive (a tar or zip file for instance). The stuff that's inside any existing commit can never be changed: not even Git itself can change it.¹ But this also means that none of your other computer programs can even read the file. Literally nothing can write it and only Git can read it. So the archives are only good as archives: you can't use them to get any new work done.
Hence, to get new work done, Git has to extract the archived files.

It's the extraction process that will change LF-only line endings, as stored inside the commits, to CRLF line endings. Once CRLF modification is on, Git's files will be stored internally, in this compressed and de-duplicated form, with LF-only line endings. (Any existing copies that have CRLF line endings cannot be changed and therefore still have CRLF endings, but newly added copies, as opposed to re-used old ones, will be LF-only.)

This is why turning CRLF modification on is disruptive: all the existing commits are, presumably, CRLF-only, and all new commits will prefer to save files as LF-only. The new commits will re-use the old files if you have not touched the working tree file at all, but will change the files if you have touched the working tree file (because they'll strip the CR-s from the CRLF pairs). You can choose to "rip the band-aid off all at once" (re-add every file, with git add --renormalize or similar) or let it happen over time, but one way or another Git will report that you changed every line of every file, because you will in fact have done so.
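If you want to check whether one of these "every line changed" diffs is really just a line-ending flip, you can compare the two versions after normalizing them. A small sketch, with function names I made up for illustration:

```python
def normalize_eol(data: bytes) -> bytes:
    """Rewrite CRLF pairs as LF; leave every other byte alone."""
    return data.replace(b"\r\n", b"\n")


def only_line_endings_differ(old: bytes, new: bytes) -> bool:
    """True when old and new differ, but only in CRLF versus LF."""
    return old != new and normalize_eol(old) == normalize_eol(new)
```

Feed it the bytes of the committed version and of the working tree version. If it returns True, the change is pure line-ending noise.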

Let me pause for the footnote, and then get into specifics.

¹The reason for this has to do with the way Git stores the file data internally, representing it as a hash ID. If anything changes, the hash ID changes too, so this is now a new, additional file: the old file is still there, with its old hash ID.
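To make the footnote concrete: for a file's content (a "blob"), the hash ID is the SHA-1 of a short header followed by the raw bytes. The sketch below assumes the default SHA-1 object format; git_blob_id is just my name for the helper.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the blob hash ID Git assigns to these exact bytes."""
    # The header is "blob <size in bytes>" followed by a NUL byte.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


lf_version = b"line one\nline two\n"
crlf_version = b"line one\r\nline two\r\n"

# Same text, different bytes, so two different IDs.
print(git_blob_id(lf_version))
print(git_blob_id(crlf_version))
```

The two printed IDs differ, which is why a CRLF copy and an LF copy of "the same" file are two separate objects as far as Git is concerned.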

Specifics

I thought core.autocrlf=true will work for me ...

It can. I personally think this is a bad idea, because the "auto" part of this means please guess which files are text and which are not. I don't like my software to guess. Git's gotten pretty good at guessing over the years, but it can still guess wrong now and then.
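Here is why a wrong guess hurts. The sketch below uses a deliberately crude guess ("no NUL byte means text"), which is my simplification and not Git's actual heuristic; Git's real check looks at more than this and might well classify this sample correctly. The payload is a made-up byte string, not a real file format.

```python
def looks_binary(data: bytes) -> bool:
    """Crude stand-in for a text/binary guess: a NUL byte means binary."""
    return b"\x00" in data[:8000]


def strip_crlf(data: bytes) -> bytes:
    """The CRLF-to-LF conversion that would be applied to a 'text' file."""
    return data.replace(b"\r\n", b"\n")


# Hypothetical binary payload that happens to contain the bytes 0x0D 0x0A
# and no NUL byte, so the crude guess above calls it text.
payload = b"\x89PAY\r\n\x1a\n"

guessed_text = not looks_binary(payload)
damaged = strip_crlf(payload) != payload
print(guessed_text, damaged)
```

When both values come out true, the file was treated as text and its bytes were silently altered. That is the failure mode that explicit .gitattributes rules avoid.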

... If I set core.autocrlf=input ...

From a high-level point of view, again, we need to remember that the commits are Git-ified, and whenever Git is doing CRLF line whacking, Git will tend to store files as LF-only. But it's time here to talk low-level rather than high-level.

There are two critical steps here:

Extracting a commit: Here, Git has to read the committed files (in the Git-only, read-only, compressed-and-de-duplicated form). These files aren't normal files at all. It's easy for Git to replace LF-only with CRLF while doing this expansion.

The result of expansion is your working tree copy: an ordinary file, that your computer can actually use. These are the files your programs will see. Note that this step occurs during, e.g., git checkout or git reset --hard when putting a file back to the way it is found inside a commit.

Compressing a file to get it ready to store into a new commit. Here, Git has to read the working tree copy of the file. Git has to compress it down and check to see if it's a duplicate. The compression stage is already checking the file byte by byte, so it's easy for Git to replace CRLF with LF-only during this step.

The result of compressing is either a duplicate (complete with CRLF line endings, if Git didn't f— with them and the working tree copy had CRLF line endings, or LF-only line endings if Git did), or it's an all new file. If it's a duplicate, Git links to the existing copy. If it's all-new, Git stores it away, ready to be committed.

Note that this step occurs during git add. There's very little other than git add that re-compresses a file.² But if you run git add on a file that you have not touched since you checked it out, Git will notice that you have not touched it, and will often skip the re-compression step. So you have to defeat this optimization with --renormalize, or touch the file, to get Git to really re-compress.

The only time Git actually does mess with line endings is during either of these steps. Using input as a setting tells Git: Don't do the LF-to-CRLF during extraction, but do do the CRLF-to-LF during compression. This is only useful if you intend all stored Git files to be LF-only forever, and all checked-out files to stay LF-only: in other words, it lets you see if there are committed files that have CRLF endings, as they'll wind up with CRLF endings in your working tree.
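The two steps and the three core.autocrlf values can be modelled in a few lines. This is a toy model under my own assumptions: it only covers files Git treats as text, it ignores .gitattributes and Git's extra safety checks, and on_add / on_checkout are names I invented.

```python
def on_add(working: bytes, autocrlf: str) -> bytes:
    """Bytes that end up stored, given the working tree bytes."""
    if autocrlf in ("true", "input"):
        # Both settings strip the CR from CRLF pairs while compressing.
        return working.replace(b"\r\n", b"\n")
    return working  # "false": store the bytes exactly as they are


def on_checkout(stored: bytes, autocrlf: str) -> bytes:
    """Bytes written to the working tree, given the stored bytes."""
    if autocrlf == "true" and b"\r\n" not in stored:
        # Only "true" expands LF to CRLF during extraction. In this sketch,
        # content that already has CRLF is passed through unchanged.
        return stored.replace(b"\n", b"\r\n")
    return stored  # "input" and "false": extract as stored


external_copy = b"alpha\nbeta\n"      # LF file copied in from outside
editor_save = b"alpha\r\nbeta\r\n"    # same text re-saved as CRLF

for mode in ("true", "input", "false"):
    same = on_add(external_copy, mode) == on_add(editor_save, mode)
    print(mode, "stored bytes identical:", same)
```

In this model both "true" and "input" store identical bytes for the LF copy and the CRLF re-save, while "false" stores two different versions.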

Instead of any of the auto settings, you can use .gitattributes controls to tell Git which files to f— with and what to do when it does that. The settings here are a bit finer-grained:

*.sh text eol=lf
*.bat text eol=crlf
# -text turns off line-ending conversion; the binary macro would also work
*.jpg -text

for instance tells Git that files named with .sh endings are to be messed with (are text), and should have LF-only line endings: Git will do the messing-with during git add, but do nothing during git checkout. Meanwhile files with .bat endings are to be messed with and should have both extract and add steps get messed-with, so that the working tree copies have CRLF line endings but the committed copies have LF-only line endings, and *.jpg files should never be messed with.

The various auto settings mean: Git, please guess which files are text and which ones are binary, and use the OS to decide how to mess with text files. Leaving out .gitattributes and not setting the auto settings (or setting them to false) means Git, don't mess with my files!

²There are some low-level plumbing commands, such as git hash-object, that can do it too.