Git diff line differentiate between deletion and insertion

Time:11-23

When I insert lines into file.json, git also counts the old line as deleted. For example, inserting "foo2": "bar2" into line 2:

old
1 {
2   "foo1": "bar1"
3 }

new 
1 {
2   "foo2": "bar2"
3   "foo1": "bar1"
4 }

When I run the following, keeping only the lines that start with - or +:

git diff -U0 ..origin/main -- path/to/file.json | grep '^[+-][^+-]'

This is the result:

-  "foo1": "bar1"
+  "foo2": "bar2"
+  "foo1": "bar1"

I get that this is how git is intended to work, but is there a way to filter out, or avoid, the old lines showing up as deleted? I need to find only the lines that were actually deleted, not the ones that were replaced by an insertion.

CodePudding user response:

There is a change in the whitespace: space, tab, carriage return (\r), new line (\n). git diff -w will ignore whitespace differences.

CodePudding user response:

Are you sure you inserted the new line as line #2 and not as line #3? In the example you provided, there is no , at the end of line #2, so maybe you in fact inserted line #3 and overlooked that line #2 changed as well, by gaining a trailing comma?

If that's not the case, then I bet the difference is in whitespace. Check both files (before and after inserting the line) very carefully with a text editor that highlights whitespace, or with a hex editor.
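If no such editor is at hand, a rough way to do the same check from the shell is sed's l command, which prints tabs as \t, carriage returns as \r, and marks the end of each line with $. A minimal sketch; the function name show_invisibles is made up for this example:

```shell
# Hypothetical helper: print a file (or stdin) with invisible characters
# made visible. Tabs show up as \t, carriage returns as \r, and every
# line end is marked with a trailing $.
show_invisibles() {
  sed -n 'l' "$@"
}
```

Running show_invisibles path/to/file.json on the versions before and after the change should make a stray \r or tab easy to spot. Depending on the sed implementation, long lines may be folded in the output.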


Lem responded:

I double checked and this is indeed the case: git is treating the new line with a comma added as a completely separate addition, since the old line didn't have a comma. How can I rule this case out?


Unfortunately, I think you can't. A change is a change. Even something easy to overlook or seemingly trivial, like whitespace, is still a change.

There is -w for ignoring whitespace, as Klox noted, but that exists only because there are so often tabs-vs-spaces, \r\n-vs-\n, or plain indentation differences. Machines would care, but humans don't, hence -w for humans.

But commas? That is entirely file-format dependent. In one programming language or data format, adding a comma like that would be critical; in another, it's just a detail. Git can't know which.

A few things that come to mind as potential solutions, if you really need the diffs to be clean of such noise:

  • You know which cases cause the problem, so if you control whatever generates the files, you could ensure that lines are always added in a comma-safe way: at the beginning or in the middle of the list, never at the end.

  • You also know exactly what the problem is, and its definition is precise and easy to detect (a comma followed by a newline). You could run the files through a preprocessor before diffing: a small tool or script that either removes all such item-separating commas or adds a comma after the last item of every list, and then diff the cleaned-up files. JSON is a nice case here, because a comma-newline sequence cannot occur in the actual data: a newline inside string data is encoded as \n or \r\n, not as a literal newline. So a comma-newline sequence is always an item separator (items in arrays, properties in objects) and never real 'data payload'.

  • Git diff is designed to be a generic text comparison tool for many languages and formats, so it can only be so good; it can't have every feature that every use case would want. You can use differs other than the default one, for example image differs for JPG files or specialized differs for Excel/Word files. They are separate tools and don't come with the default git client installation. You can configure the git client to generate diffs with another utility, one that knows how to diff JSON better and knows that commas in such places are irrelevant. However, I don't know of such a utility.
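
The preprocessor idea from the second bullet can be sketched in plain shell. This is only a sketch under assumptions: the function names strip_trailing_commas and deleted_only are made up here, the JSON is assumed to be formatted with one item per line, and the temporary files come from mktemp. It strips a trailing comma from every line of both versions, diffs the cleaned copies, and prints only the lines that are still reported as removed:

```shell
# Hypothetical helpers for illustration; neither is a git feature.

# Remove a comma (and any spaces, tabs, or CR after it) at the end of
# each line read from stdin.
strip_trailing_commas() {
  sed -e 's/,[[:space:]]*$//'
}

# Usage: deleted_only OLD_FILE NEW_FILE
# Prints the lines that a zero-context diff still reports as removed
# once trailing commas have been normalized away on both sides.
deleted_only() {
  old_norm=$(mktemp) || return 1
  new_norm=$(mktemp) || { rm -f "$old_norm"; return 1; }
  strip_trailing_commas < "$1" > "$old_norm"
  strip_trailing_commas < "$2" > "$new_norm"
  # sed drops the two "---"/"+++" header lines; grep keeps removals only.
  # diff exits 1 when files differ and grep exits 1 when nothing matches;
  # neither is an error here, so the pipeline status is discarded.
  diff -U 0 "$old_norm" "$new_norm" | sed -e '1,2d' | grep '^-' || true
  rm -f "$old_norm" "$new_norm"
}
```

With the two files from the question (and also when the only other change is a comma added after "foo1": "bar1"), deleted_only old.json new.json prints nothing, while a property that really was removed shows up as a line starting with -. The old version could be obtained with something like git show <rev>:path/to/file.json > old.json. As for the third bullet, git does let you plug in a text-conversion filter per file type (the diff.<driver>.textconv setting together with a diff=<driver> entry in .gitattributes), so the same comma stripping could probably be wired into git diff itself; that setup is untested here.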
