I would like to show only changes to the column headers of a csv file tracked by git. I use the code in this nice answer by Kirill Müller. It works almost perfectly except that it repeats the lines even if the commit didn't actually change the first line of the file.
Reproducible code
cd /tmp/
mkdir test
cd test/
git init
echo "bla,bla" > table.csv
git add table.csv
git commit -m "version bla"
echo "bla,bli" > table.csv
git commit -am "version bli"
echo "1,2" >> table.csv
git commit -am "Add data"
Issue
user:/tmp/test$ FILE=table.csv
user:/tmp/test$ LINE=1
user:/tmp/test$ git log --format=format:%H $FILE | xargs -L 1 git blame $FILE -L $LINE,$LINE
e4a89a75 (user 2022-08-10 16:45:04 +0200 1) bla,bli
e4a89a75 (user 2022-08-10 16:45:04 +0200 1) bla,bli
^58b4b88 (user 2022-08-10 16:44:16 +0200 1) bla,bla
The issue is that commit e4a89a75 appears twice, even though the last commit did not change the first line.
Expected output
e4a89a75 (user 2022-08-10 16:45:04 +0200 1) bla,bli
^58b4b88 (user 2022-08-10 16:44:16 +0200 1) bla,bla
What I tried
The log part of the instruction currently uses format:%H
user:/tmp/test$ git log --format=format:%H table.csv
c51873404aa45fb50fcbd6bd7ea06ab1e9f22071
e4a89a75e48623a1d2967996e6de3a250607e6a5
58b4b88800dd57cb1ca0476f1b9939781af28600
I tried adding the -L1,1:table.csv argument to the log section, but it formats the log differently, so the output can no longer serve as input to xargs:
user:/tmp/test$ git log --format=format:%H -L1,1:table.csv
e4a89a75e48623a1d2967996e6de3a250607e6a5
diff --git a/table.csv b/table.csv
--- a/table.csv
+++ b/table.csv
@@ -1,1 +1,1 @@
-bla,bla
+bla,bli
58b4b88800dd57cb1ca0476f1b9939781af28600
diff --git a/table.csv b/table.csv
--- /dev/null
+++ b/table.csv
@@ -0,0 +1,1 @@
+bla,bla
Putting the log on one line may not be possible when using -L, according to this answer:
"[...] git log --oneline -L 10,11:example.txt does work (it does however output the full patch)."
Answer
(First, big thanks for the reproducer—it was helpful—but one note: watch out, your quotes got mangled into "smart quotes" instead of plain double quotes. I fixed them.)
I would like to show only changes to the column headers of a csv file tracked by git.
Based on the example, by "column headers" I take it you mean "line 1".
The basic problem starts here:
git log --format=format:%H $FILE | ...
This finds, and prints the hash ID of, every commit that changes anything in the given file. (FILE needs to be set to table.csv here.) This is not at all what you want! Its only function is to skip any commit in which the file is entirely unchanged (which could be useful in real-world examples, but not so much in your reproducer, since every commit changes the file here).
(Side note: whenever possible, use git rev-list instead of git log. It's possible here. However, we're going to end up discarding git log / git rev-list anyway.)
... | xargs -L 1 git blame $FILE -L $LINE,$LINE
(Here, LINE needs to be set to 1.) The general idea seems to be to run git blame on one specific line (in this case line 1), which is fine as far as it goes, but isn't really what we want. If our left-side command, git log ... $FILE, had selected just the revisions we want, we would already have our answer and could stop there.
The real trick here is to run git blame repeatedly, but only until the blame "runs out". Each invocation of git blame should tell us who / which commit is "responsible for" (i.e., produced this version of) the given line, and that's exactly what git blame does. You give it a starting revision (or ending revision: Git works backwards, so we start at the end), and Git compares that version with the previous commit to see if the line in question changed in that version. If so, we're done: we print that version and the line. If not, we put the previous version in place and repeat. We do this until we run out of "previous versions", in which case we just print this version and stop.
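To make that backwards walk concrete, here is a toy model in Python. It is only a sketch: it does not call Git at all, it simply compares line N between consecutive snapshots rather than tracking lines through diffs the way git blame does, and the names HISTORY and line_changes are made up for this illustration. The snapshots mirror the three commits of the reproducer, labelled with their abbreviated hashes.

```python
from typing import List, Optional, Tuple

# Hypothetical in-memory history, oldest first: (commit label, file lines).
HISTORY: List[Tuple[str, List[str]]] = [
    ("58b4b88", ["bla,bla"]),
    ("e4a89a7", ["bla,bli"]),
    ("c518734", ["bla,bli", "1,2"]),
]


def get_line(lines: List[str], line_no: int) -> Optional[str]:
    """Return the 1-based line, or None if the file is too short."""
    return lines[line_no - 1] if len(lines) >= line_no else None


def line_changes(
    history: List[Tuple[str, List[str]]], line_no: int
) -> List[Tuple[str, str]]:
    """Walk from newest to oldest and report each commit that gave
    the chosen line a new value (toy stand-in for repeated blame)."""
    results: List[Tuple[str, str]] = []
    idx = len(history) - 1
    while idx >= 0:
        current = get_line(history[idx][1], line_no)
        if current is None:
            # The line does not exist this far back: nothing left to report.
            break
        # Step back while the older snapshot has identical text on this line.
        while idx > 0 and get_line(history[idx - 1][1], line_no) == current:
            idx -= 1
        # history[idx] is the oldest snapshot carrying this text: "blame" it.
        results.append((history[idx][0], current))
        # Continue from the snapshot just before the blamed one.
        idx -= 1
    return results


if __name__ == "__main__":
    for commit, text in line_changes(HISTORY, 1):
        print(commit, text)
```

For line 1 this reports e4a89a7 and then 58b4b88, once each: the "Add data" commit is skipped because it left the header alone.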
So git blame is already doing what you want. The only problem is that it stops after it finds one version to print. So what we really want is to build a loop:
do {
    rev, other-info, output = <what git blame does>
    print rev and/or output in appropriate format
} while other-info says there are previous revs
The way to deal with this is to use --porcelain (or --incremental, but --porcelain seems most appropriate here). We know that -L 1,1 (or -L $LINE,$LINE) is going to output a single line at the end. We want to collect the remaining lines. The output from --porcelain is described in the documentation: it's a series of lines of which, in our case, the first and last are of interest. The middle ones may or may not be interesting, except that previous or boundary is always of interest.
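As a rough illustration of that layout, the sketch below parses one porcelain record from a string. The SAMPLE text is hand-written and abbreviated (real output carries more header lines, such as the author and committer details), and parse_porcelain and previous_of are names invented for this example, not part of Git.

```python
from typing import Dict, Optional, Tuple

# Hand-written, abbreviated sample of one --porcelain record (an assumption
# for illustration; real output has additional header lines).
SAMPLE = (
    "e4a89a75e48623a1d2967996e6de3a250607e6a5 1 1 1\n"
    "author user\n"
    "summary version bli\n"
    "previous 58b4b88800dd57cb1ca0476f1b9939781af28600 table.csv\n"
    "filename table.csv\n"
    "\tbla,bli\n"
)


def parse_porcelain(
    text: str,
) -> Tuple[str, Dict[str, Optional[str]], Optional[str]]:
    """Split one record into (commit hash, header dict, line content)."""
    lines = text.split("\n")
    # First line: <hash> <original line> <final line> <lines in group>.
    rev = lines[0].split()[0]
    headers: Dict[str, Optional[str]] = {}
    content: Optional[str] = None
    for line in lines[1:]:
        if line.startswith("\t"):
            # The blamed line itself is prefixed with a single tab.
            content = line[1:]
        elif line:
            key, _, value = line.partition(" ")
            # Flag-style headers such as "boundary" carry no value.
            headers[key] = value or None
    return rev, headers, content


def previous_of(headers: Dict[str, Optional[str]]) -> Optional[Tuple[str, str]]:
    """Return (older revision, path) to blame next, or None to stop."""
    prev = headers.get("previous")
    if "boundary" in headers or prev is None:
        return None
    rev, path = prev.split(" ", 1)
    return rev, path


if __name__ == "__main__":
    rev, headers, content = parse_porcelain(SAMPLE)
    print(rev, content, previous_of(headers))
```

Keeping the parsing separate from the git call like this also makes it easy to test without a repository.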
Shell parsing is kind of messy, so it's probably best to use some other language to handle the output from git blame. For instance, we might use a small Python program. This one doesn't have many features but shows how to use --porcelain here, and should be easy to modify. It has been very lightly tested (and run through black for formatting and mypy for type checking), but definitely needs better error handling. For instance, running it with a nonexistent pathname gets you a fatal error message, but then a Python traceback. I leave the cleanup to someone else, at this point.
#! /usr/bin/env python3

"""
Analyze "git blame" output and repeat until we reach the boundary.
"""

import argparse
import subprocess
import sys


def blame(path: str, args: argparse.Namespace) -> None:
    rev = "HEAD"
    while True:
        cmd = [
            "git",
            "blame",
            "--porcelain",
            f"-L{args.line},{args.line}",
            rev,
            "--",
            path,
        ]
        # if args.debug:
        #     print(cmd)
        proc = subprocess.Popen(
            cmd, shell=False, universal_newlines=True, stdout=subprocess.PIPE,
        )
        assert proc.stdout is not None
        # first line: the commit hash followed by line-number fields
        info = proc.stdout.readline().split()
        rev = info[0]
        kws = {}
        match = None
        for line in proc.stdout:
            line = line.rstrip("\n")
            if line.startswith("\t"):
                # here's our match, there won't be anything else
                match = line
            else:
                # header line: "<key> <value>", or a bare "<key>" flag
                parts = line.split(" ", 1)
                kws[parts[0]] = parts[1] if len(parts) > 1 else None
        status = proc.wait()
        if status != 0:
            print(f"'{' '.join(cmd)}' returned {status}")
        # found something useful
        print(f"{rev}: {match}")
        if "boundary" in kws:
            break
        prev = kws.get("previous")
        if prev is None:
            # no "previous" header: there is no older version to blame
            break
        # "previous" holds "<commit hash> <path in that commit>"
        parts = prev.split(" ", 1)
        assert len(parts) == 2
        rev = parts[0]
        path = parts[1]


def main() -> int:
    parser = argparse.ArgumentParser("foo")
    parser.add_argument("--line", "-l", type=int, default=1)
    parser.add_argument("files", nargs="+")
    args = parser.parse_args()
    for path in args.files:
        blame(path, args)
    return 0


if __name__ == "__main__":
    try:
        sys.exit(main())
    except KeyboardInterrupt:
        sys.exit("\nInterrupted")
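As a starting point for the error-handling cleanup mentioned above, one possible approach is to wrap the subprocess call so that a failing command yields None instead of empty output that later code trips over. This is only a sketch; run_or_none is a hypothetical helper name, and how the caller should react to None is left open.

```python
import subprocess
import sys
from typing import List, Optional


def run_or_none(cmd: List[str]) -> Optional[str]:
    """Run cmd and return its stdout as text, or None if the command
    could not be started or exited with a non-zero status."""
    try:
        proc = subprocess.run(
            cmd, stdout=subprocess.PIPE, universal_newlines=True, check=False
        )
    except OSError as exc:
        # For instance, the executable is not installed or not on PATH.
        print(f"cannot run {cmd[0]}: {exc}", file=sys.stderr)
        return None
    if proc.returncode != 0:
        # The command's own error message has already gone to stderr.
        print(f"'{' '.join(cmd)}' returned {proc.returncode}", file=sys.stderr)
        return None
    return proc.stdout
```

Inside blame(), the loop could call this helper with the same cmd list and return as soon as it gets None, so that a nonexistent pathname ends with Git's fatal message rather than a traceback.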