To display changes to the first line of a csv file tracked by git, can git log be on one line when u

Time:08-12

I would like to show only changes to the column headers of a csv file tracked by git. I use the code in this nice answer by Kirill Müller. It works almost perfectly except that it repeats the lines even if the commit didn't actually change the first line of the file.

Reproducible code

cd /tmp/
mkdir test
cd test/
git init
echo "bla,bla" > table.csv
git add table.csv
git commit -m "version bla"
echo "bla,bli" > table.csv
git commit -am "version bli"
echo "1,2" >> table.csv
git commit -am "Add data"

Issue

user:/tmp/test$ FILE=table.csv
user:/tmp/test$ LINE=1
user:/tmp/test$ git log --format=format:%H $FILE | xargs -L 1 git blame $FILE -L $LINE,$LINE
e4a89a75 (user 2022-08-10 16:45:04 +0200 1) bla,bli
e4a89a75 (user 2022-08-10 16:45:04 +0200 1) bla,bli
^58b4b88 (user 2022-08-10 16:44:16 +0200 1) bla,bla

The issue is that commit e4a89a75 appears twice, even though the last commit ("Add data") didn't change the first line.

Expected output

e4a89a75 (user 2022-08-10 16:45:04 +0200 1) bla,bli
^58b4b88 (user 2022-08-10 16:44:16 +0200 1) bla,bla

What I tried

The log part of the instruction currently uses format:%H

user:/tmp/test$ git log --format=format:%H table.csv
c51873404aa45fb50fcbd6bd7ea06ab1e9f22071
e4a89a75e48623a1d2967996e6de3a250607e6a5
58b4b88800dd57cb1ca0476f1b9939781af28600

I tried adding the -L1,1:table.csv argument to the log command, but it formats the log differently, so the output can no longer serve as input to xargs:

user:/tmp/test$ git log --format=format:%H -L1,1:table.csv
e4a89a75e48623a1d2967996e6de3a250607e6a5
diff --git a/table.csv b/table.csv
--- a/table.csv
+++ b/table.csv
@@ -1,1 +1,1 @@
-bla,bla
+bla,bli

58b4b88800dd57cb1ca0476f1b9939781af28600
diff --git a/table.csv b/table.csv
--- /dev/null
+++ b/table.csv
@@ -0,0 +1,1 @@
+bla,bla

Putting the log on one line may not be possible when using -L according to this answer:

"[...] git log --oneline -L 10,11:example.txt does work (it does however output the full patch)."

CodePudding user response:

(First, big thanks for the reproducer—it was helpful—but one note: watch out, your quotes got mangled into "smart quotes" instead of plain double quotes. I fixed them.)

I would like to show only changes to the column headers of a csv file tracked by git.

Based on the example, by "column headers" I take it you mean "line 1".

The basic problem starts here:

git log --format=format:%H $FILE | ...

This finds, and prints the hash ID of, each occurrence of a commit that changes anything in the given file. (FILE needs to be set to table.csv here.) This is not at all what you want! Its only function is to completely skip any commit where the file is entirely un-changed (which could be a useful function in real world examples, but not so much in your reproducer since every commit changes the file here.)

(Side note: whenever it's possible, use git rev-list instead of git log. It's possible here. However, we're going to end up discarding git log / git rev-list anyway.)

... | xargs -L 1 git blame $FILE -L $LINE,$LINE

(Here, LINE needs to be set to 1.) The general idea here seems to be to run git blame on one specific line (in this case line 1), which is fine as far as it goes, but isn't really what we want. If our left-side command, git log ... $FILE, had selected just the revisions we want, those would already be the revisions we want and we could just stop here.

The real trick here is to run git blame repeatedly but only until the blame "runs out". Each invocation of git blame should tell us who / which commit is "responsible for" (i.e., produced this version of) the given line, and that's exactly what git blame does. You give it a starting (ending?—Git works backwards, so we start at the end and work backwards) revision, and Git checks that version and the previous commit to see if the line in question changed in that version. If so, we're done: we print that version and the line. If not, we put the previous version in place and repeat. We do this until we run out of "previous versions", in which case we just print this version and stop.

So git blame is already doing what you want. The only problem is that it stops after it finds the "previous version" to print. So what we really want is to build a loop:

do {
    rev, other-info, output = <what git blame does>
    print rev and/or output in appropriate format
} while other-info says there are previous revs
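
As a rough illustration of that loop's control flow only, here is a minimal Python sketch in which the blame step is passed in as a function. The names walk_line_history, blame_once and FAKE_HISTORY are hypothetical, and the fake history (with placeholder commit names c1 and c2) merely stands in for real git blame calls on the reproducer.

```python
from typing import Callable, Iterator, Optional, Tuple

# (commit that last touched the line, line text, previous commit or None)
BlameResult = Tuple[str, str, Optional[str]]


def walk_line_history(
    blame_once: Callable[[str], BlameResult], start: str = "HEAD"
) -> Iterator[Tuple[str, str]]:
    """Yield (commit, text) for each version of the line, newest first."""
    rev: Optional[str] = start
    while rev is not None:
        commit, text, previous = blame_once(rev)
        yield commit, text
        # Restart the blame from the commit before the one just reported.
        rev = previous


# Stand-in for "git blame" on line 1 of the reproducer (placeholder names):
# the newest commit only appended data, so blaming from HEAD reports c2.
FAKE_HISTORY = {
    "HEAD": ("c2", "bla,bli", "c1"),
    "c1": ("c1", "bla,bla", None),
}

history = list(walk_line_history(FAKE_HISTORY.__getitem__))
print(history)
```

Each version of the line is reported once, because every iteration jumps straight to the commit before the one that last changed it.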

The way to deal with this is to use --porcelain (or --incremental but --porcelain seems most appropriate here). We know that -L 1,1 (or -L $LINE,$LINE) is going to output a single line at the end. We want to collect the remaining lines. The output from --porcelain is described in the documentation: it's a series of lines with, in our case, the first and last being of interest, and the middle ones might be interesting, or might not, except that previous or boundary is always of interest.
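
To make the shape of that output concrete, here is a small sketch that parses the porcelain text for a single blamed line. The function name parse_porcelain is hypothetical, and SAMPLE is an abbreviated, hand-written stand-in built from the hashes in the question; real output carries more header lines (author-mail, author-time, committer details and so on).

```python
from typing import Dict, Optional, Tuple


def parse_porcelain(text: str) -> Tuple[str, Dict[str, Optional[str]], Optional[str]]:
    """Split porcelain output for one line into (commit, headers, content)."""
    lines = text.splitlines()
    # First line: "<hash> <original line> <final line> <lines in group>"
    commit = lines[0].split()[0]
    headers: Dict[str, Optional[str]] = {}
    content: Optional[str] = None
    for line in lines[1:]:
        if line.startswith("\t"):
            content = line[1:]  # the file's line, minus the leading TAB
        else:
            key, _, value = line.partition(" ")
            headers[key] = value or None  # "boundary" carries no value
    return commit, headers, content


# Abbreviated, illustrative sample (not captured from a real run).
SAMPLE = (
    "e4a89a75e48623a1d2967996e6de3a250607e6a5 1 1 1\n"
    "author user\n"
    "summary version bli\n"
    "previous 58b4b88800dd57cb1ca0476f1b9939781af28600 table.csv\n"
    "filename table.csv\n"
    "\tbla,bli\n"
)

commit, headers, content = parse_porcelain(SAMPLE)
previous = headers["previous"]
assert previous is not None
previous_commit, previous_path = previous.split(" ", 1)
print(commit[:8], content, previous_commit[:8], previous_path)
```

The previous header gives both the commit and the path to blame next, which matters if the file was renamed along the way.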

Shell parsing is kind of messy, so it's probably best to use some other language to handle the output from git blame. For instance, we might use a small Python program. This one doesn't have many features but shows how to use --porcelain here, and should be easy to modify. It has been very lightly tested (and run through black for formatting and mypy for type checking), but it definitely needs better error handling. For instance, running it with a nonexistent pathname gets you a fatal error message, but then a Python traceback. I leave the cleanup to someone else, at this point.

#! /usr/bin/env python3

"""
Analyze "git blame" output and repeat until we reach the boundary.
"""

import argparse
import subprocess
import sys


def blame(path: str, args: argparse.Namespace) -> None:
    rev = "HEAD"
    while True:
        cmd = [
            "git",
            "blame",
            "--porcelain",
            f"-L{args.line},{args.line}",
            rev,
            "--",
            path,
        ]
        # if args.debug:
        #    print(cmd)
        proc = subprocess.Popen(
            cmd, shell=False, universal_newlines=True, stdout=subprocess.PIPE,
        )
        assert proc.stdout is not None
        info = proc.stdout.readline().split()
        rev = info[0]
        kws = {}
        match = None
        for line in proc.stdout:
            line = line.rstrip("\n")
            if line.startswith("\t"):
                # here's our match, there won't be anything else
                match = line
            else:
                parts = line.split(" ", 1)
                kws[parts[0]] = parts[1] if len(parts) > 1 else None
        status = proc.wait()
        if status != 0:
            print(f"'{' '.join(cmd)}' returned {status}")

        # found something useful
        print(f"{rev}: {match}")
        if "boundary" in kws:
            break
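        prev = kws.get("previous")
        if prev is None:
            # neither "boundary" nor "previous": nothing earlier to blame
            # (e.g. the file was added in this non-root commit)
            break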
        parts = prev.split(" ", 1)
        assert len(parts) == 2
        rev = parts[0]
        path = parts[1]


def main() -> int:
    parser = argparse.ArgumentParser("foo")
    parser.add_argument("--line", "-l", type=int, default=1)
    parser.add_argument("files", nargs="+")
    args = parser.parse_args()
    for path in args.files:
        blame(path, args)
    return 0


if __name__ == "__main__":
    try:
        sys.exit(main())
    except KeyboardInterrupt:
        sys.exit("\nInterrupted")