Git: Show the committed contents of all files that changed between two revisions

Given a commit range <old_sha>..<new_sha>, I want the committed contents (not the list of changed files, and not their diffs), as of <new_sha>, of all files that changed between <old_sha> and <new_sha>. I can't simply read from disk because I want to ignore un-committed changes.

Additional constraints:

I don't want to modify the working directory as part of this operation, so I can't, e.g., stash unstaged changes, read from disk, and then git stash pop.
Must be relatively performant in a very large repo (it'll be executed in a pre-push hook).

The closest I've gotten is:

Get the list of changed files
Run git show HEAD:path/to/file_a HEAD:path/to/file_b, which writes the committed file contents to stdout, but without a delimiter, so there's no way of programmatically knowing where each file begins and ends.

CodePudding user response：

The contents of some committed file are only available by reading out that file ("blob object", in Git terms). So given two commit hashes $O and $N (old and new, though either order is fine) you will need to:

run git diff-tree -r --name-status $O $N (or anything equivalent—see below) and read its output to determine which files you care about;
extract all of those blobs.
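The two steps above can be sketched as a pair of small shell functions. This is only a minimal sketch under some assumptions: the names changed_in_new and show_each and the "==> path <==" header format are made up for illustration, and paths are assumed to contain no tabs, newlines, or other characters that Git would quote in its output.

```shell
# changed_in_new OLD NEW  (hypothetical helper name)
# Print the paths that differ between OLD and NEW and still exist in NEW.
changed_in_new() {
    # --name-status prints "<status><TAB><path>"; drop D (deleted) entries,
    # since those files have no contents in NEW.
    git diff-tree -r --name-status "$1" "$2" |
        awk -F '\t' '$1 != "D" { print $2 }'
}

# show_each REV  (hypothetical helper name)
# Read paths on stdin and print each file as committed in REV,
# preceded by a header line so the output can be split again.
show_each() {
    rev=$1
    while IFS= read -r path; do
        printf '==> %s <==\n' "$path"
        git show "$rev:$path"
    done
}
```

Something like changed_in_new "$O" "$N" | show_each "$N" would then print every changed file as of $N. It runs one git show per file, so it is likely to be slow when many files changed, and a file whose contents happen to contain the header text would confuse a parser.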

Your git show method could work (replace HEAD:path/to/file_a with $N:path/to/file_a unless you're guaranteed that $N and HEAD refer to the same commit), but you'll need one git show command per file to know where each file's contents begin and end. Consider instead using git read-tree and git checkout-index with a temporary index:

tf=$(mktemp)
# an empty file, as made by mktemp, is unsuitable, but a nonexistent
# one is OK, so:
rm -f "$tf"

Assuming you want the old versions of the files too (if not, trim this to just one $N case):

GIT_INDEX_FILE="$tf" git read-tree "$O"
# you can now `git checkout-index` files using $tf as the index

If you have a \0-delimited or newline-delimited list of file names in a file named $filelist, consider using --stdin and a temporary (empty) working tree here, e.g.:

td=$(mktemp -d)
GIT_WORK_TREE="$td" GIT_INDEX_FILE="$tf" git checkout-index --stdin < "$filelist"

Add the -z option for \0-delimited files, to protect against the possibility of newlines in file names; this is useful if you, e.g., write a Python script to run the git diff-tree with its own -z option. Remember to filter out any files that did not exist in $O but do exist in $N (these have status A in the git diff-tree output, which is why we want --name-status).

The temporary directory is now populated with the interesting files from $O.
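The read-tree and checkout-index steps can be combined into one function that takes the file list on stdin. This is a sketch, not a definitive implementation: the name export_paths is hypothetical, it uses checkout-index's --prefix option in place of the GIT_WORK_TREE setting shown above, and it assumes it is run from the top level of the repository (as a pre-push hook normally is) so that the paths resolve correctly.

```shell
# export_paths REV DESTDIR  (hypothetical helper name)
# Read NUL-delimited paths on stdin and write each file, as committed
# in REV, underneath DESTDIR. The real index and working tree are untouched.
export_paths() {
    idx=$(mktemp) || return 1
    rm -f "$idx"    # git wants a nonexistent index file, not an empty one
    GIT_INDEX_FILE=$idx git read-tree "$1" &&
        GIT_INDEX_FILE=$idx git checkout-index -z --stdin --prefix="$2/"
    status=$?
    rm -f "$idx"
    return "$status"
}
```

For the old side, the list could come from git diff-tree -r -z --name-only --diff-filter=a "$O" "$N" | export_paths "$O" "$td", where the lowercase a excludes Added files. For the new side, --diff-filter=d excludes Deleted ones instead.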

Repeat with a second temporary directory to populate one with interesting files from $N; remember to include Added files and omit Deleted ones. If you don't explicitly enable the rename and/or copy detectors during the git diff-tree step, you won't have to deal with R and/or C status files, which is a big plus. The cost is that renamed and/or copied files will not be detected as such (a rename shows up as one Deleted file plus one Added file), which may matter if you need to match a file's old name to its new one.

If you do need both old and new commits, note that you can use git read-tree and git checkout-index as above but without any constraints on file names (git checkout-index -a), which gives you both commits' entire trees. You can then use plain diff -r or any other recursive tree-walker you like. This will be slower in many cases (where most files are identical), but might be more convenient.

You now have the various files' contents in two temporary directories, where you can do whatever you wish with them. To clean up, simply remove the two temporary directories and the temporary index.
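A sketch of the setup and cleanup, assuming the temporary index path is held in tf as above. The function names and the variables td_old and td_new are chosen here for illustration only.

```shell
# scratch_setup: create the scratch locations (hypothetical helper name).
scratch_setup() {
    tf=$(mktemp) && rm -f "$tf" &&   # index path must not exist yet
        td_old=$(mktemp -d) &&       # files as of the old commit
        td_new=$(mktemp -d)          # files as of the new commit
}

# scratch_cleanup: remove the two directories and the temporary index.
scratch_cleanup() {
    rm -rf "$tf" "$td_old" "$td_new"
}
```

In a hook script, calling scratch_setup and then registering trap scratch_cleanup EXIT should remove the scratch files even if a later step fails.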

This whole process probably won't be tremendously quick (though using an SSD for the temporary files and directories should help a lot) but it will probably be about the best you can do.

CodePudding user response：

Faster than @torek's but only if you've got a huge number of files in a full checkout:

old=@^ new=@ # e.g.

scratchindex=`mktemp -ut $USER.index.$$.XXXXXX`
scratchwork=`mktemp -dt $USER.worktree.$$.XXXXXX`

git diff-tree -r --diff-filter=d $old $new \
| sed -E 's,[^ ]* ([^ ]*) [^ ]* ([^ ]*) ..(.*),\1 \2\t\3,' \
| GIT_INDEX_FILE=$scratchindex git update-index --index-info

GIT_INDEX_FILE=$scratchindex GIT_WORK_TREE=$scratchwork \
        git checkout-index -a

This loads an index containing only the changed files, which helps noticeably when you've got a project on the order of the Linux kernel or Blender, with index files that take time to load from scratch. Your output is in $scratchwork.
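The same idea can be wrapped in a function. This is a variant sketch rather than the answer's exact commands: the name export_changed_sparse is hypothetical, awk stands in for the sed expression (whose \t in the replacement may not be portable beyond GNU sed), and --prefix stands in for GIT_WORK_TREE. It assumes no rename or copy detection is enabled, so each raw line carries exactly one path.

```shell
# export_changed_sparse OLD NEW DESTDIR  (hypothetical helper name)
# Write the NEW version of every file that differs from OLD under DESTDIR,
# using a scratch index that holds only those files.
export_changed_sparse() {
    idx=$(mktemp) || return 1
    rm -f "$idx"    # start from a nonexistent index file
    # Raw diff-tree lines look like
    #   :<oldmode> <newmode> <oldsha> <newsha> <status><TAB><path>
    # Rewrite each one as "<newmode> <newsha><TAB><path>", which is a form
    # that git update-index --index-info accepts.
    git diff-tree -r --diff-filter=d "$1" "$2" |
        awk -F '\t' '{ split($1, m, " "); printf "%s %s\t%s\n", m[2], m[4], $2 }' |
        GIT_INDEX_FILE=$idx git update-index --index-info &&
        GIT_INDEX_FILE=$idx git checkout-index -a --prefix="$3/"
    status=$?
    rm -f "$idx"
    return "$status"
}
```

A call such as export_changed_sparse "$old" "$new" "$scratchwork" should leave only the changed files, as of $new, in the destination directory.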