I wonder how git commit
works internally. On that, I mean, how the plumbing git write-tree
works internally. I have the index with a couple of files, say:
% git ls-files --stage
100644 5376393fc691d519eb50baf65777249bf1fe1e15 0 dir1/file2
100644 e56e15bb7ddb6bd0b6d924b18fcee53d8713d7ea 0 file1
And I have two blobs inside my objects folder:
% ls -la .git/objects/
drwxr-xr-x 3 admin staff 96 May 5 16:47 53
drwxr-xr-x 3 admin staff 96 May 5 16:47 e5
As you can see, I need to create two trees in order to create a commit. So I write
git write-tree
and sure enough I got two more files inside my objects folder.
% ls -la .git/objects/
drwxr-xr-x 3 admin staff 96 May 5 16:52 22
drwxr-xr-x 3 admin staff 96 May 5 16:52 44
drwxr-xr-x 3 admin staff 96 May 5 16:47 53
drwxr-xr-x 3 admin staff 96 May 5 16:47 e5
My question is how to calculate this trees? What's the algorithm, that take only index file and create tree files?
Suppose my new index is this
100644 5376393fc691d519eb50baf65777249bf1fe1e15 0 dir1/file2
100644 5376393fc691d519eb50baf65777249bf1fe1e15 0 dir1/file5
100644 5376393fc691d519eb50baf65777249bf1fe1e15 0 dir1/dir2/file4
100644 5376393fc691d519eb50baf65777249bf1fe1e15 0 dir1/dir2/file3
100644 e56e15bb7ddb6bd0b6d924b18fcee53d8713d7ea 0 file1
I need to create a few trees from this index. I would start to make them from bottom to top. How do I know where is the bottom?
My guess is that we take everything before the first /
and look for other blobs(or trees) with that prefix. Then we create a tree out of it.
Then we take everything before second /
and repeat the process.
That looks performance bad to me.
What do you think?
Maybe you can show a piece of source code. (I was unable to find one)
Thanks!
CodePudding user response:
A tree
is a recursive structure: It can contain files and other tree
s. So to create "the tree" you will need some recursion:
- find all directories that only contain files and subdirectories you already have a tree for (in the first iteration this will only find directories without any subdirectories)
- make a tree out of each of those directories
- repeat until there are no more unprocessed directories left
The last tree you created is for the root directory and that is the tree you put into your commit.
There is a good explanation in the git documentation.
For optimization I would
- make the algorithm iterative
- sort all index-elements by their nesting-depth (descending) and paths
That way you can just go through the sorted list and create the trees as you go. There will never be the situation that you need to reference a tree you did not yet create.
That performance looks bad to me.
Since sorting is O(n log n)
the whole algorithm is of that complexity. Quite OK I think.
Edit:
As @torek mentioned, the index is already sorted so the complexity is really just O(n)
.
CodePudding user response:
I'll add an answer (besides my comments) that addresses this particular point:
I need to create a few trees from [an] index [that I created].
Git's so-called plumbing commands—including git write-tree
and git commit-tree
—let you build tree and commit objects piecemeal, for special purposes, by writing shell scripts. This is a lot easier, albeit less efficient, than writing compiled code like a C or Java or Go program (in Java you can use JGit, and in Go there's a Go package, all of which have a lot of this written for you, but we'll just address use of C Git through the shell here).
The git reset
, git add
, and git write-tree
commands, all of which can be used as plumbing (though git reset
and git add
aren't specifically meant as such) all accept and use an environment variable named GIT_INDEX_FILE
. This variable can be set to the path name of an existing file that contains a valid index, or a non-existent file that should be created holding a valid index. Hence, in a shell script, you can do this:
set -e
tf=$(mktemp)
rm -f "$tf " # an empty file is invalid
trap "rm -f \"$tf\"" 0 1 2 3 15 # clean up on exit
GIT_INDEX_FILE="$tf" git reset # build an index based on HEAD
GIT_INDEX_FILE="$tf" git add file1.ext
tree=$(git write-tree)
When (or rather if) this shell fragment finishes without failing, you've written out a Git tree object containing the current (HEAD
) commit plus one file file1.ext
. (If the current commit contains a file1.ext
, that file has been replaced with the one from the current working directory.)
You can use git hash-object -w
and git update-index
rather than git add
; the git update-index
plumbing command also obeys the GIT_INDEX_FILE
environment variable.
Should you desire to create a specific tree from scratch, you can use git mktree
, a plumbing command that reads git ls-tree
format input files, and writes a tree object.
You can set GIT_INDEX_FILE
and export it:
export GIT_INDEX_FILE="$tf"
which keeps it set in the environment for all future commands, if that's more convenient (but then you must unset GIT_INDEX_FILE
to undo that if you want Git to use the normal index file). In the above, I set it for each command that uses the index just to show which commands normally use "the" index (which is replaced with our chosen temporary file instead). (All C Git commands obey this $GIT_INDEX_FILE
setting if and when they use an index, unless they're doing their own special override-the-index-with-a-temporary internally.)
Note that all operations done with a temporary index and/or with git hash-object -w
and with a tree object that has not yet been committed or tagged or otherwise made reachable must be completed before git gc
might eat any of the created objects. As git prune
has a default grace period of 7 days and git gc
has one of 14 days, and git gc
is the usual eater of dead objects, you normally have 14 days to complete your task, which is usually sufficient in these days of multi-gigahertz CPUs.