How git write-tree works internally?-CodePudding

I wonder how git commit works internally. On that, I mean, how the plumbing git write-tree works internally. I have the index with a couple of files, say:

% git ls-files --stage
100644 5376393fc691d519eb50baf65777249bf1fe1e15 0   dir1/file2
100644 e56e15bb7ddb6bd0b6d924b18fcee53d8713d7ea 0   file1

And I have two blobs inside my objects folder:

% ls -la .git/objects/
drwxr-xr-x   3 admin  staff   96 May  5 16:47 53
drwxr-xr-x   3 admin  staff   96 May  5 16:47 e5

As you can see, I need to create two trees in order to create a commit. So I write

git write-tree

and sure enough I got two more files inside my objects folder.

% ls -la .git/objects/
drwxr-xr-x   3 admin  staff   96 May  5 16:52 22
drwxr-xr-x   3 admin  staff   96 May  5 16:52 44
drwxr-xr-x   3 admin  staff   96 May  5 16:47 53
drwxr-xr-x   3 admin  staff   96 May  5 16:47 e5

My question is how to calculate this trees? What's the algorithm, that take only index file and create tree files?

Suppose my new index is this

100644 5376393fc691d519eb50baf65777249bf1fe1e15 0   dir1/file2
100644 5376393fc691d519eb50baf65777249bf1fe1e15 0   dir1/file5
100644 5376393fc691d519eb50baf65777249bf1fe1e15 0   dir1/dir2/file4
100644 5376393fc691d519eb50baf65777249bf1fe1e15 0   dir1/dir2/file3
100644 e56e15bb7ddb6bd0b6d924b18fcee53d8713d7ea 0   file1

I need to create a few trees from this index. I would start to make them from bottom to top. How do I know where is the bottom?

My guess is that we take everything before the first / and look for other blobs(or trees) with that prefix. Then we create a tree out of it.

Then we take everything before second / and repeat the process.

That looks performance bad to me.

What do you think?

Maybe you can show a piece of source code. (I was unable to find one)

Thanks!

CodePudding user response：

A tree is a recursive structure: It can contain files and other trees. So to create "the tree" you will need some recursion:

find all directories that only contain files and subdirectories you already have a tree for (in the first iteration this will only find directories without any subdirectories)
make a tree out of each of those directories
repeat until there are no more unprocessed directories left

The last tree you created is for the root directory and that is the tree you put into your commit.

There is a good explanation in the git documentation.

For optimization I would

make the algorithm iterative
sort all index-elements by their nesting-depth (descending) and paths

That way you can just go through the sorted list and create the trees as you go. There will never be the situation that you need to reference a tree you did not yet create.

That performance looks bad to me.

Since sorting is O(n log n) the whole algorithm is of that complexity. Quite OK I think.

Edit:

As @torek mentioned, the index is already sorted so the complexity is really just O(n).

CodePudding user response：

I'll add an answer (besides my comments) that addresses this particular point:

I need to create a few trees from [an] index [that I created].

Git's so-called plumbing commands—including git write-tree and git commit-tree—let you build tree and commit objects piecemeal, for special purposes, by writing shell scripts. This is a lot easier, albeit less efficient, than writing compiled code like a C or Java or Go program (in Java you can use JGit, and in Go there's a Go package, all of which have a lot of this written for you, but we'll just address use of C Git through the shell here).

The git reset, git add, and git write-tree commands, all of which can be used as plumbing (though git reset and git add aren't specifically meant as such) all accept and use an environment variable named GIT_INDEX_FILE. This variable can be set to the path name of an existing file that contains a valid index, or a non-existent file that should be created holding a valid index. Hence, in a shell script, you can do this:

set -e
tf=$(mktemp)
rm -f "$tf "                           # an empty file is invalid
trap "rm -f \"$tf\"" 0 1 2 3 15        # clean up on exit
GIT_INDEX_FILE="$tf" git reset         # build an index based on HEAD
GIT_INDEX_FILE="$tf" git add file1.ext
tree=$(git write-tree)

When (or rather if) this shell fragment finishes without failing, you've written out a Git tree object containing the current (HEAD) commit plus one file file1.ext. (If the current commit contains a file1.ext, that file has been replaced with the one from the current working directory.)

You can use git hash-object -w and git update-index rather than git add; the git update-index plumbing command also obeys the GIT_INDEX_FILE environment variable.

Should you desire to create a specific tree from scratch, you can use git mktree, a plumbing command that reads git ls-tree format input files, and writes a tree object.

You can set GIT_INDEX_FILE and export it:

export GIT_INDEX_FILE="$tf"

which keeps it set in the environment for all future commands, if that's more convenient (but then you must unset GIT_INDEX_FILE to undo that if you want Git to use the normal index file). In the above, I set it for each command that uses the index just to show which commands normally use "the" index (which is replaced with our chosen temporary file instead). (All C Git commands obey this $GIT_INDEX_FILE setting if and when they use an index, unless they're doing their own special override-the-index-with-a-temporary internally.)

Note that all operations done with a temporary index and/or with git hash-object -w and with a tree object that has not yet been committed or tagged or otherwise made reachable must be completed before git gc might eat any of the created objects. As git prune has a default grace period of 7 days and git gc has one of 14 days, and git gc is the usual eater of dead objects, you normally have 14 days to complete your task, which is usually sufficient in these days of multi-gigahertz CPUs.