What problem is a Git --bare repo trying to solve?

Time:05-05

I think I understand the practical difference between a bare and a non-bare repo in Git, but I really don't get why, logically, this distinction exists: why did Git have to implement the concept of bare and non-bare repos? I know there are already tons of threads and articles about the topic, but I am really missing some concrete examples to fully understand it.

To recap, the practical difference (i.e. in terms of files) between a non-bare and a bare repo should be the following:

  • a non-bare repo is a combination of:

    1. a .git folder, that is, a "special" folder that Git, as software, uses to store all the data (e.g. blobs of the files of the project it is versioning) and metadata (e.g. history of commits) it needs to work properly.
    2. a working tree, that is, the actual files and folders that represent our project. One crucial thing to keep in mind is that the working tree isn't the only place where the content of your project is stored. Git also saves the data of your project inside the .git folder, in a special format, in internal locations such as .git/objects. The working tree exists because it's how everything that is not Git can work on (hence the name working tree) and edit the project files and folders.
  • a bare repo is:

    1. the contents of the .git subdirectory right in the main directory itself
    2. no working tree
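
To see the two layouts side by side, here is a small sketch that creates one repo of each kind in a scratch directory. The names project and project.git are made up for the demo, and the -b main option assumes Git 2.28 or later.

```shell
# Work in a scratch directory so nothing outside it is touched.
demo=$(mktemp -d) && cd "$demo" || exit 1

# Non-bare: Git's data lives in project/.git, next to the working tree.
git init -q -b main project
# Bare: the same internals sit directly in the top-level directory.
git init -q --bare -b main project.git

ls -A project      # only .git so far, since no files have been added
ls project.git     # HEAD, config, objects, refs, ... at the top level

git -C project rev-parse --is-bare-repository       # prints: false
git -C project.git rev-parse --is-bare-repository   # prints: true
```

The .git suffix on the bare directory is only a naming convention, not something Git requires.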

The question is: why do I need an intermediate bare repo to conveniently sync two non-bare repos? A lot of threads and articles answer that not having a central bare repo would leave the central working tree out of sync (see here). OK, but why? Can someone provide a concrete example?

The situation I can imagine is the following:

  1. A is a local repo, B is a local repo and C is the central remote repo. All are non-bare.
  2. A makes a commit c and pushes it to C. C updates its .git folder and its working tree according to the changes in c. Important: "updating the working tree" in my mind means replacing the C working tree (i.e. C's files and folders) with the A working tree.
  3. B pulls the changes from C and updates its .git folder and working tree according to the changes in c.

How can a situation like the one described above put the working tree of C out of sync? What does it even mean that C goes out of sync?

The only true advantage that I have understood so far is that, for services like GitHub or GitLab, not maintaining a working tree (i.e. having a bare repo) for each repo and for each branch is a very convenient way to save storage space. They can reconstruct the working tree on the fly using the Git tools.

CodePudding user response:

It's relatively simple, really. A bare repository has no working tree, therefore it cannot have an active checkout.[1] And, as you've seen elsewhere, the issue is that pushing to an active checkout of some branch results in an out-of-sync checkout. So Git forbids pushing to the checked-out branch.[2] Because a bare repository has no working tree, and therefore no checked-out branch, it sidesteps the problem.
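
If you want to watch Git refuse such a push, the following sketch sets up a non-bare server repo and a client clone in a scratch directory and tries to push to the checked-out branch. The directory names and the Demo identity are invented for the example, it assumes Git 2.28 or later for -b main, and it assumes receive.denyCurrentBranch has been left at its default.

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
# Throwaway identity so the commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# "server" is non-bare: main is checked out and has one commit.
git init -q -b main server
git -C server commit -q --allow-empty -m "initial"

# "client" clones it, makes a commit, and tries to push it back to main.
git clone -q server client
git -C client commit -q --allow-empty -m "client work"

git -C client push -q origin main 2>push.err
push_status=$?

echo "push exit status: $push_status"   # non-zero: the server refused
cat push.err                            # Git's explanation of the refusal
```

Repeating the same steps with a server created by git init --bare should let the push go through, since there is no checked-out branch to protect.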

What does it even mean that [the non-bare central repo] goes out of sync?

Let's dispense with the third machine: we only need client, a non-bare repository, and server, the repository that should be but isn't bare.

On server, branch main is actively checked-out. Someone may or may not be logged in to server and editing files there.

Meanwhile, on client, you've made some new commit and you run git push and send the new commit to server. If the server accepts this commit, there are now two possibilities:

  1. server's Git repository doesn't update the checked-out working tree, or
  2. server's Git repository does update the checked-out working tree.

Both situations can produce a bad outcome. Before we start on that, let's explore Git's workings a bit.


[1] This was true until Git 2.5 introduced git worktree add. So the simplicity was there once, and now it isn't.

[2] This was true in original Git, before the invention of various configuration items, and now it isn't either. (The receive.denyCurrentBranch stuff happened before the git worktree command, but I don't recall offhand which version that was.)


Git is about commits; commits are numbered; branch names find commits

A Git repository consists mainly of two databases, one usually much larger than the other. The larger database contains commits and supporting Git objects. The smaller database contains names, such as branch and tag names.

The commit objects are numbered, with the numbers expressed as hexadecimal hash IDs. Git needs the hash ID to find the commit: the big database is indexed solely by hash ID.

A commit itself contains two things:

  • a full snapshot of every file, as of the state it should have for that commit; and
  • metadata: information about the commit itself, such as who made it, when, and why (a log message).

In the metadata for any given commit, Git stores the raw hash ID(s) of the parent or parents of that commit. A commit therefore has a list of previous-commit hash IDs, stored in its metadata. This forms the history in the repository.

To be able to obtain the latest commit for any given branch, Git stores, in the branch name (e.g., refs/heads/main), the raw hash ID of the latest commit. That commit contains, in its metadata, the hash ID of the previous (parent) commit, which in turn contains another hash ID for another parent, and so on.
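
As a rough illustration of that chain, this sketch makes two empty commits in a throwaway repo and then follows the branch name to the tip commit and from there to its parent. The repo name, commit messages, and Demo identity are placeholders, and -b main assumes Git 2.28 or later.

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q -b main repo && cd repo || exit 1
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"

tip=$(git rev-parse refs/heads/main)   # hash ID stored under the branch name
parent=$(git rev-parse "$tip^")        # hash ID of the tip's parent commit

git cat-file -p "$tip"                 # raw commit: tree, parent, author, committer, message
git log --format='%H %s' main          # walks tip -> parent -> ... back through history
```

The parent line printed by git cat-file is the same hash ID that ends up in $parent, which is all the "history" there is: each commit pointing back at the previous one.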

When we use git checkout or git switch with a branch name, we're telling Git: extract the latest commit for that branch. That's the one whose hash ID is stored in the branch name. So with git switch main, Git looks up refs/heads/main, finds a hash ID such as a123456..., and looks up that commit in the database. That commit has a set of files associated with it. Git copies those files out of the commit—the ones in the commit aren't generally usable by the OS, as they're in a read-only, compressed, Git-only, de-duplicated form—to your working tree.

But, Git also copies the files—or rather, information about the files (names and blob hash IDs)—into Git's index, which goes along with the working tree. This defines which files are tracked, helps Git go fast, and is generally needed to know what to put in the next commit.

Once this is all in place, Git sets up the special name HEAD to contain the branch name. (In original Git, this was a symbolic link to the refs/heads/main file, but as with many bits of Git, that was done away with more than a decade ago.)

There's now a group of well-defined, carefully-coordinated data:

  • HEAD contains the current branch name;
  • the branch name contains the hash ID;
  • Git's index contains the file names and blob hash IDs for the next commit, tracking the files in the working tree; and
  • the working tree contains the files copied out of the commit.
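
Each of those four pieces can be inspected from the command line. The sketch below does so in a scratch repo right after a commit, when everything agrees. The file hello.txt and the Demo identity are made up for the demo, and -b main assumes Git 2.28 or later.

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q -b main repo && cd repo || exit 1
echo "hello" > hello.txt
git add hello.txt
git commit -q -m "first"

git symbolic-ref HEAD    # HEAD holds the branch name: refs/heads/main
git rev-parse main       # the branch name holds a commit hash ID
git ls-files -s          # the index: mode, blob hash ID, stage number, file name
cat hello.txt            # the working tree copy of the file: hello
git status --short       # prints nothing, because all the pieces agree
```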

You work on the files, run git add to tell Git to update what's in Git's index, and eventually run git commit. At this point Git:

  • reads HEAD and then the branch name to find the current commit;
  • collects all necessary metadata;
  • packages up the index's content;
  • writes all of this to a new commit, which gets a new unique hash ID; and
  • writes the hash ID into the branch name.
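
Those steps can be approximated by hand with Git's lower-level "plumbing" commands. This is only a sketch of the idea, not what git commit literally runs internally (it skips hooks and the reflog message, among other things). The file name and Demo identity are invented, and -b main assumes Git 2.28 or later.

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q -b main repo && cd repo || exit 1
echo "v1" > hello.txt && git add hello.txt && git commit -q -m "first"

# Edit a file and update the index, as usual.
echo "v2" > hello.txt
git add hello.txt

parent=$(git rev-parse HEAD)       # read HEAD, then the branch name, to find the current commit
tree=$(git write-tree)             # package up the index's content as a tree object
new=$(git commit-tree -p "$parent" -m "second" "$tree")   # write the new commit (metadata + snapshot)
git update-ref refs/heads/main "$new"                     # write the new hash ID into the branch name

git log --format='%h %s' main      # "second" now sits on top of "first"
git status --short                 # prints nothing: everything is still coordinated
```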

The carefully-coordinated data lets Git do all of this, and it is still carefully coordinated afterward.

If someone else commits while you're working ...

Now suppose we're on the server, working, and someone on the client commits. That's not a problem because the client Git repository has its own branch names. It gets a new commit in its commit database, and its branch name main stores a new hash ID. But over here on server, our Git databases are unchanged.

But if they now run git push origin main and send their commit, our Git has to either accept their commit or reject it. If we reject it, that's fine: our databases remain unchanged and everything is still coordinated.

Let's say that instead, though, we accept the push. The server Git updates refs/heads/main to store their commit hash ID. Our two possibilities are:

  1. don't update index and working tree;
  2. do update index and working tree.

If we choose possibility #1, then we have a "stale checkout": our files are from the previous commit. But the branch name holds the new commit hash ID. So we're out of sync. If we update any files and then commit, we'll revert the other guy's work (remember that our Git software uses what's in our index, which matches our working tree). That's not great, so let's move on to option 2.
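
Here is a sketch that reproduces the stale checkout in a scratch directory. It sets receive.denyCurrentBranch to ignore on the server so the push is accepted without touching the index or working tree. The directory names, file names, and Demo identity are made up, and -b main assumes Git 2.28 or later.

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Non-bare server with main checked out, told to accept pushes to it anyway.
git init -q -b main server
echo "v1" > server/file.txt
git -C server add file.txt
git -C server commit -q -m "v1"
git -C server config receive.denyCurrentBranch ignore

# The client changes file.txt and pushes.
git clone -q server client
echo "v2" > client/file.txt
git -C client commit -q -am "v2"
git -C client push -q origin main

cat server/file.txt                 # prints: v1  (working tree is stale)
git -C server show main:file.txt    # prints: v2  (the branch name has moved)
git -C server status --short        # "M  file.txt": a staged change that undoes v2

# Someone on the server commits unrelated work from the stale index.
echo "note" > server/other.txt
git -C server add other.txt
git -C server commit -q -m "server work"
git -C server show main:file.txt    # prints: v1  (the client's change is reverted)
```

Nothing was lost from the commit database: the client's commit is still there as the parent of "server work". But the newest snapshot on main has silently gone back to the old file content.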

If we choose option 2, our files get ripped away from us and replaced. Our index and working tree are re-synchronized with the updated branch name. That's better ... except, if we're actively working on some file, what happens to our work? Maybe our editor notices that the underlying file has changed and gives us a chance to fix things. Maybe it just overwrites the underlying file. Either way, it's likely to be a problem.

So, updating the working tree of a server's repository is perhaps better than not doing that, and that's what receive.denyCurrentBranch's updateInstead setting does. It's not perfect, though. "Perfect" is simply not having a working tree, so that nothing can go wrong, and we get that with --bare.
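
A quick sketch of updateInstead in action, again in a scratch directory with invented names and assuming Git 2.28 or later for -b main. As far as I know from Git's documentation, updateInstead only updates a clean working tree and refuses the push otherwise, so the second push below is expected to fail. That protects edits already saved to disk, not unsaved changes sitting in someone's editor.

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q -b main server
echo "v1" > server/file.txt
git -C server add file.txt
git -C server commit -q -m "v1"
git -C server config receive.denyCurrentBranch updateInstead

git clone -q server client
echo "v2" > client/file.txt
git -C client commit -q -am "v2"
git -C client push -q origin main
cat server/file.txt                 # prints: v2  (working tree updated by the push)

# Now someone edits the file on the server without committing.
echo "half-finished edit" >> server/file.txt

echo "v3" > client/file.txt
git -C client commit -q -am "v3"
git -C client push -q origin main 2>push.err
dirty_push_status=$?
echo "push with dirty server tree, exit status: $dirty_push_status"   # non-zero
```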
