Why does the "fetch" line in git config affect how `git branch --set-upstream-to` works?

It took me a while to understand the root cause of the "strange" results I noticed recently with my repo. Below is a simple reproduction; as I understand it, it works with any repo that has a remote origin and a main branch:

git branch --set-upstream-to=origin/main main
Branch 'main' set up to track remote branch 'main' from 'origin'.

Now edit .git/config

[remote "origin"]
    url = /tmp/test/p0-bkup
was:
    fetch = +refs/heads/*:refs/remotes/origin/*
replace with:
    fetch = +refs/*:refs/*

Result is now different:

git branch --set-upstream-to=origin/main main
Branch 'main' set up to track remote ref 'refs/remotes/origin/main'.

And in config now:

[branch "main"]
    remote = origin
    merge = refs/remotes/origin/main

not merge = refs/heads/main

For those curious why I think this matters: fetch = refs/*:refs/* is what ends up in the config after, e.g., git clone --mirror.

Why does the fetch line affect git branch --set-upstream-to? I thought that line only controls what git fetch does. TIA. Bonus questions: how is the fetch line processed to produce each of these results? Is there any use case, in a mirrored repo, for the way it works?

Added:

Below is the script I now use to test git by creating simple repos in /tmp; the output relevant to my question is at the end:

#!/bin/bash

test_dir=/tmp/test
if [ -d $test_dir ] ; then rm --recursive --dir --force $test_dir/* ; else mkdir $test_dir; fi
file_name=text.txt
cd $test_dir

edit_file_and_commit(){
    echo "$1" | tee --append $file_name
    git add $file_name
    git commit -m "$1"
}

print_and_run(){
    echo next line: "$1"
    $1 # echo "$($1)"
}

print_and_run_2() {
    printf "next line:"
    printf " '%s'" "$@"
    printf "\n"
    echo "$("$@")"
}

pause(){ echo; echo pause, type exit to end the pause; bash; }

echo;echo now in p0
mkdir p0 && cd $_
git init
touch $file_name
edit_file_and_commit "commit main-1 in p0"
git branch --move main # rename in case default is master
git checkout -b devel
edit_file_and_commit "commit devel-1 in p0"
git checkout main
edit_file_and_commit "commit main-2 in p0"

touch text2
git add text2
git commit -m "text2"
git checkout devel

echo;echo now in p1
cd $test_dir
mkdir p1 && cd $_
git clone $test_dir/p0 .

mv $test_dir/p0 $test_dir/p0-bkup

git branch -a
git checkout main
git branch -a
ls -al

git remote set-url origin $test_dir/p0-bkup
git push
echo next git remote -v 
git remote -v
echo next ls -la $test_dir/p1/.git/refs/remotes/origin
ls -la $test_dir/p1/.git/refs/remotes/origin
echo I see HEAD only, AFAIK not as in fully set repo
git branch -vv

git checkout main
edit_file_and_commit "commit main-1 in p1"
git checkout -b qa

echo;echo now in p2
cd $test_dir
mkdir p2 && cd $_
git clone --mirror $test_dir/p1 ./.git
git config core.bare false
git status
git checkout main
git status
edit_file_and_commit "commit main-1 in p2"

print_and_run 'cat ./.git/config'
print_and_run 'git branch --set-upstream-to=origin/main main'
print_and_run 'cat ./.git/config'
print_and_run_2 sed --in-place -- "s|fetch = +refs/\*:refs/\*|fetch = +refs/heads/*:refs/remotes/origin/*|" .git/config
set -x # Trace each command as it is executed.
cat ./.git/config
git branch --set-upstream-to=origin/main main
cat ./.git/config

set +x

exit

Answer:

Now edit .git/config [to set the fetch line to read refs/*:refs/*]

Don't do that. That turns your regular repository into a mirror that discards your own work.¹

Why does the fetch line affect git branch --set-upstream-to?

This is rather complicated, for historical reasons. Let me end this section and, first, dive into technical details about how the upstream is represented.

¹Technically, you keep your own commits. What happens is that when you make commits that update your own refs, and then run git fetch, you lose the updated refs that let you find your own commits. So while your new commits exist, you can no longer find them. There are some workarounds, but this is generally just the wrong thing to do.

Representing the upstream of a branch

As I've noted elsewhere, each (local) branch name can have one (1) upstream setting. At the level one normally uses for interacting with Git, the upstream is a simple string, like origin/xyzzy for branch xyzzy, and we use git branch --set-upstream-to to set it, or git branch --unset-upstream to clear out the setting.

Internally, however, the upstream of branch xyzzy is represented in the .git/config file as two values:

branch.xyzzy.remote
branch.xyzzy.merge

When the upstream of branch xyzzy is origin/xyzzy, these two are normally set like this:

[branch "xyzzy"]
    remote = origin
    merge = refs/heads/xyzzy

Note that this is not merge = refs/remotes/origin/xyzzy, but rather merge = refs/heads/xyzzy. Why? Well, this requires even more history; a brief sidebar here is appropriate as well.
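As a small illustration of the two-key layout above, here is a sketch that writes and reads both values with git config. It uses a scratch file in place of .git/config (the cfg variable and the temporary file are assumptions made for the demo, not part of any real repository):

```shell
# Write and read the two keys that together encode an upstream.
# A scratch file stands in for .git/config, so no real repository is touched.
cfg=$(mktemp)

git config --file "$cfg" branch.xyzzy.remote origin
git config --file "$cfg" branch.xyzzy.merge refs/heads/xyzzy

remote=$(git config --file "$cfg" --get branch.xyzzy.remote)
merge=$(git config --file "$cfg" --get branch.xyzzy.merge)

echo "remote=$remote merge=$merge"
cat "$cfg"      # shows the [branch "xyzzy"] section as stored on disk
rm -f "$cfg"
```

In a real repository you would normally let git branch --set-upstream-to write these keys rather than setting them by hand.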

Sidebar: Refnames, branches, and tags, oh my!

A Git repository is, internally, primarily just a big database of Git objects. These objects store commits, which—at the Git object level—consist entirely of metadata that references one tree object and arbitrarily many other, previous commit objects. The single distinguished tree object then, indirectly via more trees and/or blob objects, stores a full snapshot of your source files. Since our lowest-level goal with Git is to store every version of every file ever committed, this accomplishes that goal.

Git finds its internal objects by hash IDs, e.g., c48035d29b4e524aed3a32f0403676f0d9128863. These things are decidedly human-hostile (the opposite of human-friendly). To let Git users refer to commits by name, as humans prefer to do, and also to solve several other problems all at once, Git allows us to use names. Each Git repository therefore has a second, separate database, in which a name (a not quite arbitrary ASCII or Unicode text string) stores one hash ID.

The names are divided up into name spaces. Most names, by far, start with the literal string refs/, which is then typically followed by a name-space-selector string such as heads/ or tags/ or remotes/. These three distinguished names represent branch names, tag names, and remote-tracking names respectively.
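To make the name-space split concrete, here is a minimal sketch that classifies a full ref name by its prefix. classify_ref is a hypothetical helper written for this answer, not a Git command, and the sample names are invented:

```shell
# Classify a fully qualified ref name by its name-space prefix.
# classify_ref is a hypothetical helper, not part of Git.
classify_ref() {
    case "$1" in
        refs/heads/*)   echo "branch" ;;
        refs/tags/*)    echo "tag" ;;
        refs/remotes/*) echo "remote-tracking" ;;
        refs/*)         echo "other ref" ;;
        *)              echo "not under refs/" ;;
    esac
}

classify_ref refs/heads/main            # prints: branch
classify_ref refs/tags/v1.0             # prints: tag
classify_ref refs/remotes/origin/main   # prints: remote-tracking
```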

Note that, regardless of the name-space in which the name resides, all of these entries are purely local to your own Git repository. Any other Git repository has its own separate names database. The names in your names database need not match the names in their names database.

Note: the objects in your Git object database exist as local copies as well, but the hash IDs in your database match the hash IDs in every other Git database, in that if you and they have an object with, e.g., hash ID c48035d29b4e524aed3a32f0403676f0d9128863, it's literally the same object. That is, the Git hash ID is a universally (or globally) unique identifier or UUID. While this can't actually work forever (see How does the newly found SHA-1 collision affect Git?), it works well enough in practice, and it allows two different Git repositories to exchange only new objects during a git fetch or git push operation. The key observation here—this is not relevant to your question but is something every Git user should know—is this: objects are shared across different Git repositories, but names are not.

Historic Git

Early Git did not have remote-tracking names; in fact, it did not even have remotes at all. You never had an origin, you just had to type in the URL over and over again. This was very obviously a problem and several different solutions were attempted. Remnants of these attempts live on: see the git fetch documentation and scroll down to the REMOTES section, where the documentation mentions three ways to list a "remote". The first method—a remote in the Git configuration file—is the only method one should use in modern Git.

Before remotes existed, though, one would run:

git fetch <url>

with an explicit URL every time, and your Git software would call up the other Git software as usual and they'd list out their refs. Your Git would pore over their ref-names, as listed by their Git software, and select which one(s) seemed interesting, and observe the corresponding object UUIDs. Then your Git would ask their Git to package up and send any missing objects, and would write, to .git/FETCH_HEAD, the various branch and tag names that it found interesting, plus their hash IDs and such.

The git fetch command still does all of this to this day. The FETCH_HEAD file format is unchanged from this early proto-Git. The two other long-dead methods of defining a "remote" still work too. This all tells us something important about the Git authors' dedication to backwards compatibility: they keep it long past the point of usefulness, even if it seems weird. This is one of the keys to your answer.

Now, having run git fetch, we tend to like to do something with any new commits we've obtained. This means running git merge or git rebase. As a convenience, long ago, git pull was a simple shell script that:

1. ran git fetch with the URL and any additional branch name arguments, then
2. fished out the interesting line(s) from .git/FETCH_HEAD and ran git merge with one or more raw hash IDs and an appropriate -m option to produce a merge message.

(The script was made fancier pretty quickly, to be able to run git rebase instead of git merge if you wanted, and now—more than a decade later—is a fancy C program that incorporates all of the fetch, merge, and rebase code in a single executable. But it's still backwards compatible!)

Now, in those old days, to keep everything simple, the git fetch command would write not-for-merge to each .git/FETCH_HEAD line that was not supposed to have a hash ID passed to git merge, and would leave that marker off the lines whose hash ID was supposed to be passed to git merge. This meant a simple grep -v not-for-merge picked out the right lines. But to put those lines into .git/FETCH_HEAD, git fetch needed to know which branch name was "for merge". So, once remotes were invented, git fetch would use these two settings:

branch.xyzzy.remote = origin
branch.xyzzy.merge = refs/heads/xyzzy

That told git fetch and/or git pull:

- which remote to use to get the URL;
- which branch (refs/heads/xyzzy) was the interesting one, "for merge", on that remote.

This explains why the upstream setting in .git/config is convoluted. It's spelled out in two parts, with the remote value for git fetch or git pull to look up early, and with the merge setting for git fetch to look up later while writing .git/FETCH_HEAD. The merge setting in particular lists the name of the branch as seen on the remote.
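The not-for-merge filtering described above can be sketched roughly as follows. This is not the original git pull script: the sample file is built by hand in a scratch directory, the hashes and URL are made up, and the tab-separated layout is my assumption about what a FETCH_HEAD line looks like:

```shell
# Build a sample FETCH_HEAD in a scratch directory. The hashes and the URL
# are made up for illustration; fields are separated by tabs.
tmp=$(mktemp -d)
printf '%s\t\t%s\n' \
    1111111111111111111111111111111111111111 \
    "branch 'xyzzy' of /tmp/example" > "$tmp/FETCH_HEAD"
printf '%s\tnot-for-merge\t%s\n' \
    2222222222222222222222222222222222222222 \
    "branch 'devel' of /tmp/example" >> "$tmp/FETCH_HEAD"

# Keep only the lines meant for merging, then take the hash (first field).
for_merge=$(grep -v 'not-for-merge' "$tmp/FETCH_HEAD" | cut -f1)
echo "would merge: $for_merge"
rm -rf "$tmp"
```

Only the xyzzy line survives the filter, which is the one that branch.xyzzy.merge marked as interesting.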

Remote-tracking names and refspecs

With remotes came remote-tracking names.² The idea behind the remote-tracking name is simple enough: now that we have a name for another Git repository, such as origin, why not have git fetch record their branch names and hash IDs locally? Sure, the values are in .git/FETCH_HEAD after a fetch, but then they're overwritten as soon as we fetch from some other Git repository.³

Of course, our branch names are our branch names. We can't have our Git overwrite our branch names! They're ours! So we have to have some other form of name, i.e., the remote-tracking form. We'll take "their" branch name xyzzy, i.e., refs/heads/xyzzy, and turn it into our remote-tracking name origin/xyzzy, i.e., refs/remotes/origin/xyzzy. By doing this for every branch name on the remote, we guarantee that our branch names and their branch names never collide. That's the essential function of a namespace in the first place (see the Wikipedia article again if necessary).

But we need a mechanism for doing this name-rewrite. One obvious choice for this mechanism would be just to hard-code it. For whatever reason,⁴ this isn't what Git ended up with. Instead, the remote setting provides what Git calls a refspec, which is that refs/heads/*:refs/remotes/origin/* that you observed in your question.

This refspec maps names in one repository to names in another repository. By listing more than one refspec, e.g.:

fetch = +refs/heads/*:refs/remotes/origin/*
fetch = +refs/notes/*:refs/remotenotes/origin/*

we can get Git to copy more names. Whatever we want, refspecs can do it—well, up to some point, depending on how fancy the * match is.⁵
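Here is a simplified model of how a refspec rewrites a name from the remote into a name in your repository. map_ref is a hypothetical helper, not Git's implementation; it assumes exactly one * on each side of the colon and ignores the leading + (which only means "force"):

```shell
# map_ref REFSPEC NAME: print the destination that NAME maps to, or
# return 1 if NAME does not match the source side of the refspec.
# Hypothetical helper; assumes exactly one "*" on each side of the colon.
map_ref() {
    spec=${1#+}                 # drop a leading "+" (force flag)
    src=${spec%%:*}             # pattern on the remote side
    dst=${spec#*:}              # pattern on the local side
    src_pre=${src%%"*"*}; src_suf=${src#*"*"}
    dst_pre=${dst%%"*"*}; dst_suf=${dst#*"*"}
    case "$2" in
        "$src_pre"*"$src_suf")
            stem=${2#"$src_pre"}        # the part that "*" matched
            stem=${stem%"$src_suf"}
            echo "$dst_pre$stem$dst_suf" ;;
        *) return 1 ;;
    esac
}

map_ref '+refs/heads/*:refs/remotes/origin/*'     refs/heads/xyzzy
# prints: refs/remotes/origin/xyzzy
map_ref '+refs/notes/*:refs/remotenotes/origin/*' refs/notes/commits
# prints: refs/remotenotes/origin/commits
map_ref '+refs/*:refs/*'                          refs/heads/main
# prints: refs/heads/main
```

The last call shows the mirror refspec: every name maps to itself, so their refs/heads/main lands directly on your refs/heads/main.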

²I was late enough coming to Git, around 1.5.7 perhaps, that I don't know if they came in two separate steps, though I think this was the case. They were still pretty close in time, at least.

³You can use git fetch --append, aka git fetch -a, to append to .git/FETCH_HEAD; this clearly existed for use before remote-tracking names were invented. As with the other backwards-compatibility features—including .git/FETCH_HEAD itself, at this point—it no longer serves any real purpose.

⁴I don't know the true reason, but I can speculate: at the time remotes and remote-tracking names were being discussed and developed, I have no doubt that Git mirrors were also being discussed and developed. The mechanism chosen serves both purposes.

⁵Originally, * in a refspec could only match whole "component names", spanning slashes but always coming "adjacent to" some slash, e.g., refs/heads/* is allowed and includes names like refs/heads/feature/short. This was later extended to allow "partial component" matches, such as refs/heads/pr-*. The rules for refspec matching are messy overall and Git gives you an error if a source name doesn't map to exactly one destination name.

Putting it all together

All of this history, plus the desire for backwards compatibility, produces the end result you've seen. The upstream setting of a branch name, recorded in two parts in the .git/config file, refers to:

- the named remote, or . if the upstream is a local branch instead of a remote-tracking name; and
- the branch name as seen on that remote (or when the remote is ., it's the branch name as seen locally).

But Git records that branch name, as seen on that remote, under a remote-tracking name. To compute the remote-tracking name—e.g., origin/xyzzy—from the branch.xyzzy.merge setting, Git must run the name through the refspec mapping function. That refspec mapping function is determined by the fetch lines in the remote!
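To tie this back to the two results in the question, here is a rough model of the reverse direction: given the local name origin/main (that is, refs/remotes/origin/main), which name on the remote would the fetch line have stored there? merge_for is a hypothetical helper that applies the refspec right-to-left; it is a sketch of the idea, not Git's actual code, and it assumes one * on each side of the colon:

```shell
# merge_for REFSPEC LOCAL_REF: print the remote-side name that the fetch
# refspec would store as LOCAL_REF, i.e. apply the refspec right-to-left.
# Hypothetical helper; assumes exactly one "*" on each side of the colon.
merge_for() {
    spec=${1#+}
    src=${spec%%:*}; dst=${spec#*:}
    src_pre=${src%%"*"*}; src_suf=${src#*"*"}
    dst_pre=${dst%%"*"*}; dst_suf=${dst#*"*"}
    case "$2" in
        "$dst_pre"*"$dst_suf")
            stem=${2#"$dst_pre"}; stem=${stem%"$dst_suf"}
            echo "$src_pre$stem$src_suf" ;;
        *) return 1 ;;
    esac
}

# origin/main is short for refs/remotes/origin/main in both cases.
merge_for '+refs/heads/*:refs/remotes/origin/*' refs/remotes/origin/main
# prints: refs/heads/main
merge_for '+refs/*:refs/*'                      refs/remotes/origin/main
# prints: refs/remotes/origin/main
```

The two outputs match the two merge values reported in the question: with the normal fetch line, origin/main corresponds to the remote's branch refs/heads/main, while with the mirror fetch line it corresponds to a ref literally named refs/remotes/origin/main on the remote.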

So, if and when you change the fetch lines, you must also rename any remote-tracking names. This renaming must be nondestructive: e.g., it's OK to rename origin to fred if there's no existing remote named fred since there are no refs/remotes/fred/* names at this time, but it's not OK to rename origin to the empty string.

The git remote and git branch commands enforce enough rules to make this all work. But if you short-cut them by editing .git/config yourself, you must understand the underlying mechanisms. You won't precisely break anything (as in, Git won't stop working) but you may get bizarre and undesirable results, if you violate the normal mechanisms here.