When I run git log --all, I see one interesting commit in the log:
commit 3a1a6bfbd936ea441ecf1f071e82f89c7e8bbf6c (replaced, origin/main)
What does the replaced keyword in the parentheses mean? And how do I trigger it?
CodePudding user response:
This means someone used git replace.
What git replace does is allow you to tell future Git operations that, instead of some original object, they ought to look at some replacement object. That describes how replacement works, but it does not tell you what this all means. The problem is that at this level, the meaning doesn't exist yet. It's like saying that neutron capture causes the U-235 nucleus to fission into two lighter-weight nuclei, emitting two neutrons. True, but so what? Well, so, nuclear reactor or atomic bomb. We've gone from dry nuclear physics to serious consequences.
Git replacements are not quite so dramatic, fortunately. But a simple replacement can have huge consequences. The consequences it will have, in your repository, are not something we can determine in advance. All we can do is describe the idea behind replacements.
The idea behind replacements
Any Git object, once made, is read-only, and continues to exist in the repository as long as someone / something is using it. The reason for this read-only quality is that each object is found (or addressed, to use a fancy term) by its hash ID, in a key-value database whose keys are hash IDs and whose values are the hashed object. When Git extracts the object from the database, Git re-computes the hash, and verifies that the retrieved object's hash matches the key used to retrieve the object. This guarantees that the object data are not corrupt.[1]
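As a rough illustration of that key-value idea, here is a small sketch. The throwaway directory and the hello content are my own choices for the demo, not anything from your repository.

```shell
set -eu
cd "$(mktemp -d)"          # scratch directory, assumed disposable
git init -q

# Store some content; Git prints the hash ID (the key) it filed it under.
key=$(printf 'hello\n' | git hash-object -w --stdin)

# Look the object up by its key and get the content back.
content=$(git cat-file -p "$key")

# Re-hashing the retrieved content gives the same key again.
rehash=$(printf '%s\n' "$content" | git hash-object --stdin)

echo "key:    $key"
echo "rehash: $rehash"
```

Change a single byte of the content and the key changes, which is why an object can never be edited in place.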
If we make a mistake in a new commit that nobody else is using yet, and detect our own mistake quickly, we can correct it by replacing our original commit with a new commit. Our original commit is found only by the hash ID stored in some branch name. If we make a new replacement commit for it, with the mistake corrected, the new commit will have some other, different hash ID. We store the new replacement commit's hash ID in the branch name (which is writable) and we're done: the "bad" commit is still there, but is unused. With no one using it, Git will eventually drop it entirely.[2]
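A minimal sketch of that quick fix, using git commit --amend in a scratch repository. The demo identity, file name, and commit messages are made up for the example.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
git init -q

echo one > file.txt
git add file.txt
git commit -q -m "add file, with a typpo"
old=$(git rev-parse HEAD)

# Make the corrected commit; the branch name now holds its hash ID.
git commit -q --amend -m "add file"
new=$(git rev-parse HEAD)

echo "old: $old"
echo "new: $new"

# The "bad" commit is still in the object database, just unused.
git cat-file -t "$old"
```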
That's fine for a new commit, whose hash ID is stored only in a single branch name. But what if the commit isn't so new? In particular, commit hash IDs get stored in later commits. If this "bad" commit is part of a commit chain, we have a problem.
Remember that commits form backwards-looking chains, found by a branch name that points to what Git calls the tip commit: the last commit in the chain. That is, given some series of commits, each with its own hash ID, we might draw them by using single uppercase letters to stand in for the hash IDs:
... <-F <-G <-H <--main
The name main points to the tip commit, whose hash is H. That commit points backwards to earlier commit G. Commit G points backwards to earlier commit F, and so on.
If there's a fault in commit F, we could try to do what git commit --amend does: make a new and improved F' and shove F up out of the way:
     F ...
    /
... <-F'
But when we do that, existing commit G (which literally contains the hash ID of existing commit F and cannot be changed) still points to F:
     F <-G <-H <--main
    /
... <-F'
Our simple attempt to amend F doesn't work, because main points, not to F, but to H. H points to G, and will do so forever. G points to F, and will do so forever. We can copy G and H to new-and-improved G' and H':
     F <-G <-H <--main
    /
... <-F' <-G' <-H'
and having made three copies, we can now re-point the branch name main:
     F <-G <-H
    /
... <-F' <-G' <-H' <--main
This is what git rebase does. But it has the drawback that every commit after F must also be copied. If there are complicated chains:
          I--J   <-- br1
         /
...--F--G--H   <-- main
      \
       K--L   <-- br2
the whole thing rapidly becomes a nightmare of history rewriting, with the need to move multiple branch names. You can do this using git filter-branch or git filter-repo, but it's painful and not something you want to do frequently. This is where git replace comes in.
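To make the copying concrete, here is a sketch of the simple single-branch case, done with git rebase --onto in a scratch repository. The commit names E through H, the file names, and the "F (fixed)" message are placeholders of mine.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main   # make sure the branch is called main

for name in E F G H; do
    echo "$name" > "$name.txt"
    git add "$name.txt"
    git commit -q -m "$name"
done
F=$(git rev-parse main~2)
G=$(git rev-parse main~1)
H=$(git rev-parse main)

# Build F' next to F: same parent, same tree, corrected message.
git checkout -q --detach "$F"
git commit -q --amend -m "F (fixed)"
Fnew=$(git rev-parse HEAD)

# Copy everything after F (here G and H) onto F', then re-point main.
git rebase -q --onto "$Fnew" "$F" main

git log --format='%h %s' main
```

G and H come out with new hash IDs even though their content did not change, because each copy records a different parent.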
[1] If the key used to retrieve the object does not match the hash of the retrieved object, something happened to the data since they were originally written. The hash function is of no help in correcting the erroneous data, so at this point we're stuck with finding a good copy, presumably in another clone or a backup. That's why disk drives use, e.g., Reed-Solomon codes rather than cryptographic checksums. Git's job here is only to find corruption, not to fix it.
[2] This "eventually" is a maintenance operation. The newfangled git maintenance command can be used to tune this stuff (that's the future direction for Git), but the actual dropping is done via git gc or git gc --auto, in existing Git usage. That works as follows:
- git gc runs git reflog expire.
- git reflog expire scans reflogs, which contain reflog entries.
- The reflog entries each have a date-and-time stamp, and a status ("reachable" or "unreachable") implied by the current hash ID stored in the corresponding ref.
- The status leads git reflog expire to one of two "expiry" values: reachable, for commits reachable from the current ref value, and unreachable, for commits not reachable this way.
- If the age of the entry exceeds the expiry value (30 days for "unreachable", by default), the reflog entry is deleted.
This drops the last actual reference to the internal Git commit object, which can now be deleted via git prune, which git gc runs after git reflog expire. So, running git commit --amend right after git commit pushes the "amended" commit off to the side, where it lingers for a minimum of 30 days thanks to reflog entries: one in the HEAD reflog and one in the branch reflog. Once the reflog entries are gone, there really is no reference to the commit, and git prune will prune it.
Replacements
The mechanism Git uses for replacements is simple. There's a relatively low level routine in Git to obtain an object from the objects database—that key-value store I mentioned earlier, where the keys are hash IDs and the values are objects. You give the key to the database lookup code and it fishes out the value.
Now, if you allow replacements (there are control knobs for this, at this level), then when you call the "get me an object, I have its hash ID" function, the lookup function will check to see if the object's hash ID exists as a name in the refs/replace/ namespace.
So: we can make a replacement commit F' that is a new and improved version of F. This commit has a hash ID, once we've written it to the object database. Let's say F had hash ID aaaaaaa, and F' has hash ID bbbbbbb (I've shortened them from 40 characters to 7 to make them easier to deal with, and real hash IDs are of course random looking).
We now store the hash ID bbbbbbb under the name refs/replace/aaaaaaa. That is, the hash ID of commit F, whatever it is, becomes a refs/replace/ name. In that name we store the hash ID of the replacement commit, here bbbbbbb.
When some other piece of Git software calls the "look up object" function with hash ID aaaaaaa, the lookup function notices that refs/replace/aaaaaaa exists. It reads the hash ID stored in refs/replace/aaaaaaa and, instead of looking up (and error-checking) aaaaaaa, it looks up (and error-checks) bbbbbbb. It then returns the replacement object's content, instead of the original object's content.
This means that when git log or git checkout or any other Git command goes to use commit F, it gets commit F' instead. Hence we've successfully replaced commit F without actually changing commit F.[3] The git log command in particular makes sure to notice that this happened (the lookup routine will set a flag for git log to see) and adds the replaced notation that you saw.
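As for how to trigger it: the following is a sketch, in a scratch repository, of one way to end up with a replaced commit. The commit names E through H and the "F (fixed)" message are placeholders of mine, and the real hash IDs will of course not be aaaaaaa and bbbbbbb.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main

for name in E F G H; do
    echo "$name" > "$name.txt"
    git add "$name.txt"
    git commit -q -m "$name"
done
F=$(git rev-parse main~2)

# Build F' next to F: same parent, same tree, corrected message.
git checkout -q --detach "$F"
git commit -q --amend -m "F (fixed)"
Fnew=$(git rev-parse HEAD)
git checkout -q main

# Record "when asked for F, use F' instead".
git replace "$F" "$Fnew"

# The ref is named after F's hash ID and holds the hash ID of F'.
git for-each-ref refs/replace/

# Asking for F by its original hash ID now yields the content of F'.
git log -1 --format=%s "$F"

# The entry for F should carry the replaced decoration here.
git log --oneline --decorate
```

Note that G and H keep their original hash IDs: nothing after F had to be copied.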
[3] Note that this makes git gc and git prune have to work harder, because object F is still referenced "for real", while F' is referenced via the refs/replace/ name. Fortunately it suffices for git gc to run with replacements disabled.
Seeing reality, and why that matters
If you want to see what's really in the database, without replacements, you can run git --no-replace-objects log. This will make git log call the "get an object" function with replacements disabled. You'll see the original history, not the replaced one.
To view the replacement objects, use git replace --list (or git replace with no arguments, which means --list), or in software, git for-each-ref refs/replace.
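Here is a short sketch comparing the two views and listing the replacement, again in a scratch repository with placeholder commits F and G of my own.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
git init -q

echo f > f.txt
git add f.txt
git commit -q -m "F"
F=$(git rev-parse HEAD)
git commit -q --amend -m "F (fixed)"
Fnew=$(git rev-parse HEAD)
git reset -q --soft "$F"        # put the branch back on the original F

echo g > g.txt
git add g.txt
git commit -q -m "G"
git replace "$F" "$Fnew"

echo "with replacements:"
git log --format=%s

echo "reality:"
git --no-replace-objects log --format=%s

echo "replaced objects:"
git replace --list
```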
Note that when you clone a repository, the cloning process normally does not copy the refs/replace/ namespace. Using git push also does not copy refs/replace/ names by default. So when you use git replace to construct illusory history in your repository, this only affects your repository.
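A sketch of that locality, with two scratch repositories whose names (origin and clone) and commit messages are invented for the demo. The explicit refspec on the last fetch is one way to carry the refs across on purpose; treat it as an illustration rather than a recommended workflow.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"

echo one > file.txt
git add file.txt
git commit -q -m "original"
old=$(git rev-parse HEAD)
git commit -q --amend -m "replacement"
new=$(git rev-parse HEAD)
git reset -q --soft "$old"      # branch points at the original again
git replace "$old" "$new"

# A plain clone gets the branch, but not refs/replace/.
git clone -q "$work/origin" "$work/clone"
before=$(git -C "$work/clone" for-each-ref --format='%(refname)' refs/replace/)
echo "after clone: [$before]"

# Asking for the namespace explicitly brings the replacement over.
git -C "$work/clone" fetch -q origin 'refs/replace/*:refs/replace/*'
after=$(git -C "$work/clone" for-each-ref --format='%(refname)' refs/replace/)
echo "after fetch: [$after]"
```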
You can replace non-commit objects too. Because replacement is such a low level operation, you can use it for various interesting effects. It's always local though, unless you take special action to get refs/replace/ references into another repository too.
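For instance, here is a sketch that replaces a blob rather than a commit. The file name note.txt and its contents are made up for the example.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
git init -q

printf 'secret\n' > note.txt
git add note.txt
git commit -q -m "add note"

# Hash ID of the committed blob, and of a new blob to stand in for it.
old=$(git rev-parse HEAD:note.txt)
new=$(printf 'redacted\n' | git hash-object -w --stdin)
git replace "$old" "$new"

# Same commit, same tree, same blob hash ID, different content returned.
git cat-file -p HEAD:note.txt

# With replacements disabled, the original content is still there.
git --no-replace-objects cat-file -p HEAD:note.txt
```

The file already checked out in the working tree is not touched by this; only later reads from the object database see the stand-in.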
Note that git filter-branch and git filter-repo will make the new repository with replacements honored (though git --no-replace-objects filter-branch won't, and presumably there are similar things with filter-repo). So one use for git replace is to edit history until you have it looking the way you want others to see it. You then run an otherwise no-op filter operation, which "cements the new history in place" as it were, without requiring replacements (they're now embedded, and the originals are just gone). You then publish this new, different repository instead of the original.