I have forked the obs-studio repo. Recently a submodule was updated: obs-vst went from 937ba79 to e6a59b3. I used the method described here to update my fork to the latest version (by merging with an upstream remote).
After the merge I pushed the changes and I can see in GitHub that my fork points to the latest version of the submodule.
However, if I run git status locally, it shows me that a file in the submodule is modified. This file is actually the only change between the old version and the new one. Also, a git log inside the submodule shows it's still pointing to the old commit.
I am really confused. How come that GitHub shows the good, updated version while locally I still have to commit the change that brings me to the new version?
- Do I have to commit this change?
- Where is this commit "going"? The submodule is either the old version or the new one. How can it have pending changes showing in my repo? Wouldn't committing them create a side history of the module?
- I want the exact version that the upstream repo uses. I used git submodule update but I have a hunch that this actually merges into the submodule's master branch. What should I do to sync the submodules (even multiple of them at once)?
CodePudding user response:
Submodules are inherently confusing. Well, let's put this a bit more strongly: Git is inherently confusing, and submodules make it significantly worse.
There are multiple ways in which Git is confusing, but I find that one of the bigger ones for many people is the idea of distributed version control, in which there are multiple copies of any one repository. If you get off to the right start, this is not so bad, but many people don't get off to the right start. You need to know the following dozen items just to get started:
A Git repository is primarily a collection of commits. Git is not about files (though commits contain files), and is not about branches (though branch names help us, and Git, find the commits). Git is about the commits. There are other supporting objects as well, but commits are the main ones of interest.
Commits (and other objects) are numbered. Each commit gets a unique number. These numbered entities hold, as their content, snapshots and metadata; these contents are frozen for all time. While the numbers are unique for each commit, every Git repository in the universe magically agrees that these are the right numbers, even if they have not seen the numbers and the commits yet.
The metadata of any one commit holds a list of the numbers of previous commits. So given one commit, Git can work backwards to its parent commit or commits. Conventionally—this convention can only be broken in carefully limited ways under particular circumstances that we must initially ignore—the existence of a commit in the database implies that the listed parents are also present in the database.
This means that given a starting (ending?) point, and a database full of commits, Git can trace its way from that starting (ending?) point backwards through all of the history that is reachable from this point, all the way to the beginning of time (the ending or starting point, depending on whether you think backwards like Git does, or forwards like people do). Because the linkage inside commits is purely backwards, we may need multiple starting/ending points to find all the commits. That is, we have a directed acyclic graph or DAG and to reach everything, we need entry points into the DAG.
A Git repository therefore has two databases. One, indexed by number (hash ID), finds commits and other supporting objects. The other, which uses names—branch names, tag names, and all kinds of other names—finds our starting points: entries into the DAG. These allow the repository to find all the findable (i.e., reachable) commits. The very long and slow process of starting at all these starting points, working backwards through time from later to earlier commits, and visiting every reachable commit—enumerating the graph—determines what's valid data. Anything not reachable this way is junk that can eventually be discarded.
Each repository is independent of all other repositories. Adding commits to one repository does not add them to other repositories. Moreover, the names in each repository—branch names, tag names, and so on—are private to that repository (although the repository's controlling software chooses to expose at least the branch and tag names for other repositories to read, and may allow certain update requests as well).
We connect repositories together with git fetch and git push. These allow one repository to see another repository's names, hash IDs, and/or objects (commits and supporting objects). This also provides a sort of "have/want" protocol, where one repository will be the sender and one the receiver. The sender will expose one or more names and hash IDs, telling the receiver that the sender has those objects. The receiver replies with either "I have that already" (which, for a commit, implies that the receiver not only has that object but also every reachable parent object as well) or "I would like that object", which, for a commit, obliges the sender to offer its parent(s) as well; the receiver gets to say "have" or "want" as before. The sender then sends all of the commits and supporting objects needed by the receiver, as implied by the have and want responses.
Here, fetch and push diverge: with git fetch the sender is now done because the receiver got the objects (commits and supporting objects) and knows the sender's branch and/or tag and/or other name(s) by which the sender found and offered those objects' hash IDs, during the initial setup. The receiver can now create or update remote-tracking names for each branch name, and/or choose whether to create tag names, and so on. For git push, however, the sending process ends with the sender delivering either a polite request, or a forceful command, asking or commanding the receiver to update some of its name(s). By this method, the person (assuming it's a human) running git push typically asks or commands the receiving Git to create or update one or more of its branch names.
A special name, HEAD, denotes the "current branch" or "default branch" in a repository. For repositories that people use to get work done (non-bare repositories), HEAD remembers the current branch name or the current commit (see item 10); for server-only repositories that exist just to be cloned, HEAD holds the preferred/default branch name. The git clone operation that people use to make a copy of a repository will run git fetch, and will receive this name from the sending Git as the name that the receiving Git should use (see item 11).
In a non-bare repository, there's also a working tree (and an index or staging area). These are where people get actual work done. They fill the index and working tree from a particular commit: one that already exists in that repository, and therefore is findable. Most often, that one commit has been found through a branch name and is the hash ID that is stored in that branch name, which we call the tip commit of the branch. As we do work and make new commits, each one gets a new unique hash ID, which Git then stuffs into the branch name; the new commit has as its parent (or for merge commits, first parent) the hash ID of the commit we had checked out while we were in the process of making the new commit, i.e., the hash ID stored in the branch name up until this final "overwrite the name" step.
Cloning a repository simply means that we follow a five or six step process where we:
- make a new empty directory;
- initialize it as a Git repository (creating two empty databases and, for normal use cases, reserving the previously-empty directory we just made as the working tree as well);
- use git remote add (normally with the name origin) to create a remote for git fetch;
- do any other configuration steps necessary (often none);
- run git fetch to populate the objects database and to take each of the original repository's branch names and turn them into remote-tracking names in the names database; and finally
- create one branch name and check out that branch to fill the working tree.
The branch name we create, in the final step here, is from the -b argument to git clone, but lacking a -b argument, it's the default branch name from the repository at origin.
After a git clone, then, we have all the original repository's commits (well, all the reachable ones) and none of their branches. We have instead one branch that is our branch name, not theirs. What were their branches are now our remote-tracking names: our Git software's way of remembering, in our names database, their branch names and corresponding hash IDs.
All the stuff you need to know for everyday Git use, i.e., how to view commits, how to use git switch or git checkout, how to use git diff, and so on.
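The point about cloning (their branches become our remote-tracking names, and we get exactly one branch of our own) can be checked in a scratch directory. This is a minimal sketch, not part of the original answer: the repository names upstream and copy, the branch names main and feature, and the demo identity are all made up for illustration.

```shell
#!/bin/sh
# Sketch: after a clone, the other repository's branches show up as
# remote-tracking names, and only one local branch is created.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)                        # hypothetical scratch area
cd "$work"

git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/main    # choose the initial branch name explicitly
echo one > file.txt
git add file.txt
git commit -q -m "first commit"
git branch feature                       # a second branch in the original repository
cd ..

git clone -q upstream copy
cd copy

# Names database of the clone: one local branch, several remote-tracking names.
local_branches=$(git for-each-ref --format='%(refname:short)' refs/heads)
remote_names=$(git for-each-ref --format='%(refname:short)' refs/remotes)
echo "local branches:        $local_branches"
echo "remote-tracking names: $remote_names"
```

The clone has the feature commits, but no branch named feature until you create one (for instance with git switch feature, which creates it from origin/feature).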
Besides having to keep all of this in your head (it's quite a juggling act initially), there's another fact that makes things hard: You can't really see a commit. Or rather, you can, but what you see depends on how you look at it. (If you're familiar with Plato's Cave, think about that here.) In particular, git show, which is a common way to look at a commit, shows a diff. But that's not what's in a commit: each commit holds a full snapshot of every file, not a set of changes since a previous commit.
What git show is doing is using the commit's metadata, which contains the parent commit's hash ID, to extract both the parent's snapshot and the commit's snapshot. As the files inside each commit are magically (via hash ID, actually) de-duplicated in advance and at all times, Git can very quickly identify all the files in the two commits that are 100% identical and toss them out of a list of "all files to compare". That leaves only the changed files that really need to be extracted in the first place. Git extracts those and runs them through comparison software: a diff command. This git diff operation, which is run on demand every time you ask to look at a commit, tells you what's different. The commit doesn't contain the difference. Rather, the git diff command, when run, computes the difference.
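To see the snapshot-versus-diff distinction for yourself, you can look at a raw commit object next to the computed diff. The following is a small sketch in a throwaway repository; the file names a.txt and b.txt and the demo identity are invented for the example.

```shell
#!/bin/sh
# Sketch: a commit stores a full snapshot; the diff is computed on demand.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)                        # hypothetical scratch repository
cd "$repo"
git init -q .
echo "hello" > a.txt
echo "unchanged" > b.txt
git add a.txt b.txt
git commit -q -m "first"
echo "hello again" > a.txt
git commit -q -a -m "second"

# The raw commit object: tree (the snapshot), parent, author, committer, message.
# There is no diff stored in it.
git cat-file -p HEAD

# The snapshot lists every file, including b.txt, which did not change.
snapshot_files=$(git ls-tree -r --name-only HEAD)

# Unchanged content is de-duplicated: both commits refer to the same blob hash ID.
blob_new=$(git rev-parse HEAD:b.txt)
blob_old=$(git rev-parse HEAD~1:b.txt)

# The difference is computed by comparing the parent and child snapshots.
changed_files=$(git diff --name-only HEAD~1 HEAD)
echo "snapshot: $snapshot_files"
echo "changed:  $changed_files"
```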
Aside from these minor issues (ahem: "Aside from that, Mrs Lincoln, how was the play?"), things often aren't too horrible. We get used to the idea of "check out / git switch, work, git add, git commit, git push" and we hardly ever even have to notice the big ugly hash IDs. As long as we're the only one working on / with the usable copy of the repository, with the occasional git push to GitHub as a sort of poor man's backup system or something, all the details of hash IDs, branch name mappings, and so on kind of fade into the background. As soon as we start using submodules, though, they come roaring back.
Worse, when we have submodules, we have multiplied the number of Git repositories involved. In your case, you now have:
- the original obs-studio repository;
- your fork of the obs-studio repository on GitHub;
- your clone of your fork of the obs-studio repository on GitHub; and
- the two repositories (your "main" clone and your GitHub "backup" repository for the "main" clone) that are acting as superprojects that hold the submodules.
That's five repositories in all, here.
The main things to know about a submodule are these:
- A submodule is a separate Git repository, with all that this implies (see the dozen items above).
- A repository becomes a submodule by virtue of being referred to in some other Git repository. The referring repository is now a superproject.
- The superproject contains, in every commit in which the submodule is used, a raw hash ID for the submodule repository.
- There's one more piece needed. A raw commit hash ID is almost everything you need—remember, the hash IDs are unique—but it's not quite good enough.
If we could use a commit hash ID (which is, after all, unique, at least in theory) to find some clone somewhere that had that hash ID, why, that would be the right clone and the right hash ID. But we can't really do that (see footnote 1). So, in order for Git to be able to run git clone on the submodule, to obtain the clone that (we hope) will hold the right commit hash ID, Git needs a valid URL to feed to git clone.
This URL goes into a file named .gitmodules that we store in every (new) commit in the superproject. This file holds a path name and a URL. Then, in each commit in the superproject, in order to refer to the submodule, we have Git store an entity that Git calls a gitlink. The gitlink goes in the pathname where the repository should be cloned and have a commit checked out, and the hash ID stored in the gitlink is the hash ID of the commit to check out.
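Both pieces, the .gitmodules file and the gitlink, can be inspected directly. Here is a rough sketch using two throwaway repositories; the names lib, super, and vendor/lib are hypothetical, and the -c protocol.file.allow=always setting is there only because this demo clones the submodule from a local path, which recent Git versions refuse by default.

```shell
#!/bin/sh
# Sketch: what a superproject commit records about a submodule.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)                        # hypothetical scratch area
cd "$work"

# A tiny repository that will serve as the submodule's origin.
git init -q lib
(cd lib && echo v1 > lib.txt && git add lib.txt && git commit -q -m "lib v1")

# The superproject, with lib added as a submodule at vendor/lib.
git init -q super
cd super
git -c protocol.file.allow=always submodule --quiet add "$work/lib" vendor/lib
git commit -q -m "add submodule"

# .gitmodules holds the path and the URL to clone from.
cat .gitmodules

# The gitlink is a tree entry with mode 160000 and type "commit";
# its hash ID is the submodule commit to check out.
gitlink_entry=$(git ls-tree HEAD vendor/lib)
echo "$gitlink_entry"

# The commit currently checked out in the submodule itself.
sub_head=$(git -C vendor/lib rev-parse HEAD)
echo "$sub_head"
```

Right after git submodule add and git commit, the two hash IDs agree. The confusion described in the question starts when they stop agreeing.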
When you're at work in the superproject, your Git software will do things like:
(mkdir -p path/to/ && git clone <url> path/to/submodule)
(cd path/to/submodule && git switch --detach <commit-hash-ID>)
where the git clone has some extra, fancier options that put the repository databases somewhere else (inside the superproject's .git directory) and the <commit-hash-ID> comes out of the gitlink. This is what git submodule update --init does; without the --init, it won't do the git clone step. (With --init, it doesn't bother with the git clone step if the repository is already there.)
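As an illustration of that clone-then-detach behavior, the sketch below makes a plain clone of a superproject (so the submodule is not yet initialized) and then runs git submodule update --init. The repository names lib, super, and mine and the path vendor/lib are made up, and protocol.file.allow=always is needed only because everything here lives on the local file system.

```shell
#!/bin/sh
# Sketch: git submodule update --init clones the submodule and checks out
# the commit recorded in the gitlink, as a detached HEAD.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)                        # hypothetical scratch area
cd "$work"
git init -q lib
(cd lib && echo v1 > lib.txt && git add lib.txt && git commit -q -m "lib v1")
git init -q super
(cd super &&
 git -c protocol.file.allow=always submodule --quiet add "$work/lib" vendor/lib &&
 git commit -q -m "add submodule")

# A plain clone of the superproject does not clone the submodule.
git clone -q super mine
cd mine
status_before=$(git submodule status vendor/lib)    # leading "-" means not initialized

git -c protocol.file.allow=always submodule --quiet update --init

recorded=$(git rev-parse HEAD:vendor/lib)           # hash ID stored in the gitlink
checked_out=$(git -C vendor/lib rev-parse HEAD)     # hash ID checked out in the submodule

# symbolic-ref fails when HEAD is detached, i.e., not attached to a branch name.
if git -C vendor/lib symbolic-ref -q HEAD >/dev/null; then
    head_state=on-branch
else
    head_state=detached
fi
echo "recorded=$recorded checked_out=$checked_out state=$head_state"
```

Note that nothing here merges into any branch of the submodule: the default update mode checks out the recorded commit and leaves the submodule's branch names alone.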
To update a submodule, we must:
Forget about the superproject for a while. It's a separate Git repository and it has nothing to do with the existence of commits in the submodule Git repository, after all.
Work with the submodule: make new commits, or obtain those commits from somewhere, or whatever. Make sure the submodule repository (and working tree) have the right commit checked out! If we want new commits to go on a branch name, make a branch name! If this submodule is cloned from somewhere, make sure any new commits we make are git push-ed if necessary, so that other people will be able to git clone and git checkout this particular commit. Remember that the superproject is going to put the submodule in detached HEAD mode a lot, and we'll have to compensate.
Now that we have the right commit selected in the submodule, return to the superproject and make a new commit in the superproject so that the superproject repository's current commit records the correct gitlink (i.e., hash ID) for the submodule. Run git add path/to/submodule to update the gitlink in Git's index; run any other git adds needed for the superproject as usual; and run git commit to make the new commit.
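The steps above can be sketched end to end in a scratch area. This is only an illustration under assumed names (lib as the submodule's origin, super as the superproject, vendor/lib as the submodule path); in a real project the new submodule commit would usually come from the submodule's actual remote rather than a local directory.

```shell
#!/bin/sh
# Sketch: select a new commit in the submodule, then record it in the superproject.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)                        # hypothetical scratch area
cd "$work"
git init -q lib
(cd lib && echo v1 > lib.txt && git add lib.txt && git commit -q -m "lib v1")
git init -q super
(cd super &&
 git -c protocol.file.allow=always submodule --quiet add "$work/lib" vendor/lib &&
 git commit -q -m "add submodule at v1")

# Somebody publishes a new commit in the submodule's origin.
(cd lib && echo v2 > lib.txt && git commit -q -a -m "lib v2")
new=$(git -C lib rev-parse HEAD)

cd super
# Work in the submodule: obtain the new commit and check it out.
git -C vendor/lib fetch -q origin
git -C vendor/lib checkout -q --detach "$new"

# The superproject now sees the submodule as modified (gitlink differs from HEAD).
git status --short

# Back in the superproject: update the gitlink in the index and commit it.
git add vendor/lib
git commit -q -m "update vendor/lib to v2"
recorded=$(git rev-parse HEAD:vendor/lib)
echo "recorded gitlink: $recorded"
```

The "modified" entry that git status reports for the submodule path is not a file change inside the superproject. It means the commit checked out in the submodule differs from the one the gitlink records.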
We have probably now created at least two new commits, one in one of the clones of the submodule, and one in one of the clones of the superproject. We must now carefully distribute all of these commits to all other clones that need them: there may be five such clones (the five we've enumerated above) or more.
In your particular case it seems that you have updated your fork already, which is the submodule repository you want to use, or maybe it's the superproject you want to use—you haven't really said. We don't know precisely how you updated your fork though. The instructions on GitHub talk about using another clone, and you don't show the actual commands you used, so we don't know which other clone(s) you used. What we do know is that there must be at least four or five repositories involved here.
You now need to update some more repositories:
Remember that each submodule repository will generally be in "detached HEAD" state, i.e., not on any branch. To bring an update in with git fetch, the remote repository for that submodule has to have the new commit, and the new commit is probably reachable via some branch name in the remote, so that git fetch will obtain it, but if it's not reachable from a branch name in the remote, you might have to do something about that.
Your superproject repository eventually needs to be told to commit an updated gitlink. To do that, make sure the submodule has the desired commit (by fetching if needed) and then enter the submodule repository's working tree and check out the desired commit if necessary, so that git rev-parse HEAD run in the submodule spits out the desired hash ID. Then return to the superproject and git add the path to the submodule, to record a new gitlink in the index for the working tree for the superproject clone. Then use git commit to make a new commit in the superproject, in the same way that one always makes commits in repositories.
To see the gitlink in the index and/or the commit, use git diff, git submodule status, git submodule summary, and other such commands in the superproject. To see the raw hash ID in the submodule, use:
(cd path/to/submodule && git rev-parse HEAD)
because whatever commit hash ID HEAD says when git rev-parse HEAD is run in the submodule, that is the commit that is actually checked out right now in the submodule. That's the hash ID that git add, run in the superproject, will record in the gitlink entry in the index in the superproject.
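The situation in the question (the superproject already records the new submodule commit, while the local submodule still has the old one checked out) can be reproduced and resolved with a sketch like the following. All names here (lib, super, mine, vendor/lib) are hypothetical stand-ins, with super playing the role of the upstream superproject and mine the local clone, and protocol.file.allow=always is only needed because the demo uses local paths.

```shell
#!/bin/sh
# Sketch: compare the recorded gitlink with the checked-out submodule commit,
# then bring the submodule to exactly what the superproject records.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)                        # hypothetical scratch area
cd "$work"
git init -q lib
(cd lib && echo v1 > lib.txt && git add lib.txt && git commit -q -m "lib v1")
git init -q super
(cd super &&
 git -c protocol.file.allow=always submodule --quiet add "$work/lib" vendor/lib &&
 git commit -q -m "add submodule at v1")
git -c protocol.file.allow=always clone -q --recurse-submodules super mine

# Upstream moves the submodule to a new commit and records the new gitlink.
(cd lib && echo v2 > lib.txt && git commit -q -a -m "lib v2")
new=$(git -C lib rev-parse HEAD)
git -C super/vendor/lib fetch -q origin
git -C super/vendor/lib checkout -q --detach "$new"
(cd super && git add vendor/lib && git commit -q -m "move submodule to v2")

# Locally: bring in the superproject commit. The submodule's checkout does not move.
cd mine
git -c protocol.file.allow=always pull -q --ff-only
recorded=$(git rev-parse HEAD:vendor/lib)
checked_out_before=$(git -C vendor/lib rev-parse HEAD)
status_before=$(git submodule status vendor/lib)    # leading "+" means a mismatch
echo "$status_before"

# Check out, in every submodule, exactly the commit the superproject records.
git -c protocol.file.allow=always submodule --quiet update --init --recursive
checked_out_after=$(git -C vendor/lib rev-parse HEAD)
echo "recorded=$recorded now=$checked_out_after"
```

In this state there is nothing to commit in the superproject: running git add on the submodule path before updating would record the old commit again, which is the opposite of what you want.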
Keeping all of this straight is kind of a nightmare. There is a reason people refer to these as sob-modules. There are git submodule commands that try to hide this complexity from you, but they mostly fail to hide it, so learn how it works so that you can see what's really going on.
1. We can sort of do this with Google, and this is the idea behind things like IPFS, but for Git hash IDs in particular it doesn't work very well. Git hash IDs are allowed to be non-unique across repositories that never meet, and Google's indexing of commit hash IDs is not complete, for instance (and there are potential security issues as well).