Do I need to manually commit/push Git submodule SHA while I am working inside the main repository?

Time:08-02

Imagine there is a remote repository:

my_project/
    my_git_submodule/  # old SHA, not updated since I added the submodule
        some_common_files/
    src/
    tests/

Where my_git_submodule is a Git submodule. I added it a year ago, and when I browse the files in the remote repository, I see that the Git submodule refers to a very old SHA that points to year-old changes.

When I am working on my_project repository, I regularly call git submodule update --remote --init --recursive to have up-to-date files locally.

Do I need to manually commit/push a fresh SHA for my Git submodule while I am working inside my main repository?

For example, say I've updated some code in my_project and am about to push those updates: should I also commit the new SHA? Or must I not touch the Git submodule SHA manually?

Answer:

TL;DR

Yes.

(Or, slightly less TL;DR: both no and yes.)

Long, with details

First, just to get this out of the way: "remote repository" isn't terribly well defined in Git. All repositories are local. Presumably you mean "a repository that is local to a remote host", i.e., a repository that's not on your own laptop but rather on some other machine (e.g., some server to which you have some sort of access, perhaps very limited access).

Next: a submodule, in Git, is composed of two parts:

  • You have a repository. Your repository contains commits. At least one of the commits in your repository contains something called a gitlink: this is a reference to a commit in some additional repository.

    The existence of this gitlink is what causes your repository to be a superproject. (More precisely, your repository acts as a superproject any time you check out a commit that contains a gitlink. We then over-generalize this a bit to say that your repository is a superproject, even though it only acts like one when using one or more gitlinks.)

    Each gitlink is stored in a commit, as if it were an ordinary file. This means it has a file path, such as path/to/module. In your case the path is simply my_git_submodule. The "contents" (in quotes because they're stored in a peculiar way) of the gitlink is the raw hash ID of some commit. This is not a commit in the superproject. It is a commit in some other Git repository.

  • The other part of a submodule, which is only needed early on but should be provided at all times, is contained in a file named .gitmodules that should be stored in every commit in which the gitlink is also stored. (Remember, commits store files, or in this one special case a gitlink, so that the commit's files are a full snapshot of everything, or in this case a full snapshot of all the files plus the one hash ID for the one submodule.)

    What's in that .gitmodules file are the instructions that Git will need, so that Git can run:

    git clone <url>
    

    to create an additional repository. This additional repository (which is secretly squirreled away inside a subdirectory of the hidden .git folder of the superproject, i.e., in .git/modules/... somewhere) should contain the commit called for by the gitlink.
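
Both parts can be inspected in a throwaway setup. The following is a minimal sketch, assuming a scratch directory and a hypothetical local repository named lib standing in for the real submodule URL; only the path my_git_submodule comes from the question. The protocol.file.allow override is there solely because the demo clones from a local path, which recent Git versions refuse for submodules by default.

```shell
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the shared repository (hypothetical name: lib).
git init -q -b main "$work/lib"
echo common > "$work/lib/common.txt"
git -C "$work/lib" add common.txt
git -C "$work/lib" commit -q -m "lib: initial"

# The superproject. "git submodule add" writes .gitmodules and stages a gitlink.
git init -q -b main "$work/my_project"
cd "$work/my_project"
# protocol.file.allow=always is needed only because the URL is a local path.
git -c protocol.file.allow=always submodule --quiet add "$work/lib" my_git_submodule
git commit -q -m "add my_git_submodule"

# Part 1: the gitlink, a tree entry with mode 160000 and type "commit".
gitlink_entry=$(git ls-tree HEAD my_git_submodule)
echo "$gitlink_entry"

# Part 2: the clone instructions recorded in .gitmodules.
recorded_url=$(git config -f .gitmodules submodule.my_git_submodule.url)
echo "$recorded_url"
```

The first line printed is the gitlink as stored in the commit; the second is the URL that Git would hand to git clone.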

When you run:

git submodule update --init

in your superproject—we'll leave off both the --recursive and the --remote on purpose here, for descriptive purposes—your Git software notices the gitlink, which has been copied into the index aka staging area for the superproject repository as part of a git checkout of a commit. It then checks to see if the submodule repository is cloned yet. Let's assume for the moment that it is not.

With --init, since the submodule repository isn't cloned yet, your superproject Git software will run git clone, and then register its presence for future submodule operations. Without --init, your superproject Git would just complain that the clone isn't there. Now that the clone exists, it can go on to run the update action.

On the other hand, if the submodule repository has already been cloned (and is thus registered), then git submodule update doesn't care if you use --init or not. It just activates the update action.
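
The before-and-after effect of --init on a fresh clone can be sketched as follows. This assumes a scratch directory and hypothetical local repositories (lib, my_project, fresh_clone); the protocol.file.allow override is needed only because the demo URLs are local paths.

```shell
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
allow=protocol.file.allow=always   # needed only for local-path submodule URLs

# Hypothetical submodule source and superproject.
git init -q -b main "$work/lib"
echo common > "$work/lib/common.txt"
git -C "$work/lib" add common.txt
git -C "$work/lib" commit -q -m "lib: initial"
git init -q -b main "$work/my_project"
git -C "$work/my_project" -c "$allow" submodule --quiet add "$work/lib" my_git_submodule
git -C "$work/my_project" commit -q -m "add my_git_submodule"

# A fresh clone has the gitlink and an empty directory, but no submodule clone yet.
git clone -q "$work/my_project" "$work/fresh_clone"
cd "$work/fresh_clone"
before=$(git submodule status)   # leading "-": submodule not initialized
echo "before: $before"

# --init clones and registers the submodule, then the update action checks out
# the commit named by the gitlink.
git -c "$allow" submodule --quiet update --init
after=$(git submodule status)    # leading space: checked out at the recorded commit
echo "after:  $after"
```

Running git submodule update a second time, with or without --init, would leave the status unchanged, since the submodule is already cloned and at the recorded commit.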

Gitlinks are like regular files (except that you can't see them)

As with any file, once you've added something to a commit, a gitlink "sticks around". That is, suppose you start with a new, totally-empty repository and create one initial commit with a README file:

$ mkdir new-repo && cd new-repo && git init
[messages about initializing new repo]
$ echo example > README
$ git add README
$ git commit -m initial
[main (root-commit) a123456] initial

Each new commit you add from now on has a full snapshot of every file, including this initial README. That only changes if you update the README and run git add, or run git rm README, or some other action that replaces the index / staging-area copy of README.

Every time you clone the repository and then check out some commit, Git fills in both your new clone's working tree and its own index / staging-area from the commit you checked-out. You then make any changes you want, run git add to copy the changes back into Git's index, and then run git commit to make a new commit from whatever is in the index.

This same logic applies to gitlinks. If you clone a repository that has a gitlink in it, and check out the commit with the gitlink, the gitlink goes into Git's index. For instance, the Git repository for Git itself contains a gitlink named sha1collisiondetection. If you clone the repository:

git clone https://github.com/git/git.git

you'll get a checkout, and in that checkout as it appears in Git's index, there's a sha1collisiondetection "file".

If you look at the checkout, you won't see a file, but you will see an empty directory (the git checkout step made a place to put the checkout of the submodule; see also this trick for storing an empty directory in a Git repository via the "empty submodule"). Unless and until you run git submodule update --init, Git won't even clone the repository, though you can see what it would clone if you did:

$ cat .gitmodules
[submodule "sha1collisiondetection"]
        path = sha1collisiondetection
        url = https://github.com/cr-marcstevens/sha1collisiondetection.git
        branch = master

and if you view the URL on GitHub and scroll down a bit, you'll see a folder icon with a white arrow in it labeled sha1collisiondetection @ 855827c (at least at the moment: the abbreviated hash ID here might change in some future commit). Or you can do this in your clone of the Git repository for Git:

$ git rev-parse :sha1collisiondetection
855827c583bc30645ba427885caa40c5b81764d2

That's the full hash ID, stored in the gitlink.

So, you can't see the gitlink directly, at least not normally. It's sort of vaguely implied by certain things:

  • there's an empty directory; and
  • there's that folder-with-white-arrow in the GitHub web page view

and you can see it by using submodule-oriented commands, including git submodule status and (sometimes, but not at this time) git submodule summary:

$ git submodule status
-855827c583bc30645ba427885caa40c5b81764d2 sha1collisiondetection
$ git submodule summary
$

Finally, you can inspect the index directly using git ls-files --stage:

$ git ls-files --stage sha1collisiondetection
160000 855827c583bc30645ba427885caa40c5b81764d2 0       sha1collisiondetection

The mode here is 160000, which is what says that this is a gitlink. The hash ID is that of the commit that would be checked out, the staging number is zero as usual, and the path name in this case is sha1collisiondetection.

Since the above is in Git's index, if I made a new commit right now, the new commit would contain this gitlink. It doesn't matter that I have not cloned the repository: all that matters is that this gitlink entry exists in the index. It will remain there until I tell Git to do something about that.

git submodule update --remote

What git submodule update does is:

  • read through Git's index to find all your gitlinks;
  • for the "active" submodules (those already cloned and registered as such), read the raw hash ID, enter the submodule in question—the working tree may or may not be populated yet but the directory should exist—and run git switch --detach hash-ID.

Adding --init makes Git clone and activate the submodule first if necessary. Note that Git still does a detached-HEAD style checkout in the submodule!

Adding --remote changes the final git switch --detach command in one very specific way:

  • the superproject Git enters the submodule repository (not the working tree, but rather the repository itself, hidden away in .git/modules/ in the superproject) and runs git fetch;
  • the git fetch brings over new commits for that repository (for the submodule), from origin, and updates remote-tracking names like origin/main or origin/master or whatever; and then
  • the superproject Git reads the updated name for whichever branch is registered as the submodule's branch.

This is one of the rare places where the branch for the submodule actually has any meaning. The --remote operation does this git fetch to update the remote-tracking names and that updates the submodule's origin/main or origin/master or whatever. Now that this is updated, the superproject Git gets that hash ID, whatever it is, and uses it for the git switch --detach operation in the submodule working tree.

What this does is put the submodule itself on a different detached HEAD (if the remote-tracking name's hash ID is different, that is). Running git submodule status or git submodule summary in the superproject will now show you what's different between what the superproject's index calls for (855827c... perhaps), and what is actually checked out now (a123456 maybe?).
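
The mismatch that --remote creates can be reproduced in a scratch setup. This is a sketch under assumptions: lib is a hypothetical local stand-in for the submodule's upstream, and the -b main option is used so that .gitmodules records a branch for --remote to consult.

```shell
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
allow=protocol.file.allow=always   # needed only for local-path submodule URLs

git init -q -b main "$work/lib"
echo v1 > "$work/lib/common.txt"
git -C "$work/lib" add common.txt
git -C "$work/lib" commit -q -m "lib: v1"

git init -q -b main "$work/my_project"
cd "$work/my_project"
# -b main records "branch = main" in .gitmodules, the name that --remote consults.
git -c "$allow" submodule --quiet add -b main "$work/lib" my_git_submodule
git commit -q -m "add my_git_submodule"

# Upstream moves on after the gitlink was recorded.
echo v2 > "$work/lib/common.txt"
git -C "$work/lib" commit -q -am "lib: v2"

# Fetch in the submodule, then detach its HEAD at the updated origin/main.
git -c "$allow" submodule --quiet update --remote

recorded=$(git rev-parse :my_git_submodule)             # what the index still calls for
checked_out=$(git -C my_git_submodule rev-parse HEAD)   # what is checked out now
echo "recorded:    $recorded"
echo "checked out: $checked_out"
sub_status=$(git submodule status)                      # leading "+" marks the mismatch
echo "$sub_status"
```

Nothing in the superproject has been staged or committed at this point; only the submodule's working tree has moved.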

Using git add to update a gitlink

Now that the submodule itself is on some other commit, you'll generally want to update the gitlink in your superproject repository's index. To do this, you do the same thing you'd do with any updated file: you run git add. So:

git add sha1collisiondetection

or in your case:

git add my_git_submodule

The superproject Git notices that this is a submodule, enters the submodule, runs git rev-parse HEAD to find the correct hash ID, and updates the superproject's index so that the recorded gitlink, ready to go into the next commit, records the hash ID that you checked out in the submodule and now have as its HEAD (current) commit.

This also works if you enter the submodule and check out a branch by name and do work as usual and make a new commit. The current commit's hash ID is the one that goes into the updated gitlink, even if that's a new commit you just made. Remember to git push this new commit from your submodule clone, so that others will be able to clone it from GitHub or wherever!
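
The commit-inside-the-submodule case can be sketched like this, assuming a scratch directory and a hypothetical bare repository lib.git standing in for the hosted submodule (bare so that the push is accepted). The seed repository exists only to give lib.git its first commit.

```shell
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Bare stand-in for the hosted submodule repository (hypothetical name: lib.git).
git init -q -b main "$work/seed"
echo v1 > "$work/seed/common.txt"
git -C "$work/seed" add common.txt
git -C "$work/seed" commit -q -m "lib: v1"
git clone -q --bare "$work/seed" "$work/lib.git"

git init -q -b main "$work/my_project"
cd "$work/my_project"
# protocol.file.allow=always is needed only because the URL is a local path.
git -c protocol.file.allow=always submodule --quiet add "$work/lib.git" my_git_submodule
git commit -q -m "add my_git_submodule"

# Work inside the submodule on its branch and make a new commit there.
echo fix >> my_git_submodule/common.txt
git -C my_git_submodule commit -q -am "lib: fix"
# Publish it, so other clones of the superproject can fetch the commit.
git -C my_git_submodule push -q origin main

# Back in the superproject: stage the new gitlink exactly like a changed file.
git add my_git_submodule
staged=$(git rev-parse :my_git_submodule)
git commit -q -m "bump my_git_submodule"
committed=$(git rev-parse HEAD:my_git_submodule)
sub_head=$(git -C my_git_submodule rev-parse HEAD)
echo "staged=$staged"
echo "committed=$committed"
echo "submodule HEAD=$sub_head"
```

All three hash IDs printed at the end should be identical: the gitlink simply records whatever the submodule's HEAD was when git add ran.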

(Historical note: In the bad old days of Git 1.5 or so, you had to be very careful here not to git add my_git_submodule/. If you did that, Git added all the files in the submodule into the superproject's index. Git no longer does that, but when I was working with such things back in the late 2000s, we had a few terrible accidents. Sob-modules: they made you cry.)

Conclusion

From the question: "When I am working on my_project repository, I regularly call git submodule update --remote --init --recursive to have up-to-date files locally."

The --init is only needed once (or once after a fresh clone). The --remote does the action mentioned above: a git fetch and git checkout in the submodule. The --recursive is useful if the submodule is itself a superproject for additional submodules: it just makes the submodule itself repeat the action you've just taken (i.e., git submodule update --remote --init --recursive).

If that's moved my_git_submodule to another commit and you wish to record this new commit so that other clones can use the recorded hash ID (rather than forcing other clones to do their own git submodule update --remote), you'll need to git add my_git_submodule at this point, and eventually, commit that. This "ties everything together" so that the superproject clone specifies the correct commit in each submodule (which, if it's a superproject, presumably specifies the correct commit in its submodules, and so on).
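
Put together, the whole loop might look like the sketch below. Everything except the path my_git_submodule and the update command from the question is an assumption made for the demo: the scratch directory, the local stand-in repositories (lib, seed, my_project.git), the clone names mine and teammate, and the commit message. The git diff --quiet check is one possible way to skip the commit when the submodule did not move.

```shell
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
allow=protocol.file.allow=always   # needed only for local-path submodule URLs

# Stand-ins for the two hosted repositories (hypothetical names).
git init -q -b main "$work/lib"
echo v1 > "$work/lib/common.txt"
git -C "$work/lib" add common.txt
git -C "$work/lib" commit -q -m "lib: v1"
git init -q -b main "$work/seed"
git -C "$work/seed" -c "$allow" submodule --quiet add -b main "$work/lib" my_git_submodule
git -C "$work/seed" commit -q -m "add my_git_submodule"
git clone -q --bare "$work/seed" "$work/my_project.git"

# Your clone, set up before the submodule's upstream moves on.
git clone -q "$work/my_project.git" "$work/mine"
git -C "$work/mine" -c "$allow" submodule --quiet update --init
echo v2 > "$work/lib/common.txt"
git -C "$work/lib" commit -q -am "lib: v2"

# The routine from the question, followed by recording the result if it moved.
cd "$work/mine"
git -c "$allow" submodule --quiet update --remote --init --recursive
if ! git diff --quiet -- my_git_submodule; then
    git add my_git_submodule
    git commit -q -m "Update my_git_submodule to latest main"
    git push -q origin main
fi

# A teammate's fresh clone now gets the new commit without --remote.
git clone -q "$work/my_project.git" "$work/teammate"
git -C "$work/teammate" -c "$allow" submodule --quiet update --init
teammate_sub=$(git -C "$work/teammate/my_git_submodule" rev-parse HEAD)
upstream=$(git -C "$work/lib" rev-parse HEAD)
echo "teammate: $teammate_sub"
echo "upstream: $upstream"
```

Without the add, commit, and push in the middle, the teammate's clone would check out the old v1 commit, which is the situation described in the question.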

If nobody is ever going to look at the commit hash IDs stored in the gitlinks, you don't need to do this at all. But then you don't really need submodules at all (it's just that they are all that Git has here, so that you might end up using them anyway).
