Git: "mixed" submodules

Time:06-29

I'm aware of git submodules which dwell each in its own separate directory.

  1. But is there such a thing as "mixed" submodules whose content is "merged" together?

For instance:

  • Submodule1 (path ./) consists of files a.txt, b.txt and directory C with the file 1.txt
  • Submodule2 (path ./) consists of files x.txt, y.txt and directory C with the file 2.txt
  • Resulting "mixed" repo of both submodules: files a.txt, b.txt, x.txt, y.txt and directory C with the files 1.txt, 2.txt

  2. If it is not implemented in Git, is there a workaround to achieve this?

CodePudding user response:

The short answer is just "no": no, there's no such thing as a mixed submodule in Git, and there's no workaround either.

Longer

The root of the problem here is that there's a grouping that cannot be reshuffled:[1]

  • Git has a current commit, as found via the special name HEAD.
  • Git has an index or staging area (two terms for the same thing[2]).
  • Git has a working tree, which contains ordinary everyday files: that's where you do your work.

The reason for this is straightforward: committed files (files stored in commits) literally can't be changed, and are in a format that only Git itself can read.[3] Git does this for multiple reasons, the primary one being the de-duplication of files whose content is unchanged. Hence, even though every Git commit stores every file, a new commit that changes only one file adds at most one file to the storage system. (It may add no files at all. All the unchanged files are duplicates shared with some earlier commit, so only the one changed file could need new storage. But if that one file now matches a file in any earlier commit, such as a commit made last week or last year, for instance because the change put the file back the way it was, Git shares the storage from that earlier file too.)
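To see why identical content costs nothing extra, here is a minimal Python sketch of how Git names a file's content. It assumes a repository using the default SHA-1 object format; the function name git_blob_id and the sample contents are made up for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git assigns to a blob holding these bytes.

    Assumes the SHA-1 object format: the ID is the SHA-1 of a short
    header ("blob", the size, a NUL byte) followed by the content.
    """
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The ID depends only on the bytes, never on the file name or the commit.
# Two files with identical content therefore map to one stored object.
last_year = b"hello\n"
today = b"hello\n"
edited = b"hello, world\n"

print(git_blob_id(last_year) == git_blob_id(today))   # same content, same ID
print(git_blob_id(last_year) == git_blob_id(edited))  # different content, different ID
```

Because the name of an object is computed from its content, "is this a duplicate?" reduces to "does an object with this ID already exist?".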

Thus, the committed files work like permanent archives:[4] they save a snapshot, and later, you can get that snapshot of files back. To do that you must supply the correct hash ID to Git. We won't concern ourselves too much with the way you get the hash ID, except to mention that Git's submodules work by storing hash IDs. But the archive is literally that: an archive, not a usable set of files. So, having located an old archive of files, you then have Git "unpack" the archive itself, into a usable version of the files.

The usable version of the files is the working tree or work-tree. Being a tree, this working tree has a top level—a starting point folder or directory—that may contain sub-folders. Each sub-folder is another tree and as such, it could be another Git working tree. But usually—and always, initially—it isn't.

If a subfolder isn't another Git working tree, that subfolder is part of this tree. So sub/file1, sub/file2, sub/sub2/file, and so on must all belong to this repository, as extracted from the current commit by this repository's git checkout or git switch operation.

Git keeps track of all of this through its index or staging area. Unlike many more-traditional version control systems, Git doesn't just extract the commit into the working tree: instead, it effectively copies the commit to its index / staging-area as well. Because the index holds the de-duplicated file format, and all the files are duplicates, this doesn't really take much space at all.[5]

When you run git add on any given working tree file, Git reads the working tree copy, compresses it into a freezable-format object, checks to see whether that's a duplicate, and then uses either the existing duplicate or the new object it made in order to do the checking. If that's a new file (a name that wasn't in the index before), it becomes a new name in the index; if that's an old file (a name that was there before), the new copy kicks the previous one out of the index. In either case, the index is ready to commit, both before git add and after git add.[6]

Overall it's an elegant solution: a file in your working tree is tracked if and only if it exists in this index, and git commit simply writes a new commit from the files that are already there in the index. At the initial git checkout or git switch time, all three copies of each file (the frozen committed HEAD copy, the frozen-format-but-not-actually-frozen index copy, and the usable working tree copy) match. You modify the working tree copy however you like, and now only two copies match: HEAD and index. You git add the modified working tree file, and now a different two copies match: index and working tree. When you're done modifying the index, you run git commit, and Git makes a new commit; the new commit becomes the HEAD, and now HEAD and index match again. If you git add-ed everything from your working tree, all three copies of everything match. If you deliberately skipped some git adds, you're now free to add those and make another commit. And, if you don't want to add all the changes you made to some file, you can use git add -p, or otherwise "patch" the index copy of that file, and then make a new commit from the result.
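The edit / add / commit cycle just described can be traced with a toy model. This is only a sketch: plain dictionaries stand in for HEAD, the index, and the working tree, and the names simulate and matching_pairs are invented for the illustration. Real Git stores hash IDs, not file contents, in the index.

```python
def matching_pairs(head, index, worktree):
    """List which neighbouring copies currently hold identical files."""
    pairs = []
    if head == index:
        pairs.append("HEAD==index")
    if index == worktree:
        pairs.append("index==worktree")
    return pairs


def simulate():
    """Walk one file through checkout, edit, git add, and git commit."""
    head = {"a.txt": "v1"}   # frozen snapshot in the current commit
    index = dict(head)       # checkout copies the commit into the index...
    worktree = dict(head)    # ...and unpacks it into the working tree
    log = [("checkout", matching_pairs(head, index, worktree))]

    worktree["a.txt"] = "v2"             # edit the usable copy
    log.append(("edit", matching_pairs(head, index, worktree)))

    index["a.txt"] = worktree["a.txt"]   # git add a.txt
    log.append(("git add", matching_pairs(head, index, worktree)))

    head = dict(index)                   # git commit: snapshot the index
    log.append(("git commit", matching_pairs(head, index, worktree)))
    return log


for step, pairs in simulate():
    print(f"{step:10} {pairs}")
```

The point of the model is that each of the three copies belongs to exactly one repository, which is what makes adding a second repository to the same tree awkward.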

But when we add submodules to the mix, everything gets tricky.


[1] You can, since Git 2.5, run git worktree add to add a new group of (HEAD, index, working-tree), but the working tree you add must be in a different location in the file system. You are also required to use a different branch name, or detached-HEAD mode, in the added working tree; this is for reasons we won't get into here.

[2] There's a third term, cache, that is rarely used these days, but still shows up in command-line flags: for instance git rm --cached refers to removing a file from the index / staging-area, without removing it from the working tree. Some commands, such as git diff, take either --staged or --cached, and some commands, such as git rm, still only allow the primeval-Git "cache" word here. This might be fixed someday so that you can git rm --staged, but it's still the state as of Git 2.37.

[3] Git's internal objects can be stored as either "loose objects" or "packed objects". Loose objects are merely zlib-compressed, so a lot of programs these days would theoretically be able to read these files, but packed objects are much more complicated; very few programs would know how to read one object out of a packed-objects file.

[4] No part of any commit can ever be changed, so the files in a commit are as permanent as the commit itself. A commit can, however, become "dead" in a way. The technical term for this is unreachable. Some (not all!) Git implementations eventually "garbage collect" these unreachable commits and hence literally discard them entirely; if and when that happens, that particular archive is gone. The general rule here, which is now weaker than it was in primeval Git, is that if you can look up the commit hash ID at all, you can get all of its files back. If you can't look up the commit hash ID, you can't get the files from this repository: try another clone.

[5] Technically, the index holds a blob hash ID for each file, along with a file name and mode and a whole bunch of cache data. There's also a staging slot number for each such file, for use during merge conflicts. This is slightly different from commits, which use tree objects to store file names and lack the cache data and staging slot. It comes to very roughly 100 bytes or so per file (depending on file name length) to store the index entries. Of course, once you've changed a file and git add-ed it, Git has to store the new blob object somewhere, if that's not a duplicate. (Git does not store this directly in the index. Instead, it goes into the objects database as usual. But that's just an implementation detail.)

[6] The one exception to this rule occurs when the index contains nonzero staging slot entries, which represent unmerged files. When this is the case, git commit fails with an error saying that you have unmerged files. Using git add tells Git to kick out the nonzero stage entries and put in a stage-zero entry, thereby resolving the unmerged state for that file.


Enter submodules, exit elegance

Given that the <HEAD, index, working-tree> triple represents frozen, semi-frozen, and liquid files for one commit from one repository, how do we mix in a second commit from a second repository? Git's current answer to this is to use submodules.

A submodule, in Git, is implemented as several things that must match up. Note that each submodule lives within what Git calls a superproject.

  • There is a file named .gitmodules that goes at the root of the working tree of the superproject. This file contains instructions that Git can use to run git clone. That's basically all it's for!

  • There is, in each commit in the superproject, an entity that Git calls a gitlink. This entity supplies two things: the path name at which the submodule is to have some commit checked out, and a raw hash ID of the commit in the submodule repository.[7]

To make this work, Git will:

  • clone the submodule, if that hasn't been done yet, using the instructions in the .gitmodules file;
  • run git checkout or git switch --detach using a raw commit hash ID.

The git checkout / git switch fills in an index and working tree. It sets the HEAD of the <HEAD, index, working-tree> triple to the raw commit hash ID, fills in the index, and fills in the working tree. For all of this to work, that working tree must be an entire working tree for a Git repository. This means it must be a folder, presumably a sub-folder within the working tree of the superproject for which this submodule is a submodule.

When all of these conditions are in place, submodules work. The sub-folder within the working tree of the superproject isn't considered "part of" the superproject, because Git doesn't "track" directories at all: only files—and in this special submodule case, gitlinks—can go into an index / staging-area.

For your case, though, you want some set of files stored in the superproject to go into some files and sub-directories in the working tree. Then you want some set of files stored in a submodule to go into the same files and directories, or for more than one submodule to go into a single working tree. But Git has this rule: one index "tracks" one tree. Files that exist in the tree, but not in the index, are "untracked files". You want to have a situation in which some files in some tree are tracked in one index, and other files in that same tree are tracked in the other index. Git can't do that: it would have to read both indices and assign files to one or the other.
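A toy sketch of the "one index tracks one tree" rule, using the file names from the question. Sets of path names stand in for the real index and working tree, and the helper name untracked is made up for the illustration.

```python
def untracked(tree_files, index_paths):
    """Files present in the working tree but absent from this one index."""
    return sorted(set(tree_files) - set(index_paths))


# The "mixed" working tree the question asks for.
tree = {"a.txt", "b.txt", "C/1.txt", "x.txt", "y.txt", "C/2.txt"}

# Each repository has its own index and knows only its own files.
index1 = {"a.txt", "b.txt", "C/1.txt"}   # Submodule1
index2 = {"x.txt", "y.txt", "C/2.txt"}   # Submodule2

print(untracked(tree, index1))  # Submodule2's files look untracked here
print(untracked(tree, index2))  # Submodule1's files look untracked here
```

Each repository, consulting only its own index, sees the other repository's files as untracked clutter. Nothing tells it that a second index owns them.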


[7] Note how these two things, a pathname and a hash ID, are the same two things that any index entry records for any file. It's just that for regular files, those are the file's name, e.g., path/to/file.ext, and a blob hash ID. For a submodule, those are the submodule's path, e.g., module or sub/module, and a commit hash ID. A file entity in a staging slot has mode 100644 or mode 100755, and a gitlink entity has mode 160000.
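As a rough illustration of how the mode distinguishes the two kinds of entry, the sketch below classifies lines in the format printed by git ls-files --stage (mode, hash ID, staging slot, a tab, then the path). The function name classify_entry is invented, and the sample lines are made-up data: the gitlink hash ID in particular is a placeholder, not a real commit.

```python
# Modes Git records in index entries (120000, for symbolic links,
# is included for completeness; the text above does not discuss it).
KINDS = {
    "100644": "regular file",
    "100755": "executable file",
    "120000": "symbolic link",
    "160000": "gitlink (submodule commit)",
}


def classify_entry(line: str):
    """Split one 'git ls-files --stage' line into (path, kind, hash ID, stage)."""
    meta, path = line.split("\t", 1)
    mode, object_id, stage = meta.split(" ")
    return path, KINDS.get(mode, "unknown"), object_id, int(stage)


# Made-up sample output; the second hash ID is a placeholder.
sample = [
    "100644 ce013625030ba8dba906f756967f9e9ca394464a 0\tpath/to/file.ext",
    "160000 0123456789abcdef0123456789abcdef01234567 0\tsub/module",
]

for line in sample:
    path, kind, object_id, stage = classify_entry(line)
    print(f"{path}: {kind}, {object_id[:7]}, stage {stage}")
```

For the gitlink entry, the hash ID names a commit in the submodule's repository, not an object in the superproject.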


What you can do

Depending on what, if anything, you want to do with these submodules and superproject, you might not care about correctly indexing the "submodule files". If that's the case, you can fake it. In fact, there's probably no reason to use submodules at all at this point:

  • Have your superproject in a Git repository as usual.
  • Check out a superproject commit as usual. You now have a working tree.
  • In that working tree, create an empty directory data/ or whatever, and create a .git sub-directory there if necessary or desired so that the superproject won't even attempt to look inside it, and/or use .gitignore in the superproject to avoid any such looking-inside.
  • Now clone and check out the commits you like from the "submodules", which might now just be separate Git repositories. Move or copy all the working tree files into the data/ directory.

Note that you can, if you like, use the existing checked-out "submodule" working trees as templates to know how to re-assign the "superproject" files back to the submodules if and when such files are updated. And you can eliminate the "superproject" entirely, and just have the two "submodules" (two independent Git repositories): instead of making a data/ you just make the target directory where you're going to combine working trees.
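A minimal sketch of the "combine working trees" step, assuming the repositories are already cloned and checked out somewhere. The function name combine_trees and the returned manifest (relative path mapped to the repository it came from) are inventions of this sketch, not anything Git provides. Refusing to copy on a name collision is just one possible policy.

```python
import shutil
from pathlib import Path


def combine_trees(sources, target):
    """Copy the files of several checked-out trees into one target directory.

    sources maps a label (e.g. "Submodule1") to that repository's working
    tree. Returns a manifest {relative path: label} recording each owner.
    """
    owner = {}
    # First pass: decide ownership and detect collisions before copying.
    for label, root in sources.items():
        for path in sorted(Path(root).rglob("*")):
            parts = path.relative_to(root).parts
            if not path.is_file() or ".git" in parts:
                continue  # skip directories and Git's own bookkeeping
            rel = "/".join(parts)
            if rel in owner:
                raise FileExistsError(
                    f"{rel} exists in both {owner[rel]} and {label}"
                )
            owner[rel] = label
    # Second pass: copy, creating shared directories such as C/ as needed.
    for rel, label in owner.items():
        dest = Path(target) / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(Path(sources[label]) / rel, dest)
    return owner
```

Called with the two trees from the question, this would fill the target with a.txt, b.txt, x.txt, y.txt and a shared C/ holding 1.txt and 2.txt. The manifest is what later lets a tool decide which repository an edited file belongs to.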

It's up to you, as the programmer, to decide what to do about name collisions: what if the two "submodules" both have files with the same name? It's up to you, as the programmer, to decide what to do about a name vacuum: what if you create, in the work area, a new file? How will you decide which "submodule" this file should be copied back to, in order to make a new commit in that "submodule"?
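One possible answer to the name-vacuum question, sketched under the assumption that you kept a manifest mapping each relative path to the repository it came from (a hypothetical structure of your own tooling, not a Git feature). The function name route_back is likewise invented. Files with no recorded owner are reported instead of guessed at.

```python
import shutil
from pathlib import Path


def route_back(target, sources, owner):
    """Copy files from the combined directory back to their repositories.

    owner is the manifest {relative path: label}; sources maps each label
    to that repository's working tree. Returns the relative paths of files
    that have no owner, so that a human (or a rule you write) can decide.
    """
    orphans = []
    for path in sorted(Path(target).rglob("*")):
        if not path.is_file():
            continue
        rel = "/".join(path.relative_to(target).parts)
        label = owner.get(rel)
        if label is None:
            orphans.append(rel)  # new file: which repository should get it?
            continue
        dest = Path(sources[label]) / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, dest)
    return orphans
```

After this runs, each repository's working tree holds the updated copies of its own files, ready for an ordinary git add and git commit there. The returned list is exactly the set of decisions the tool cannot make for you.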

These are the questions you'll need to answer, after which it's just a Small Matter of Programming to write the tools you want. They won't be Git tools—not exactly anyway—but because you decided how they work, they will do the job you want. Or rather, they will do the job you tell them to.
