When I pull a submodule using git submodule update --init --force --remote
it creates new files containing a git diff, for example:
diff --git a/app/Services/Payment b/app/Services/Payment
index 72602bc..a726378 160000
--- a/app/Services/Payment
+++ b/app/Services/Payment
@@ -1 +1 @@
-Subproject commit 72602bc5d9e7cef136043791242dfdcfd979370c
+Subproject commit a7263787e5515abe18e7cfe76af0f26d9f62ceb4
I don't know what these files are or how to get rid of them, and when I remove them the submodule checks out the old commit.
CodePudding user response:
A git submodule update
should not "generate" any file, beside the submodule folder content.
A git diff
in the parent repository might show you what you mention, as seen in "Starting with Submodules":
If you run
git diff
on that, you see something interesting:
$ git diff --cached DbConnector
diff --git a/DbConnector b/DbConnector
new file mode 160000
index 0000000..c3f01dc
--- /dev/null
+++ b/DbConnector
@@ -0,0 +1 @@
+Subproject commit c3f01dc8862123d317dd46284b05b6892c7b29bc
Although
DbConnector
is a subdirectory in your working directory, Git sees it as a submodule and doesn’t track its contents when you’re not in that directory.
Instead, Git sees it as a particular commit from that repository.
CodePudding user response:
TL;DR
Your issue here is the use of --remote
. Stop doing that.
Long
You mention in a comment on VonC's answer that:
When I [run]
git status
[I get]
modified: app/Services/Notification (new commits)
modified: app/Services/Payment (new commits)
modified: database/migrations (new commits)
The (new commits)
part means: the commit hash ID your submodule is actively using (through its current checkout) differs from the commit hash ID your index (proposed next commit) says should be used.
There's a lot of jargon here ("submodules", "gitlinks", "index", "commit hash ID") and hence a lot to unpack. We'll get to this in just a moment.
Note that the output of git status
above is a more-compact representation of the output of git diff
that you quoted in your original question:
diff --git a/app/Services/Payment b/app/Services/Payment
index 72602bc..a726378 160000
--- a/app/Services/Payment
+++ b/app/Services/Payment
@@ -1 +1 @@
-Subproject commit 72602bc5d9e7cef136043791242dfdcfd979370c
+Subproject commit a7263787e5515abe18e7cfe76af0f26d9f62ceb4
What we see here is that for app/Services/Payment
, your (main, top-level, or "superproject") repository's index says that this particular submodule should use commit 72602bc5d9e7cef136043791242dfdcfd979370c
. But it's actually using commit a7263787e5515abe18e7cfe76af0f26d9f62ceb4
instead. We've just added one more jargon term to define: superproject.
Some initial definitions
Let's start with the definition of a Git repository. A repository is, at its heart, a pair of databases. One is a database of commits and other internal Git objects. The other database holds names—human-readable names, because the names Git uses for its own objects are incomprehensible.
A commit is one of the four types of internal objects that Git stores in the first (usually much larger) database. These commits are numbered, with very large numbers that range up to 2^160-1. These numbers are expressed in hexadecimal, as, e.g., 72602bc5d9e7cef136043791242dfdcfd979370c
. (The commits are the only ones you normally interact with in the way we're about to describe, so we'll just conveniently ignore the remaining three, but they're also all numbered.)
The numbers look random, though they're actually the output of a cryptographic hashing function and hence entirely non-random. The fact that they come out of a hash function is why we call them hash IDs too. But the real point here is that they seem to be totally scrambled, and no human is ever going to remember them. We need a computer for that.
Fortunately, we have a computer. We simply have the computer remember these hash IDs for us, using things like branch names and tag names. Each commit also stores, within itself, the hash ID(s) of some previous commits. We don't really need to worry about that right here, but this is how branches really work in Git.
So:
- a repository is
- a pair of databases, where one database holds commits
- which have hash IDs or big ugly numbers.
We and Git use the second database, of names, to find the hash IDs of particular commits, and we use the commits to find more hash IDs of more commits, and so on.
Commits are read-only: the working tree and the index
Now, a crucial thing to know about these commits, and indeed all of Git's internal objects, is that they are all read only. They have to be, because of the hashing trick: the hash ID is a function of every single bit that goes into the internal object, and we find the object by the hash ID, so the hash ID must always match. If the hash ID of some object we extract from the database doesn't match the hash ID we used to find it in the database, Git decides the database is corrupt.[1]
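To make the hashing trick concrete, here is a small sketch using a blob (a file's contents) rather than a commit, since a blob is the simplest object to build by hand. It assumes the default SHA-1 object format and that a sha1sum program is installed; neither is guaranteed on every system.

```shell
#!/bin/sh
# Sketch: a blob's hash ID is the SHA-1 of a short header plus the content.
# Assumes the default SHA-1 object format and that sha1sum is available.
set -eu

# Ask Git for the ID of a 6-byte blob containing "hello\n".
git_id=$(printf 'hello\n' | git hash-object --stdin)

# Recompute it by hand: "blob <size>", a NUL byte, then the content.
manual_id=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)

# Changing a single byte of content yields a different ID.
other_id=$(printf 'hellp\n' | git hash-object --stdin)

echo "git:    $git_id"
echo "manual: $manual_id"
echo "other:  $other_id"
```

The first two IDs agree because Git computes exactly this kind of checksum over the object's bytes, which is why an object can never be changed in place.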
So the commits are completely read-only. Not only that, but the files inside each commit—we didn't define this earlier, but each commit holds a full snapshot of every file—are in a special Git-only format, compressed and de-duplicated, that only Git can read. (Literally nothing can write over them since everything is read-only.)
What this means is that just to use some commit, we must extract that commit. Git will extract a commit by:
- reading the compressed and Git-ified files that are inside the commit;
- expanding them into ordinary read/write files; and
- writing out those files into a working tree.
This working tree—another bit of jargon—is where we actually do our work. Here, we can see, read, and even write to files. They exist as files, not as read-only, Git-only database entries. So, now we can get work done.
The working tree also enables us to make new commits, but here, Git inserts an extra stumbling block. Before Git will allow us to make a new commit, Git requires that we copy any updated files back into Git.
This step actually makes a certain amount of sense, because the files we see and work on / with in our working tree are not in Git at all. They may have been copied out of Git (out of a commit or one of its supporting objects) but once they are out, they are out.
Git calls the place that Git makes us re-copy updated files by three different names: the index, which as a name makes no sense by itself; the staging area, which refers to how we and Git use the index; and the cache, which is hardly ever used any more but still shows up as the flag in git rm --cached
for instance.
The index's role as staging area is pretty straightforward. It takes on an expanded role during merge conflicts, but since we are not worried about these here, we'll just look at how we and Git use it as a staging area.
When we first check out a commit—with git checkout
or git switch
—Git needs to expand out all the compressed and Git-ified files into our working tree. But Git secretly sticks a "copy" of each of these files into its index / staging-area. I put the word "copy" in quotes here because Git's internal file copies are all de-duplicated. This is why a Git repository doesn't become enormously fat even though every commit stores every file: most commits re-use most files, and in this case, the re-used file takes no space at all, because it's been de-duplicated away.
The same goes for these index "copies": they're duplicates, because the file in question is in the commit. So the index "copies" take no space.[2] But the key for making a new commit is this: the index copies are exactly what is going to go into the next commit.
In other words, the index holds your proposed next commit. Right now, having done a "clean" checkout of some existing commit, the index matches the commit. But now you can modify some file(s) in the working tree, if you like. Once you have modified a working tree file, you're required to copy it back into Git's index. You do this with git add
, which:
- reads the working tree copy;
- compresses it and otherwise Git-ifies it;
- checks to see if the result is a duplicate; and
- if it is a duplicate, uses the original (throwing away the temporary Git-ified copy), otherwise uses the new Git-ified file, and uses this to update the index.
The result is that the index now contains your proposed next commit—just as it did before you ran git add
. It's just that now, your proposed next commit has been updated.
You repeat this for all files you intend to update: update them in the working tree, then, sooner or later, but always before running git commit
, run git add
as needed. The add
step updates your proposed next commit from whatever you are adding. (Note that a totally-new file goes into the index too, in this same way, it's just that it does not have to kick out some existing de-duplicated copy.)
Hence we now know two things:
- The working tree holds the useful copies of your files.
- The staging area—or index—holds the proposed next commit, which you update after you update the working tree.
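The following sketch shows the index behaving as the proposed next commit. It works in a throwaway repository (the directory and file names are made up for the demonstration) and reads the blob hash ID that the index holds for one file before an edit, after the edit, and after git add.

```shell
#!/bin/sh
# Sketch: the index only changes when we run "git add".
# The repository, file name, and identity below are throwaway assumptions.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d) && cd "$work"
git init -q demo && cd demo

echo "version 1" > file.txt
git add file.txt
git commit -q -m "first commit"

# Blob ID the index holds for file.txt right after a clean commit.
before=$(git ls-files -s file.txt | cut -d' ' -f2)

# Edit the working tree copy only; the index is untouched.
echo "version 2" > file.txt
unstaged=$(git ls-files -s file.txt | cut -d' ' -f2)

# Copy the updated file back into the index.
git add file.txt
staged=$(git ls-files -s file.txt | cut -d' ' -f2)

echo "index before edit: $before"
echo "index after edit:  $unstaged"
echo "index after add:   $staged"
```

Only the last value differs: editing the working tree does nothing to the proposed next commit until git add runs.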
When you do run git commit
, Git simply packages up whatever is in the index at that time and puts that into the new commit as the set of Git-ified, read-only, stored-forever, compressed and de-duplicated files.[3]
[1] What we can do at this point is currently rather limited. The most common approach to handling corruption is to throw away the database entirely and clone a new one from a good copy, which works fine since Git is distributed and every repository has thousands of copies "out there". Of course, it stops working if there's no other copy.
[2] They take a bit of space to hold the file's name, an internal blob hash ID, and a bunch of cache data (that's where the name cache comes in again), which typically amounts to a bit under 100 bytes per file: hardly anything these days.
[3] If you use git commit -a
, note that this is roughly equivalent to running:
git add -u
git commit
That is, all the -a
option really does is insert an "update" style git add
before committing. Git still builds the new commit out of the (updated-by-add) index. There are several technical complexities here though. These have to do with atomicity and the operation of Git hooks. Putting them all together means that if you do use pre-commit hooks, you must be very clever at writing these pre-commit hooks, and/or avoid using git commit -a
. This is not the place for the details, though.
Submodules lead to an explosion of Git repositories
Now that you know:
- what a repository is; and
- how the index and working tree work
we're just about ready to move on to Git's submodules.
The very shortest definition of a Git submodule is that it is another Git repository. This definition is perhaps a little too short, though. It leaves out a key item, so let's try again: A submodule is:
- a Git repository, where
- some other Git repository refers to this Git repository; and
- some other Git repository exercises some control over this Git repository.
We now know that there must be at least two Git repositories involved, and one repository is put into some sort of supervisory position over the other.
This is how we define the term superproject: a superproject is a Git repository that has a submodule. The superproject is the overseer / supervisor.
One superproject can be the superproject of multiple submodules. (This is the case for you: you have at least three submodules. So you have at least four Git repositories involved.)
A Git repository that is acting as a supervisor—playing the superproject role—can itself be a submodule for another Git repository. In this case, the "middle" repository is both submodule and superproject. I don't know if you have any of these: there's no evidence one way or another in your question.
Now, one thing about most Git repositories is this: they're clones of some other Git repository. We mostly work with a clone. So let's suppose that you have, as your superproject, your clone R1 of some repository R0. If your clone R1 is the superproject for three submodules, those three Git repositories are themselves probably clones of three more repositories. So we're suddenly talking about at least eight Git repositories here, in your basic question!
With eight or more repositories, things can rapidly become quite confusing. There's no longer the repository, the working tree, the index, and so on. Instead, there are eight repositories, four clones on your computer, four working trees, four Git index things, and so on.
We need to be able to talk about each repository, index, and working tree independently, even though they may be somewhat interdependent. This means we need names for each one. To simplify things somewhat, I'm going to use the name R for your superproject git clone
, S0 for one of the repositories representing app/Services/Payment
, and S1 for another of these.
How this all works
You cloned your superproject repository R from somewhere (from some repository R0), but after that, we can stop thinking about it for a while, so we'll just think about R itself. Your repository R has commits, and these commits contain files and so on.
You selected some commit in R to check out:
git checkout somebranch
The name somebranch
resolves to a raw commit hash ID H
, and this is the commit your Git fishes out of R to populate the index and working tree so that you can use R.
There are, as yet, no additional repositories. There is, however, a file named .gitmodules
that came out of commit H
in R. Moreover, commit H
lists some gitlinks. A gitlink is a special entry in a commit, and it contains two things:
- a path name, in this case app/Services/Payment, and
- some commit hash ID S (in this case 72602bc5d9e7cef136043791242dfdcfd979370c).
These gitlinks go into the index in R. We'll just talk about this one particular gitlink.
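You can look at a gitlink directly. The sketch below builds throwaway stand-ins for S0, R0 and R in a temporary directory (the file names and commit messages are invented), then inspects R's index right after cloning, before any submodule has been cloned. The protocol.file.allow setting is there because recent Git versions refuse local-path submodule clones by default.

```shell
#!/bin/sh
# Sketch: a gitlink is an index entry with mode 160000 and a commit hash ID.
# S0, R0 and R are throwaway repositories created just for this demo.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d) && cd "$work"

# S0: the repository that serves as the submodule's source.
git init -q S0
echo "payment code" > S0/pay.txt
git -C S0 add pay.txt
git -C S0 commit -q -m "S0: first commit"
s0_head=$(git -C S0 rev-parse HEAD)

# R0: a superproject that records S0 at app/Services/Payment.
git init -q R0
git -C R0 -c protocol.file.allow=always \
    submodule --quiet add "$work/S0" app/Services/Payment
git -C R0 commit -q -m "R0: add submodule"

# R: a plain clone of R0; no submodule repository exists in it yet.
git clone -q R0 R
cd R

entry=$(git ls-files -s app/Services/Payment)
mode=$(echo "$entry" | cut -d' ' -f1)
link=$(echo "$entry" | cut -d' ' -f2)

echo "index entry: $entry"
echo "S0 commit:   $s0_head"
cat .gitmodules
```

The index entry carries the special mode 160000 and the hash ID of a commit in S0, while .gitmodules supplies the path and URL. No files from S0 are present in R at this point.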
If you now run git submodule update --init
(note the lack of --remote
here), your Git commands, operating on repository R, will notice this gitlink in the index. (There are no corresponding files, just the gitlink.)
Your superproject Git commands, executing this git submodule update
, will now notice that you haven't yet cloned any submodules, and—because of the --init
option—will run a git clone
command for you. This git clone
command needs a URL. The URL comes out of the .gitmodules
file.
The repository that Git clones at this point is repository S0 (perhaps over on GitHub: on some server anyway). The clone gets hidden away,[4] creating a new repository S1. Your Git software now runs a git checkout
operation within S1 so as to copy a commit into a working tree and index.
The index for S1 is hidden away in the repository for S1, but the working tree for S1 is placed into app/Services/Payment
: the place you want the files you'll see and work with, from the submodule. So now the ordinary directory (or folder, if you prefer that term) app/Services/Payment
is full of ordinary files. These comprise the working tree for S1.
Your submodule S1 is now ready to use. We have three repositories we need to think about: R, S0, and S1. We have two staging areas / index-es: one that goes with R and one that goes with S1. We have two working trees to use, one that goes with R and one that goes with S1. The working tree for S1 is inside the working tree for R, but the R repository won't use it. Only the S1 repository will use it.
[4] In modern Git, the clone's .git
directory is stuffed into R in .git/modules/
. In ancient versions of Git, submodule clones go into a .git
right in the submodule path—in this case app/Services/Payment/.git
.
git submodule update --remote
The --remote
flag to git submodule update
tells it that instead of obeying the superproject gitlink—remember, this is an entry in the R index, under the name app/Services/Payment
, that currently holds hash ID 72602bc5d9e7cef136043791242dfdcfd979370c
—your Git software should enter submodule S1 and run:
git fetch origin
This reaches out to repository S0. Repository S0 has its own branch and tag names, and its own commits. Repository S1 was cloned from S0 earlier, but S0 might be updated any time. So the git fetch
step reaches out to the Git software that handles S0 and gets, from that Git, any new commits for S0 and puts them in your clone S1. Then, as the final step, git fetch origin
within S1 creates or updates all of the remote-tracking names in S1 that go with the branch names from S0.
This updates your (local) origin/master
, origin/develop
, origin/feature/tall
, and so on in your S1 based on the branch names as seen in S0. You now have, in S1, all the commits from S0, and you know which commit they (S0) call the "latest" commit on their master
for instance.
What your git submodule update --remote
does now is turn your name origin/master
into a hash ID. The hash ID your S1 Git gets from this operation is not 72602bc5d9e7cef136043791242dfdcfd979370c
. It's actually a7263787e5515abe18e7cfe76af0f26d9f62ceb4
.
Your superproject Git now directs your S1 Git to run:
git checkout --detach a7263787e5515abe18e7cfe76af0f26d9f62ceb4
(or the same with git switch
; in any case it's all being done internally in the latest versions of Git, though older ones literally run git checkout
here).
This populates your S1 index and working tree from commit a7263787e5515abe18e7cfe76af0f26d9f62ceb4
. So that's now the current commit in your S1.
Meanwhile, your superproject repository R still calls for commit 72602bc5d9e7cef136043791242dfdcfd979370c
. That's what is in the index / staging-area for new commits you will make in R.
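The whole situation can be reproduced in miniature. This sketch uses throwaway repositories in a temporary directory (names, file contents and commit messages are invented, and the hash IDs will differ from yours): R records an old commit of S0, S0 then gains a new commit, and git submodule update --remote moves S1 to it while R's index stays put.

```shell
#!/bin/sh
# Sketch: --remote moves the submodule checkout but not the superproject's gitlink.
# All repositories here are throwaway stand-ins for R, S0 and S1.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d) && cd "$work"

git init -q S0
echo v1 > S0/pay.txt
git -C S0 add pay.txt
git -C S0 commit -q -m "S0: old commit"
old=$(git -C S0 rev-parse HEAD)

git init -q R
cd R
# protocol.file.allow is needed for local-path submodules in recent Git.
git -c protocol.file.allow=always \
    submodule --quiet add "$work/S0" app/Services/Payment
git commit -q -m "R: record submodule at old commit"

# S0 gains a commit after R recorded the old one.
echo v2 > "$work/S0/pay.txt"
git -C "$work/S0" commit -q -am "S0: new commit"
new=$(git -C "$work/S0" rev-parse HEAD)

# Fetch inside S1 and check out the tip of the followed remote branch.
git -c protocol.file.allow=always submodule --quiet update --init --remote

checked_out=$(git -C app/Services/Payment rev-parse HEAD)
recorded=$(git ls-files -s app/Services/Payment | cut -d' ' -f2)

echo "S1 has checked out: $checked_out"
echo "R's index calls for: $recorded"
git status --short
git diff app/Services/Payment
```

The final git diff prints the same kind of "Subproject commit" output shown in the question: it is a report on the mismatch, not a file that Git has created.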
What to do about all this
If you want R to start calling for a7263787e5515abe18e7cfe76af0f26d9f62ceb4
, you will simply need to run:
git add app/Services/Payment
while working in R. This directs the R Git to run git rev-parse HEAD
inside the S1 Git, which finds the current checked-out commit's hash ID. This hash ID then goes into the R index / staging-area, so that the next commit you make in R will call for that commit by that hash ID.
If you want S1 to have commit 72602bc5d9e7cef136043791242dfdcfd979370c
checked out instead, you have a number of options:
(cd app/Services/Payment && git checkout --detach 72602bc5d9e7cef136043791242dfdcfd979370c)
will do it, for instance. Or you can run git submodule update
. This command, run in R, tells the R Git to read the commit hash IDs from the R index and run git checkout
commands within each submodule, to force the submodule checkout back to the desired commit.
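Both ways out can be tried in a scratch setup. The sketch below rebuilds the mismatch with throwaway repositories (all names and contents are invented for the demo), first lets git submodule update pull S1 back to the recorded commit, and then moves S1 forward again and records the new commit with git add.

```shell
#!/bin/sh
# Sketch: two ways to resolve the "(new commits)" mismatch.
# All repositories here are throwaway stand-ins for R, S0 and S1.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d) && cd "$work"

git init -q S0
echo v1 > S0/pay.txt
git -C S0 add pay.txt
git -C S0 commit -q -m "S0: old commit"
old=$(git -C S0 rev-parse HEAD)

git init -q R
cd R
git -c protocol.file.allow=always \
    submodule --quiet add "$work/S0" app/Services/Payment
git commit -q -m "R: record submodule at old commit"

echo v2 > "$work/S0/pay.txt"
git -C "$work/S0" commit -q -am "S0: new commit"
new=$(git -C "$work/S0" rev-parse HEAD)

# Create the mismatch: S1 moves to the new commit, R still records the old one.
git -c protocol.file.allow=always submodule --quiet update --remote

# Option 1: make S1 obey R's index again (no --remote).
git -c protocol.file.allow=always submodule --quiet update
after_update=$(git -C app/Services/Payment rev-parse HEAD)

# Option 2: move S1 forward again, then record that commit in R's index.
git -C app/Services/Payment checkout -q --detach "$new"
git add app/Services/Payment
after_add=$(git ls-files -s app/Services/Payment | cut -d' ' -f2)

echo "S1 HEAD after plain update: $after_update"
echo "R's gitlink after git add:  $after_add"
```

After option 2, the change is only staged; it takes a git commit in R to make the new gitlink part of R's history.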
When you run git submodule update --init
, if you add --remote
, you're directing your R Git to fetch in each submodule and find the latest commit from some branch in the source repository (S0 in our examples here). The chosen branch is set by submodule.<name>.branch in .gitmodules or in R's Git configuration; when it is unset, Git falls back to a default, which tends to be master
or main
these days. The same goes for git submodule update
without --init
. The --init
merely means do the initial clone if needed. The --remote
part means do the fetch and get the hash ID from a remote-tracking name. The crucial part is always the hash ID. That comes from:
- your index, or
- some remote-tracking name
and that controls which commit your Git instructs the submodule Git to check out.
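If you do want --remote to follow a particular branch, the setting it reads can be written with git config. This is a minimal sketch: it writes to a .gitmodules file in a temporary directory rather than a real superproject, it assumes the submodule's name equals its path (which is what git submodule add uses by default), and "develop" is a made-up branch name.

```shell
#!/bin/sh
# Sketch: pin the branch that "git submodule update --remote" follows.
# Assumes the submodule name equals its path; "develop" is a made-up branch.
set -eu
work=$(mktemp -d) && cd "$work"

# Write the setting into a .gitmodules file (normally the one in R).
git config -f .gitmodules submodule.app/Services/Payment.branch develop

# Read it back the same way Git would look it up.
branch=$(git config -f .gitmodules --get submodule.app/Services/Payment.branch)

echo "branch followed by --remote: $branch"
cat .gitmodules
```

In a real superproject you would commit the changed .gitmodules so that other clones of R follow the same branch.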
The git status
and git diff
commands, run in R, merely report whether the index (R's index) and working tree (S1's working tree checkout in this case) match. If not, git diff
tells you what the difference is, and git status
just says "they are different".