New to git and want to be sure I understand what I'm doing.
Currently, I have a local repo pushing to a public remote repo. There are certain files and directories that I want to separate from the public remote repo and add to a private remote repo.
My plan is to add these files and directories to .gitignore, create a new branch named privateRepo, then link it to the private remote repo. Afterwards, switch to the branch privateRepo and add/commit/push the local contents.
git checkout -b privateRepo
git remote add privateRepo <url>
git switch privateRepo
git add <files & dir>
git commit -m "message"
git push
After this, I can use git switch main to return to my main branch and push without influencing privateRepo.
Is this correct?
Will this pose any problems if I want to pull from privateRepo to my local repo?
Answer:
You can do what you're proposing. I would advise against it because:
- it doesn't work the way I think you think it does; and
- it's far too easy to accidentally send the wrong commits to the wrong "other Git repository", thereby publishing forever your private files.
To understand this, we need to cover the basics of what a repository is and does for you. A repository is, primarily, a collection of two databases:
- One database holds commits and other supporting Git objects. This is the "object database". It is a simple key-value store in which the keys are hash IDs (commit hash IDs and other supporting object hash IDs) and the values are the objects (the commits and the stuff commits use to store files forever).
- The other database holds names. The keys in the object database, the hash IDs, are too big and ugly for humans to deal with. So we don't: we use names. Git keeps the second database around so that it can translate from name to hash ID, to find the commits (and other internal objects). Humans can deal with names, like master or main as a branch name, for instance.
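To see the two databases side by side, here is a minimal sketch in a throwaway repository. The directory, the demo identity, and the file contents are all made up for illustration; only the Git commands themselves are real.

```shell
# Hypothetical demo identity so that git commit works anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)                        # throwaway directory
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main    # make sure the first branch is named main
echo "hello" > README.md
git add README.md
git commit -q -m "first commit"

# Names database: translate the branch name into a hash ID.
commit_id=$(git rev-parse main)
echo "main -> $commit_id"

# Object database: look up the object stored under that hash ID.
obj_type=$(git cat-file -t "$commit_id")
echo "object type: $obj_type"
git cat-file -p "$commit_id"             # the commit itself: tree, author, message
```

The branch name is just a convenient handle; everything Git actually stores is found through the hash ID that the name resolves to.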
When you use more than one Git repository and cross-connect them with git fetch or git push (note that git pull is a convenience command that means run git fetch, then run a second Git command, so it's really git fetch in disguise), what you're doing is transferring commits between the repositories. You have few options here: you either send a whole commit, or none of it. You either receive a whole commit, or none of it.[1] If you do send, or receive, a commit, you also send or receive all its predecessor (ancestor) commits that you or they don't have.[2]
The upshot of this is that if you accidentally send one commit to the wrong (i.e., public) repository, you'll likely send all of them. Now they have every version of every private file you ever committed. You can try to take this back (GitHub makes this somewhat difficult, as you must contact GitHub support), but during the window where these commits are visible, anyone and everyone can copy them from the public repository.
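Here is a small sketch of that failure mode, using a local bare repository as a stand-in for the public remote. All paths, file names, and the "secret" are hypothetical. Note that the private file is deleted before the push, and it still ends up on the public side because its commit is an ancestor of the one being pushed.

```shell
# Hypothetical demo identity so that git commit works anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/local"
git init -q --bare "$work/public.git"    # stand-in for the public remote
cd "$work/local"
git symbolic-ref HEAD refs/heads/main

# First commit contains a private file.
echo "password=hunter2" > secrets.txt
git add secrets.txt
git commit -q -m "private: add secrets"

# Second commit removes it and adds only public content.
git rm -q secrets.txt
echo "hello" > README.md
git add README.md
git commit -q -m "public: add README, drop secrets"

# Push the branch: the newest commit drags its ancestor along.
git push -q "$work/public.git" main

received=$(git --git-dir="$work/public.git" rev-list --count main)
leaked=$(git --git-dir="$work/public.git" show main~1:secrets.txt)
echo "commits received by public repo: $received"
echo "recoverable from public repo: $leaked"
```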
If you instead split the files into "public" files, which go into one repository that you share with GitHub as a public repository, and "private" files, which go into a different repository that you share with GitHub only as a private repository, the whole thing is much more manageable.
You can coordinate the two repositories with Git's submodules, as Ôrel mentioned in a comment. Submodules have their own headaches and drawbacks, enough so that people sometimes call them sob-modules, but they do achieve a proper public/private split.
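As a rough sketch of the submodule arrangement, the following builds two throwaway repositories and attaches the private one inside the public one. Everything here is an assumption for illustration: the paths, the file names, and the use of a local path instead of a real private URL. The protocol.file.allow setting is only needed because this demo clones from a local path, which newer Git versions refuse by default for submodules.

```shell
# Hypothetical demo identity so that git commit works anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# The private repository: holds only the private files.
git init -q "$work/private"
cd "$work/private"
git symbolic-ref HEAD refs/heads/main
echo "password=hunter2" > secrets.txt
git add secrets.txt
git commit -q -m "private files"

# The public repository: holds only the public files.
git init -q "$work/public"
cd "$work/public"
git symbolic-ref HEAD refs/heads/main
echo "hello" > README.md
git add README.md
git commit -q -m "public files"

# Attach the private repository as a submodule in the directory "private".
git -c protocol.file.allow=always submodule --quiet add "$work/private" private
git commit -q -m "add private submodule"

# The public repository records only a pointer (mode 160000) to a commit
# in the private repository, not the private files themselves.
git ls-files -s private
```

Anyone who clones the public repository gets the pointer and the URL, but needs separate access to the private repository to fetch its contents.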
[1] Git is growing a new facility called partial clones where this general all-or-nothing principle is carefully pulled apart, like a Jenga tower. Pull the wrong part and the whole thing collapses, though. This is not (at least currently) meant for the kind of thing you're talking about, and I would not advise using it for that unless you plan to work on Git itself to add that. (It could in theory be used for this sort of purpose.)
[2] This rule, too, can be carefully violated with shallow clones. Shallow clones are more mature than the partial clone code, but they still don't do the kind of thing you want.
Some more words about repositories
Many people think Git is about files. It's not: it's really about commits. The commits do, however, hold files. Or they'll think that Git is about branches. It's not: it's about commits. The branch names, however, do help you (and Git) find the commits. So branch names, and files, are important. But it's really all about the commits.
Because commits are stored in the object database, they're numbered, with those big ugly hash IDs. Every unique object gets a unique number. The commit numbers in particular must be unique across every Git repository in the universe, which means the numbers must be huge (currently there are 2^160 possible numbers, and this will probably become 2^256 in the relatively near future as it turns out 2^160 is too small). That's why they are so big and ugly and random-looking, though actually they're entirely non-random: they are outputs from a cryptographic hash function (currently SHA-1; SHA-256 is the planned future).
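The "non-random" part is easy to check with git hash-object, which computes the hash ID Git would assign to some content. This sketch hashes file contents (blob objects) rather than commits, but the principle is the same: identical input always yields the identical ID, and any change yields a different one. The sample strings are arbitrary.

```shell
# Same content twice, then slightly different content.
id1=$(printf 'hello\n' | git hash-object --stdin)
id2=$(printf 'hello\n' | git hash-object --stdin)
id3=$(printf 'hello!\n' | git hash-object --stdin)

echo "first:     $id1"
echo "again:     $id2"    # identical to the first
echo "different: $id3"    # one extra character, completely different ID
```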
Each commit, though, stores two things:
- Every commit stores a full snapshot of every file, in the form the file had at the time you (or whoever) made the commit. These are stored in a special, read-only, Git-only, compressed and de-duplicated format (as objects in the database), not as ordinary computer files.
- Every commit stores some metadata, or information about the commit itself: who made it, when, and why (a log message) for instance. This metadata is as read-only as everything in the objects database. (The read-only quality comes from Git's hashing tricks, and is necessary to allow the file de-duplication as well.)
In the metadata for any one given commit, Git stores the raw hash ID(s) of a list of previous commit(s). Usually this list is just one element long. We call this single previous commit the parent of the commit. These parent IDs form backwards-looking chains:
... <-F <-G <-H
Here H stands in for the hash ID of the last commit in the chain. Commit H has, inside it, a full snapshot of all of your files, plus some metadata. The metadata in H show that you (or whoever) made the commit. They keep data such as your log message. And they show that commit H has one parent, whatever hash ID G stands in for.
Commit G, of course, also has a snapshot and metadata. Using H, Git can find G, and extract both snapshots and compare them. Whatever is the same did not change (and this is easy for Git to see because of the de-duplication via hashing). For whatever is different, Git needs to run the equivalent of git diff to figure out what changed. Git will do this if and when you ask for it, and you'll see what changed going from G to H when you view commit H.
Having viewed commit H, a command like git log now moves back one hop to commit G. Commit G has metadata, including the hash ID of earlier commit F. Commit F has a snapshot, so Git can compare F-vs-G to see what changed, and hence show you G as changes, even though G is a snapshot. Then git log can move back one hop to F, and repeat. The process ends only when Git has gone back to the very first commit ever, which (being the first commit) has no parent, or when you get tired of reading git log output and make it quit.
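The backwards-looking chain can be inspected directly. This sketch builds three commits, labelled F, G, and H to match the drawing, in a throwaway repository (the paths, identity, and file name are hypothetical) and then follows the parent links by hand.

```shell
# Hypothetical demo identity so that git commit works anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main

# Make three commits in a row: F, then G, then H.
for name in F G H; do
  echo "contents as of $name" > file.txt
  git add file.txt
  git commit -q -m "commit $name"
done

h=$(git rev-parse HEAD)       # last commit in the chain
g=$(git rev-parse HEAD^)      # its parent
f=$(git rev-parse HEAD^^)     # the parent's parent

# The metadata of H names G as its parent.
git cat-file -p "$h"
parent_of_h=$(git log -1 --format=%P "$h")

# The very first commit has no parent at all, which is where git log stops.
parent_of_f=$(git log -1 --format=%P "$f")

# Comparing the two snapshots is how Git shows H "as changes".
git diff "$g" "$h"
```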
Using commits: the work-tree and the index / staging-area
But there's a big problem here. If the content in a snapshot, the saved-for-all-time files, is read-only—and, worse, in a format that only Git itself can read in the first place—how will we ever use it?
The answer to this problem is simple enough. In a non-bare repository (which is to say, most of them), Git adds a working area, called a working tree or work-tree. To use a commit, Git simply copies all the files out of the commit, into your working tree. Now you can see your files and do your work.
It's important to realize that these files are not in Git. These are ordinary computer files, that all your ordinary computer programs can read and write and generally work with. They may have come out of Git. But at this point, they're not in Git any more. They're just files.
As you work with them, they may drift away from what's in Git. Their contents change. At some point, you might wish to take all the updated files and store them forever, in a new commit. In other, non-Git, version control systems, this is pretty easy: you run, e.g., hg commit and Mercurial figures out what you changed and makes a new commit. In Git, this is not so easy.
Instead, Git adds a hidden extra "copy" of each file. I say "copy" in quotes here because this extra "copy" of each file is pre-Git-ified: it is stored in the compressed, de-duplicated format that Git uses internally. Since all these files initially came out of some commit, they're all duplicates, and therefore none of them take any space.[3]
When you tell Git "make a new commit now", Git looks only at these hidden extra copies. So, before you make a new commit, you must run git add. What git add does is:
- read through the working tree copy of some file;
- compress and generally Git-ify it, coming up with an internal object hash ID;
- if that's a duplicate of some existing file, throw out the temporary stuff built up now and use the duplicate;
- otherwise, prep it for future committing and store it.
Either way, the git add step takes the updated file and makes it ready to go into the next commit. This replaces the copy that was there, ready to go into the next commit. Or, if the file is all-new (if there was no file with this name before), then git add has nothing to replace, and instead adds a new file, and now there is a copy, ready to go into the next commit.
In all cases, then, before git add, Git had all the files ready to go into the new commit. After git add, Git has all the files ready to go into the new commit. So Git always has all the files ready to go. What git add does is replace one, or two, or many, of those ready-to-go "copies" with updated files (new copies if needed, or re-used old "copies" if possible) and add any new files.
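You can watch the hidden "copy" being replaced with git ls-files -s, which lists each index entry along with the hash ID of its staged contents. This is a minimal sketch in a throwaway repository; the file name and contents are made up.

```shell
# Hypothetical demo identity so that git commit works anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
echo "version 1" > main.py
git add main.py
git commit -q -m "initial"

before=$(git ls-files -s main.py)       # index entry matches the commit

echo "version 2" > main.py              # edit the working tree file only
unchanged=$(git ls-files -s main.py)    # the index still holds version 1

git add main.py                         # replace the ready-to-go copy
after=$(git ls-files -s main.py)        # now the index holds version 2

echo "before edit:    $before"
echo "after edit:     $unchanged"
echo "after git add:  $after"
```

Editing the file changes nothing in the index; only git add swaps in the new hash ID.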
The area in which this extra ready-to-go not-quite-a-commit-yet thing lives has three names in Git. This may be because the primary name, the index, is meaningless. The other main name, the staging area, refers to how you use the index. The third name, the cache, is mostly defunct, but still shows up in flags like git rm --cached. I tend to use the name "index", but "staging area" has probably become the most common, and it's definitely how you use it: you "stage" the files by arranging them on/in a "staging area", ready to be "photographed" into a commit where they will live forever.
This staging area, or index, sits between the current commit and your working tree. Git's index is a lot like a commit, and is initially set up from a commit and later becomes a new commit, but the key difference between an actual commit and Git's index is that you can replace files in, add files to, and remove files from Git's index. You can't do that to a commit: the commit, once made, is set in stone.
When you do finally run git commit, Git simply packages up the files that are in the index right then. Those become the snapshot for the new commit. So you must update the index. People are often tempted by git commit -a, but there are some flaws with this and I advise users to avoid it (see below).
In any case, I find that it helps to think of the index as being between commit and working tree, like this:
HEAD        index       work-tree
---------   ---------   ---------
README.md   README.md   README.md
main.py     main.py     main.py
                        new.txt
Here, we checked out some commit that had two files in it, so HEAD (the current commit) has those two files, and Git's index has those two files, and our working tree has those two files. Unlike the HEAD and index copies, we can actually see and use the working tree copies, but all three exist.
Then we made a new file, new.txt, in the working tree. It doesn't exist in HEAD, and it doesn't exist in Git's index. Now we come to an interesting special case.
[3] They take some space for name, hash ID, and a bunch of cache data that Git uses internally. The amount of space needed depends on the name lengths and the index format (there are multiple index format numbers) but it's generally pretty tiny, on the order of 100 bytes per file.
Tracked, untracked, and ignored files
The new file we made in the working tree is not in Git. Neither are the two other files we can see, but those two do have copies that are in Git: in the HEAD commit, and in the proposed new commit prepared in the index. But new.txt is not in the index: it's not even in the proposed next commit.
At this point, new.txt is what Git calls an untracked file. An untracked file is defined as any file in your working tree that is not in Git's index right now.
Suppose, for instance, that we now run:
git rm --cached main.py
to produce this:
HEAD        index       work-tree
---------   ---------   ---------
README.md   README.md   README.md
main.py                 main.py
                        new.txt
We didn't change the commit (we can't change its contents), but now main.py is not in Git's index. The main.py in our working tree has gone from tracked to untracked.
If we run git add main.py new.txt right now, Git will:
- read the contents of main.py, compress them, discover it's a duplicate, and re-use the old main.py in the index;
- read the contents of new.txt, compress them, and probably discover it's new and make a new Git-ified copy prepared for commit, and put that in the index;
and now we will have:
HEAD        index       work-tree
---------   ---------   ---------
README.md   README.md   README.md
main.py     main.py     main.py
            new.txt     new.txt
Now we have no untracked files.
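The three pictures above can be reproduced step by step. This sketch uses a throwaway repository with the same file names; the identity, paths, and file contents are hypothetical. It uses git ls-files --others to list untracked files at each stage.

```shell
# Hypothetical demo identity so that git commit works anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
echo "readme" > README.md
echo "print('hi')" > main.py
git add README.md main.py
git commit -q -m "two files"

# Picture 1: new.txt exists only in the working tree.
echo "new" > new.txt
untracked_1=$(git ls-files --others --exclude-standard)

# Picture 2: take main.py out of the index; the working tree copy stays.
git rm -q --cached main.py
untracked_2=$(git ls-files --others --exclude-standard)
git status --short

# Picture 3: put both files into the index.
git add main.py new.txt
untracked_3=$(git ls-files --others --exclude-standard)
git status --short
```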
The tracked-ness of a file is changeable. All we have to do is create or remove index entries and/or working tree files. Those files that are in both are tracked files; those files that are only in the working tree are untracked files. And that's all there is to that—almost.
Why .gitignore does not mean what people think it means
A file that is in Git's index is never ignored.
An entry in .gitignore, listing a file's name, tells Git, in part: if this file is untracked, don't complain about it. The git status command is very useful, as it will tell us about what's in the index and what's not in the index, in terms of stuff we have forgotten to git add, for instance. That includes any file that's in the working tree, but not in the index.
But some things, like Python 2.x, generate a lot of working tree files that should never be committed. If git status complained about all your *.pyc files all the time, git status could become unusable: the useful nuggets of "oh I forgot to git add this one thing" would be buried in the useless "here are 5000 *.pyc files you could add" messages.
To prevent this, we list the *.pyc glob pattern in a .gitignore file. This makes git status shut up about these files.
It has one other effect as well. We can run git add . or git add --all or similar to do an en-masse "add everything". When we do use this, Git will skip any existing untracked file that's also listed in .gitignore. But Git only skips the untracked files, not the tracked ones. The tracked files (the ones that are in Git's index) get updated. (This is actually usually what people want.)
So, .gitignore is the wrong name. The file should be named something like .git-do-not-complain-about-these-files-when-they-are-untracked-and-also-if-they-are-untracked-and-I-use-an-en-masse-git-add-command-do-not-add-them-to-the-index-after-all. But this name is ridiculous, as are several slightly shorter versions. So it's just called .gitignore.
The key here is that it won't ignore files that are in the index. For a file to go into a commit, it has to be in Git's index. If it was in a commit and we checked the commit out, the file is now in Git's index. So once a file gets committed, it tends to sneak back into the index, even if we take it out now and then: any time we check out a commit that has the file, it's back in the index. And then its listing in .gitignore has no effect.
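This is exactly the trap in the plan from the question, so here is a sketch of it. A file that is already committed gets listed in .gitignore afterwards, alongside a *.pyc pattern. The file names (settings.conf, module.pyc) and contents are hypothetical.

```shell
# Hypothetical demo identity so that git commit works anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
echo "key=1" > settings.conf
git add settings.conf
git commit -q -m "commit settings.conf"    # the file is now tracked

# List the tracked file, and a pattern for untracked junk, in .gitignore.
printf 'settings.conf\n*.pyc\n' > .gitignore
touch module.pyc                           # untracked and ignored
echo "key=2" > settings.conf               # tracked, so still watched

# The tracked file is reported as modified; module.pyc is not mentioned.
status=$(git status --porcelain)
echo "$status"

# An en-masse add updates the tracked file and skips the ignored one.
git add .
staged=$(git diff --cached --name-only)
echo "$staged"
```

To make Git really stop following settings.conf, it would have to come out of the index (git rm --cached) and stay out of every commit you check out later.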
Why (and when) to avoid git commit -a
What git commit -a does is use a shortcut. It:
- runs git add for you, and then
- does the git commit step.
You can also do things like:
git commit --only main.py
and:
git commit --include main.py
These all manipulate the index, then do the commit. However, they're quite tricky: they make Git make one or two extra and temporary index files for the duration of the commit. These extra index files can mess with pre-commit hooks. Whether, and when, and how much, they do mess with such hooks depends on several things, including how carefully the hook writer wrote their hook, whether they were even aware of this Git trick, and whether you use the --only or --include form of git commit.[4]
That said, git commit -a usually works fine (see footnote 4). But it is roughly equivalent to running git add -u first, and then running git commit. The -u option to git add will only update known files, not add new ones. When we use git add . to en-masse add everything we've updated, this includes adding new files. The -a option to git commit can't add a new file.
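A quick sketch of that limitation, in a throwaway repository with made-up file names: one tracked file is modified, one brand-new file is created, and git commit -a picks up only the first.

```shell
# Hypothetical demo identity so that git commit works anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
echo "version 1" > main.py
git add main.py
git commit -q -m "initial"

echo "version 2" > main.py     # change to a known (tracked) file
echo "brand new" > new.txt     # a file Git has never seen

git commit -q -a -m "update via commit -a"

committed=$(git ls-tree -r --name-only HEAD)            # files in the new commit
untracked=$(git ls-files --others --exclude-standard)   # files left behind
echo "in the commit:   $committed"
echo "still untracked: $untracked"
```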
Besides this, git commit -a is just plain lazy. Sometimes laziness is a virtue, so this doesn't mean never use it. But git status, followed by careful git add operations (perhaps even git add -p operations), followed by another git status, followed by git diff --cached (or git diff --staged; these do exactly the same thing), will help you arrange to commit only the changes you want. Reading through the diff lets you compose a good commit message, or take notes at least, in another window, which you can then paste into the commit. This lets you make good, careful commits. To do this, you must generally avoid git commit -a. So it's at least a bad habit, even if laziness is occasionally a virtue.
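A non-interactive sketch of that careful workflow follows (git add -p is left out only because it prompts for input). The repository, file names, and edits are hypothetical; the point is that staging selectively lets the commit contain one change while the other waits.

```shell
# Hypothetical demo identity so that git commit works anywhere.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
echo "readme v1" > README.md
echo "code v1" > main.py
git add README.md main.py
git commit -q -m "initial"

# Two unrelated edits in the working tree.
echo "readme v2" > README.md
echo "code v2" > main.py

git status --short            # both files show as modified, nothing staged
git add main.py               # stage only the change we want in this commit
git status --short            # main.py staged, README.md still unstaged

staged=$(git diff --cached --name-only)    # what the next commit will contain
unstaged=$(git diff --name-only)           # what is being left for later
git diff --staged                          # same output as git diff --cached

git commit -q -m "update main.py only"
```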
[4] git commit -a uses the --include mode internally. This is less disruptive than the --only form, because --include needs only two index files, not three, and the one used during the commit is the one that becomes the new index after the commit if the commit succeeds. The extra file is used only for the rollback case. The --only form needs three: one for the commit, one for rollback, and one for success; all three have, at least potentially, different content.