I fear it's a really stupid question, but even after searching the web, I haven't really found anything on that topic.
I'm starting to learn Python, and I have created a first, simple project. For that I created a directory called ~/workspace. Within workspace, I created another directory for the project. So I got ~/workspace/project, where all my Python files are.
I also wanted to start using git to learn about the version control and keeping track of changes made. So for that I created a private repository on github.com.
When I tried to run git clone https://github.com/username/project . in ~/workspace/project, it told me that the directory is not empty and that I can't clone the project.
So I created another directory, ~/git/project, and ran the git clone there. Does that mean I have to copy over all the files from the working directory to the local repository and then git add?
If this is the way to do it, what is the best practice to keep track of all the changes made to the working directory in ~/workspace/project that need to be copied over?
CodePudding user response:
I think the crux of your issue is a misunderstanding: a Git repository is not a collection of files. It's a collection of commits.
Technically, a standard repository has three main parts (plus many smaller bits and pieces):
There's a big key-value database that stores all the commits, plus other internal objects that Git needs to make the commits actually work. The commits are what Git is about. Each one is numbered, but its number is weird and confusing: we don't have commit #1 followed by commit #2 and so on. Instead, each one gets a random-looking (but not actually random), huge, incomprehensible gobbledygook value like 9bf691b78cf906751e65d65ba0c6ffdcd9a5a12c. Git calls these hash IDs, or more formally, Object IDs or OIDs.
Git desperately needs the hash ID to find a commit. Git is helpless without the hash ID. So you'd have to memorize all these crazy hash IDs, which is obviously bad. To avoid that problem (having to write down hash IDs, or maybe store them in files or something), Git has:
There's a second (usually much smaller) key-value database where the keys are names: branch names, tag names, and many other kinds of names. Each name stores just one hash ID, which seems like it wouldn't be enough, but actually it is.
Finally, there's a sort of work area, a place where you can get work done. This is your working tree or work-tree and that's where you see files. These files get copied out of Git, and later copied back into Git, but while you're working on them, they're just ordinary files, and they're not actually in Git at all.
When you run git clone, Git creates a new repository. That's a new set of all three of these things: the two databases plus the working tree. Git requires that this working tree be empty, because after creating the new repository, Git is going to start filling in the two databases.
When you run git init, by contrast, you're telling Git to use the current directory as the working tree, and create just the two databases, right here. "Here" in this case is your current working directory, e.g., ~/workspace/project.
There's a close (and upside down) relationship between the repository databases and the working tree: the repository proper goes in a hidden .git directory within the working tree (at the top level of the working tree). That is, after:
cd ~/workspace/project
git init
you have a ~/workspace/project/.git/ that contains the two databases and various ancillary files. This is the bulk of the actual repository and is the only part that's actually in Git, since the working tree isn't actually in Git at all.
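As a minimal sketch of that behaviour, the commands below run git init in a directory that already holds a file. The scratch directory and the file name hello.py are assumptions that stand in for ~/workspace/project and your own Python files.

```shell
# Scratch directory standing in for ~/workspace/project (an assumption for this demo).
project=$(mktemp -d)
cd "$project"
echo 'print("hello")' > hello.py   # an existing file, like your Python sources

git init -q                        # creates .git/ right here and touches nothing else

ls -A                              # lists both .git and hello.py
```

The existing file is left alone: git init only adds the hidden .git directory next to it.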
Normally, we run git clone to get a copy of some existing project that already has a bunch of commits in it. We are asking Git to:
- make a new, empty directory (or use a directory we've already made, but it must be empty);
- run git init in that empty directory to create the .git subdirectory and initialize it;
- call up some other Git software (e.g., on GitHub) and ask them about one of their repositories;
- copy in all the commits from that other Git (the software on GitHub using the repository on GitHub); and
- some other stuff, which we'll get back to in a moment, but which would potentially wreck files in the working tree.
If you already have some files, this method doesn't work, because the area you're using as a working tree isn't empty. To avoid wrecking the files that are there, git clone gives you that error you've just seen.
You have a bunch of options, with the two main ones being:
- Use git init to create a new, empty repository right now, then fill it in "by hand". This is described in the accepted answer at How do I clone into a non-empty directory? (as linked by phd in a comment).
- Clone into a different (new, or existing-but-empty) directory. You can then decide what, if anything, to do with the files that wind up in the working tree of that directory, and what to do with your own existing files.
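Here is a rough sketch of the first option, with the refused clone shown first for comparison. To keep it runnable offline, a local repository stands in for the one on GitHub; the paths, the file names, and the Example identity are all assumptions. The -b option to git init needs Git 2.28 or newer.

```shell
# Placeholder identity so the demo commit works on any machine.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

work=$(mktemp -d)

# Stand-in for the GitHub repository: a local repository with one commit.
git init -q -b main "$work/upstream"
echo '# project' > "$work/upstream/README.md"
git -C "$work/upstream" add README.md
git -C "$work/upstream" commit -q -m 'initial commit'

# Stand-in for ~/workspace/project: a directory that already holds a file.
mkdir "$work/project"
echo 'print("hello")' > "$work/project/hello.py"
cd "$work/project"

# Cloning into the non-empty directory is refused.
if git clone -q "$work/upstream" . 2>/dev/null; then
    clone_result=cloned
else
    clone_result=refused
fi
echo "clone into non-empty directory: $clone_result"

# Option 1: make the existing directory a repository, then fetch their commits.
git init -q -b main
git remote add origin "$work/upstream"
git fetch -q origin
git --no-pager branch -r           # shows origin/main
```

At this point the commits are in your repository and hello.py is still untouched. How to combine the two (for example, committing your files on top of origin/main) is the "by hand" part and depends on what you want the history to look like.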
In any case, remember that Git stores commits, not files. So your pick of the above two, or anything else you decide to do, should be based on this concept. My usual approach here when using GitHub is this:
I create a repository on GitHub first, having GitHub fill in a README and LICENSE and such as a prototype, then clone that, and then start writing code. This lets GitHub fill in an initial commit (what Git calls a root commit). Having a root commit is convenient, but not necessary.
Or, I create a repository on my own machine ("my laptop", I'll call it, even if it's not actually a laptop) and put commits in it (usually starting with just a README and maybe LICENSE and such as a prototype). Then, when I decide to put this onto GitHub, I'll have GitHub make a new empty repository: one that has no initial commit at all!
Why do it this way?
Let's talk very briefly here about commits. We already mentioned that every commit is numbered. It's also strictly read only: once you make a commit, you can never change anything about that commit. The magic hash IDs¹ that Git uses require this.
There are two other things you need to know about commits:
They store files, but they store full snapshots. That is, every commit holds a frozen-for-all-time copy of the entire source. This "holding" is indirect and very clever, in that the files in the snapshot are compressed and de-duplicated. So if a new commit mostly matches an old commit, it mostly takes no space for the files. Only all-new files—those that don't duplicate any previous file content at all—require new space.
They store some metadata, or information about the commit itself. The metadata include information such as the name and email address of the person who made the commit, for instance.
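The de-duplication of file contents mentioned above can be observed with git hash-object, which prints the ID Git would store a file's content under. This is only a sketch with made-up file names: identical content gets the same ID no matter what the file is called, so Git only needs to store it once.

```shell
cd "$(mktemp -d)"
printf 'same content\n' > a.txt
printf 'same content\n' > b.txt
printf 'different content\n' > c.txt

# git hash-object computes the object ID for a file's content
# (it does not store anything unless you add -w).
id_a=$(git hash-object a.txt)
id_b=$(git hash-object b.txt)
id_c=$(git hash-object c.txt)

echo "a.txt $id_a"
echo "b.txt $id_b"
echo "c.txt $id_c"
```

The first two lines show the same ID; the third differs.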
In the metadata, Git stores something that makes Git work: each commit stores a list of previous commit hash IDs. Most commits store exactly one hash ID here. We call this the parent of the commit. Since commits are frozen once made, a child knows who its parent is, but the parent has no idea what children it might have (they have not yet been made!).
These commits, the ones that store just one parent hash ID, are ordinary commits. Most commits are ordinary, and we can draw a string of them, with the latest one on the right, by using uppercase letters to stand in for hash IDs:
... <-F <-G <-H
Here H (for "hash") stands in for the actual last commit in the chain. It has a snapshot and some metadata, and in its metadata, commit H stores the raw hash ID of the previous commit G. But G is an ordinary commit too, so it stores a snapshot and metadata and points backwards to a still-earlier commit F, which has a snapshot and metadata and points backwards, and so on.
This means that, as long as we memorize the hash ID of the latest commit, we can give that to Git. Git can then work backwards from there to find all the earlier commits. If we call that a "branch" (there's an issue here, as there are multiple things that Git calls a "branch"), then this "branch" consists of all the snapshots from H on backwards to the very first snapshot.
A command like git log, which views commits, does so by starting at the end (commit H) and working backwards, one commit at a time. This shows you H, then G, then F, then whatever is earlier (E, obviously), and so on. Eventually, though, we hit the very first commit (A, presumably):
A--B--C--...--G--H
and we simply can't go any further back. Commit A is special: it's a root commit, i.e., it's that initial commit. Its list of previous commits, in its metadata, is empty. This lets Git stop going backwards.
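To see the backwards chain for yourself, here is a small sketch that makes three commits in a scratch repository and prints each one with its parent. The commit messages, the file name, and the Example identity are assumptions made for the demo.

```shell
# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q -b main                # -b needs Git 2.28 or newer

for name in A B C; do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "commit $name"
done

# One line per commit, newest first: abbreviated hash, parent hash(es), subject.
git --no-pager log --format='%h  parent: [%p]  %s'

count=$(git rev-list --count HEAD)            # commits reachable from the tip
root=$(git rev-list --max-parents=0 HEAD)     # the commit with no parent
root_parents=$(git log -1 --format=%p "$root")
echo "$count commits; root commit is $root"
```

The last line of the log output, "commit A", shows an empty parent list: that is where Git stops.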
¹ Hash IDs are "magic" because every Git repository in the universe agrees that that hash ID, whatever it is, means that commit as soon as any one commit exists and thus has a hash ID. They do this without ever talking to each other. This magic is mathematically impossible, and someday, Git will break. The sheer size of the hash ID puts this day far into the future: far enough, we hope, that we'll be long dead and gone and won't care. In practice, it works fine, although with SHA-1 nominally broken (see How does the newly found SHA-1 collision affect Git?), Git is moving to SHA-256.
Branch names find the last commit
A branch name, in Git, is simply a special kind of name (the "branch" kind of name) that holds one hash ID. Git stores these in that second database, the names database. If we have just the one branch named main or master (I'll use main here since that's the new GitHub default), and we have this collection of eight commits ending at H, then we have this:
...--G--H <-- main
That is, the name main stores the hash ID of commit H, the latest commit. We don't have to memorize it! We just tell Git to look up the name main, and Git finds the hash ID there and goes to commit H.
Git has a word for this kind of combination, where the name main points to commit H. Git says that commit H is the tip commit of branch main. All the other commits, going backwards from H the way Git does, are also "on" branch main, but H is the last one on main, so it's the tip.
If we were to make a new commit at this point, that new commit would get a new, totally-unique hash ID (see footnote 1 again). Git would set up this new commit (we'll call it I) to point backwards to H, as H was the commit we were using when we made I. And then Git would write I's new unique hash ID into the name main, and main would point to the new commit.
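A quick way to watch the name move is to ask Git which hash ID main holds before and after a commit. This is a sketch in a scratch repository; the file name, commit messages, and Example identity are assumptions.

```shell
# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q -b main                   # -b needs Git 2.28 or newer

echo one > file.txt
git add file.txt
git commit -q -m 'commit H'
tip_before=$(git rev-parse main)      # the hash ID stored in the name main

echo two >> file.txt
git commit -q -am 'commit I'
tip_after=$(git rev-parse main)       # main now stores the new commit's hash ID
parent_of_tip=$(git rev-parse main^)  # the new commit points back to the old tip

echo "main was $tip_before"
echo "main is  $tip_after (parent: $parent_of_tip)"
```

The parent printed on the last line matches the "main was" value: the new commit points back to the old tip, and the name now points to the new commit.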
But suppose that, instead, we make a second branch name now, such as feature? Now we have:
...--G--H <-- feature, main
Which branch are these commits on? Well, that's a trick question, because in Git, these commits are all suddenly on two branches now.² We now need a way to know which name we're using, even though both names select commit H. So we'll add this to our drawing:
...--G--H <-- feature, main (HEAD)
This means we are "on" branch main: if we run git status, Git will say On branch main. If we now run:
git switch feature # or git checkout feature
we'll still be using commit H, but we'll be "on" feature now, according to git status.
If we make our new commit I now, we get:
...--G--H   <-- main
         \
          I   <-- feature (HEAD)
Note that Git has stored the new commit's hash ID in the name feature, leaving main unchanged. If we now git switch main, we'll go back to commit H. Or, if we create two branches and then add two commits to each branch, we get something like this:
          I--J   <-- br1
         /
...--G--H   <-- main
         \
          K--L   <-- br2
Keep this in mind in the future, as you start to work with "branches" in Git: the branch names are just ways to find the commits. It's actually the commits that form the branching structure (or don't, at the beginning when all the names point to H). You check out a branch (or git switch to it) to select its tip commit. The commits up through H here are on all three branches. Branch names come and go: you're free to create or delete them at any time, in Git. It's the commits that matter (but you'll want a branch name to find commits, so that you don't have to use raw hash IDs).
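The feature example can be reproduced in a scratch repository. In this sketch the file name, commit messages, and Example identity are assumptions; the branch names match the drawings.

```shell
# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q -b main                # -b needs Git 2.28 or newer
echo one > file.txt
git add file.txt
git commit -q -m 'commit H'

git branch feature                 # a second name for the same commit H
git switch -q feature              # HEAD is now attached to feature
echo two >> file.txt
git commit -q -am 'commit I'       # only the current branch name moves

main_tip=$(git rev-parse main)
feature_tip=$(git rev-parse feature)
feature_parent=$(git rev-parse feature^)

git --no-pager log --oneline --graph --all
```

The graph shows feature one commit ahead of main, and main still pointing at "commit H".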
² Think about this: the branch (in one meaning) is the set of commits up through H. The branch is on two branches. Does that make sense? Whether it does or doesn't make sense to you, that's an example of how Git abuses the word branch.
Empty repositories are a bit weird
Let's attempt to draw an empty repository:
<-- main
That's actually wrong! The name main must point to some existing, valid commit. There aren't any commits. So the name can't exist either:
There's my best drawing of an empty repository: just a blank space. There are no commits so there cannot be any branch names.
This is what makes a new, empty repository weird. It's why GitHub likes to create an initial commit. Without an initial commit, you can't have any branches, and you don't. And yet, Git insists that you have to be "on" some branch, so you wind up on a branch that doesn't exist, which is also weird.
The weirdness shakes right out as soon as you make your first commit: the branch name springs into being, pointing to that new root commit:
A <-- main (HEAD)
and now all is fine.
As long as you understand that a truly empty repository is a little bit weird like this (and that git clone complains when you clone one of these empty repositories), you'll be fine with empty repositories. You just have to remember they're weird, and that's why GitHub likes to make an initial commit.
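The weirdness is easy to observe in a scratch repository. This sketch (file name, commit message, and Example identity are assumptions) checks for branches and commits before and after the first commit.

```shell
# Placeholder identity so the demo commit works on any machine.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q -b main                       # -b needs Git 2.28 or newer

on_branch=$(git symbolic-ref --short HEAD)   # Git says we are "on" main...
branches_before=$(git branch --list)         # ...yet no branch names exist
if git rev-parse -q --verify HEAD >/dev/null; then
    has_commit=yes
else
    has_commit=no
fi
echo "on: $on_branch, branches: [$branches_before], commit: $has_commit"

echo hello > README.md
git add README.md
git commit -q -m 'root commit'

branches_after=$(git branch --list)          # the name main now exists
echo "branches: [$branches_after]"
```

Before the commit the branch list is empty even though HEAD names main; afterwards the list shows "* main".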
Cloning (again)
Let's look at the act of cloning again, and finish out the steps it takes. The git clone command is essentially a sort of convenience wrapper that runs up to six or so other commands, with the first one being the "make new empty directory". (This first step is skipped if you point git clone to an existing empty directory.) So the six commands are:
1. mkdir (or your OS's equivalent): make the new empty directory. Run the rest of the commands in that directory.
2. git init: this makes a new, totally empty repository, using the empty directory as the working tree.
3. git remote add origin url: this saves the URL you pass to git clone, so that you won't have to type it in every time. The name origin here is the conventional name: you can override it with an option, but I'll assume you didn't.
4. Any necessary git config or other operations go here. For a simple git clone there's nothing here, but I like to enumerate it as a place commands can get run.
5. git fetch origin: this is the step that reaches out to the saved URL, at which there must be Git software that connects to a Git repository. You get all of their commits, and then your Git software takes each of their branch names, and changes those into a remote-tracking name.
6. Last, your own Git will create one branch name and check out that particular commit. (This step fails when cloning an empty repository, and you get a warning.)
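The six steps can be approximated by hand. The sketch below is not what git clone literally executes; it is a rough equivalent under some assumptions: a local repository stands in for the one on GitHub, and the paths, file name, commit messages, and Example identity are made up for the demo.

```shell
# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

work=$(mktemp -d)

# Stand-in for "their" repository, with two branches: main and feature.
git init -q -b main "$work/upstream"      # -b needs Git 2.28 or newer
cd "$work/upstream"
echo one > file.txt
git add file.txt
git commit -q -m 'commit H'
git switch -q -c feature
echo two >> file.txt
git commit -q -am 'commit I'
git switch -q main

# Step 1: make the new empty directory and work inside it.
mkdir "$work/clone"
cd "$work/clone"
# Step 2: create the empty repository.
git init -q
# Step 3: save the URL (here a local path) under the name origin.
git remote add origin "$work/upstream"
# Step 4: nothing extra to configure for a simple clone.
# Step 5: get their commits; their branch names become remote-tracking names.
git fetch -q origin
git --no-pager branch -r                  # origin/feature and origin/main
# Step 6: create one branch name from a remote-tracking name and check it out.
git switch -q -c main origin/main
```

After step 5 there are commits and remote-tracking names but no branch names; only step 6 creates main.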
Step 5 has an oddity: you don't get branch names from their branch names, you get remote-tracking names. A remote-tracking name (which Git calls a "remote-tracking branch name", beating up the poor word "branch" some more) is just your own software's way of remembering the other repository's branch name: your origin/main corresponds to their main, your origin/feature corresponds to their feature, and so on.
All these remote-tracking names go into your new names database, so (assuming the repository you're cloning is not empty) you will now have all their commits and some names, but you have no branch names. You have no branches, except for the kind of branches we mean when we're talking about commits instead of branch names. If you're not confused yet—this is what I mean about the word branch being terrible in Git—we now get to step 6.
The branch name that your Git creates here is the one you select with the -b option to git clone. If you don't give -b to git clone, your Git software asks their Git software which branch name they recommend, and then uses that name. If you're using GitHub (and own the repository there), you can set the recommended name via GitHub's web pages: GitHub and Git call this the "default branch". Most hosting sites have a way to do this (although Google Git hosting doesn't, which is a problem these days).
To create the new branch name, your Git looks at your remote-tracking names. Let's say they have a main, which your Git renamed to origin/main, and that they recommend their main and you didn't say -b. Then your Git software reads out your origin/main, which is the same as their main, to get the commit hash ID. Your Git software creates one new branch name, main, and points it to the same commit. So now you have:
...--G--H   <-- main (HEAD), origin/main
         \
          I--J   <-- origin/feature
for instance.
If you want to have your own feature, you can now git switch feature, and your Git will create a new branch name feature that points to commit J, using your origin/feature that corresponds to their feature.
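Here is a sketch of that behaviour, again with a local repository playing the part of the one on GitHub. The paths, file name, commit messages, and Example identity are assumptions; for brevity their feature is only one commit ahead of main.

```shell
# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

work=$(mktemp -d)

# Stand-in for "their" repository, with branches main and feature.
git init -q -b main "$work/upstream"      # -b needs Git 2.28 or newer
cd "$work/upstream"
echo one > file.txt
git add file.txt
git commit -q -m 'commit H'
git switch -q -c feature
echo two >> file.txt
git commit -q -am 'commit on feature'
git switch -q main

git clone -q "$work/upstream" "$work/clone"
cd "$work/clone"
git --no-pager branch --list              # only main exists so far

# No local feature exists, but origin/feature does, so Git creates
# feature from it and sets origin/feature as its upstream.
git switch -q feature
upstream_of_feature=$(git rev-parse --abbrev-ref 'feature@{upstream}')
git --no-pager branch -vv
```

The new feature name is yours: it starts at the commit origin/feature names, and from then on it moves only when you add commits or move it yourself.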
While you and they may have some or all branch names the same, after you do some branch-name-creating, your branch names are yours. Branch names, in Git, move. The commits don't move—they can't; they're read-only!—but we add new commits and when we do that, the current branch name moves. So our names will move around to reflect new commits we add, in our repository.
No other repository, including the one on GitHub, has our new commits yet! So our branch names are the only way that anyone has to find these commits. We're the only one with these new commits, and we find them with our branch names. Our branch names had better not move back, or we won't be able to find the commits (unless you've memorized their hash IDs). So our Git doesn't move our branch names just because theirs have moved. That's why our branch names are ours.
Because commits are read-only and have unique numbers, it's safe for repositories to share them: we can send our new commits to them, and/or they can send any new commits they've made to us. We and they have the same commit if and only if we and they have commits that have the same numbers. All Gits agree that commit 9bf691b78cf906751e65d65ba0c6ffdcd9a5a12c is commit 9bf691b78cf906751e65d65ba0c6ffdcd9a5a12c; no other commit anywhere in the universe is 9bf691b78cf906751e65d65ba0c6ffdcd9a5a12c; so either we have the same 9bf691b78cf906751e65d65ba0c6ffdcd9a5a12c, or one or both of us don't have 9bf691b78cf906751e65d65ba0c6ffdcd9a5a12c at all.
In general, we use git fetch to get commits from them, and git push to send commits to them. But we'll stop here and make some last notes about working trees.
Your files in your working tree
When you check out a commit (with git checkout or git switch), Git will fill in your working tree with files from that commit. The reason why is really simple and obvious: commits are read-only and frozen for all time.
The commit contains the files, but the files in the commit are completely unusable to anything except Git: they're stored in a weird format that most software can't read,³ and that nothing, not even Git itself, can change. They're only good as an archive, like a tarball or zip file or WinRAR or something. So Git extracts those files. Your working tree is, initially, empty: Git can just place those files there.
Having extracted files from some commit, if you'd like to switch to some other commit, Git can just remove those files from your working tree, and replace them with files from the other commit. The files came out of a commit, and you didn't change them, so it's safe to destroy them.
Once you start working on your files, though, the picture changes drastically. It's no longer safe to just remove-and-replace files. I won't go into all the details of how Git keeps track of the files here, except to mention that it involves something for which Git has three names: the index, the staging area, or (rarely seen now except as --cached flags) the cache. When Git extracts a commit snapshot, it puts the full snapshot into its index / staging-area, as well as copying files into your working tree. You work on the working tree copy, which is an ordinary file. You must then use git add to tell Git: update the index / staging-area copy. This extra copy is in the frozen format (compressed and de-duplicated, in other words) but isn't actually frozen. The git add command does the compressing and the checking-for-duplicates.
The eventual "make a new commit" git commit command takes whatever is in the index at the time you run git commit, and freezes that. So your git add operations update your proposed next snapshot. You start out with a proposed snapshot that matches the current commit. You then change the working tree version (the one you can see and play with) and git add the files to update the proposed new snapshot.
It's when you're ready to turn the proposed snapshot into a real one (a new commit) that you run git commit. In between, use git status (and perhaps git diff and git diff --staged) a lot, to view the difference between what's in the current commit, the index / staging-area, and your working tree.
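The edit, add, commit cycle looks like this in a scratch repository. The file name notes.txt, the commit messages, and the Example identity are assumptions made for the sketch.

```shell
# Placeholder identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q -b main                          # -b needs Git 2.28 or newer
echo 'version 1' > notes.txt
git add notes.txt
git commit -q -m 'first snapshot'

echo 'version 2' > notes.txt                 # change the working-tree copy only
git status --short                           # notes.txt is modified, not staged
unstaged=$(git diff --name-only)             # working tree vs. index

git add notes.txt                            # update the proposed next snapshot
staged=$(git diff --staged --name-only)      # index vs. current commit
left_unstaged=$(git diff --name-only)        # nothing left: index matches the file

git commit -q -m 'second snapshot'           # freeze what is in the index
total=$(git rev-list --count HEAD)
echo "commits so far: $total"
```

Running git diff before the add, and git diff --staged after it, shows the same change from two vantage points: first as a difference between your file and the index, then as a difference between the index and the current commit.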
³ File contents are stored either in "loose objects", which aren't all that difficult to read, or in "packed objects", which are. Using a zlib library you can easily read a loose object, peel off the Git header, and get the data out that way. But packed objects are much more complicated. It's better to just let Git manage this.
CodePudding user response:
I created a private repository on github.com.
And at that point, GitHub told you exactly what to do next, under the heading …or create a new repository on the command line. But you didn't do it.
When I tried to git clone https://github.com/username/project in ~/workspace/project
But that's not what GitHub said to do. It said you should
echo "# temp" >> README.md
git init
git add README.md
git commit -m "first commit"
git branch -M main
git remote add origin <origin-url>
git push -u origin main
Your best bet at this point, aside from learning the rudiments of Git, is to delete the private GitHub repo, create a new GitHub repo, and this time look at what GitHub tells you to do next. It gives very precise directions, if only you would care to pay a little attention to them.