How to publish a new git branch with just one squashed initial commit?

Time:08-30

So imagine a scenario where you've been working on just one branch (you are new to git), and you made some weird commits and thought the repo was just for you. Then the time comes when you actually want other people to see it, but you don't want the world to see your silly dev commits (also, there may have been .env files committed, gitignored too late, etc.). Instead of editing history, how about a shiny new squash merge and creating a brand new branch? I can only do it by deleting the repo and initializing a new one (something I've been doing a lot out of frustration, maybe I just don't get it). So I want it to seem like by the time the initial commit was created, I already had a working script. How can I reproduce this while also keeping a dev branch? I don't have any other branch, and even the initial commit is crap. It's not GitHub, "everything is local."

I have GitKraken, but even with the UI, I have no luck. I already renamed master to dev. The result would be: lots of silly commits on dev that no one sees. Squash all that and publish to a new branch with no history, and only that will get pushed to GH.

CodePudding user response:

There are two easy ways

Given your setup (one repository with many commits but just one branch name main or master or whatever), here are the two easy ways to get a single commit containing your desired new commit:

  1. git branch save && git reset --soft $(git rev-list --max-parents=0 HEAD) && git commit --amend

  2. git checkout --orphan new && git commit

The difference between these two is that with method 1, the current branch name (main or master or whatever) will now select your new root commit; with method 2, the new branch name (new above) will select your new root commit.

(If all you want are recipes, you can stop here, if you like. But do read on!)

Why this seems hard

The very first commit ever, in any shiny new empty repository, is kind of special. To understand why—and hence what the above two magic commands actually do—we should start at the beginning:

  • A Git repository is really all about the commits. It's not about files, although each commit holds a full snapshot of every file, and it's not about branches (branch names), although we (and Git) use the branch names to help us find the commits. It's really all about the commits.

  • Each commit:

    • Is numbered, with a big ugly random-looking hash ID (more formally object ID or OID). Git stores the commits, along with other supporting objects, in a big key-value database where the keys are the hash IDs. This means Git needs the hash to retrieve the object (the commit). The numbering scheme is deeply magic.

    • Is read-only. In fact, no part of any object in the objects database may ever be changed as the number (by which Git retrieves the object) is literally a checksum of the contents. To make this work well, it's a cryptographic checksum that works like a digital signature. If you take an object (commit or other object) out, modify it, and put the new data back in, you get a new, different object with a new, different number, unless you exactly repeat some existing object. Commits have stuff in them that doesn't repeat, so they always get unique numbers (er, well, as long as we ignore the pigeons in the pigeonhole principle...).

    • Stores that full snapshot of every file. Each file's content (the bytes making up the data within the file) is turned into an object in that same database, so contents are automatically de-duplicated: repeat occurrences re-use the existing object. The objects also get compressed (initially, zlib compressed only; later, further compression happens), so they really don't take much space after all.

    • Stores some metadata: information about the commit itself. This includes, e.g., the name and email address of the person who made the commit.

Git writes the metadata itself: you supply parts of it, like a log message, but the metadata are in a Git-ized format. In the metadata of each commit, Git includes a list of previous commit hash IDs. This list is usually exactly one entry long, producing what Git calls an ordinary commit.
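To see this metadata for yourself, here is a minimal sketch that builds a throwaway repository and prints a raw commit object. The file name file.txt, the identity values, and the commit messages are made up for illustration.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the unborn branch main on any Git version
git config user.name "Example User"     # hypothetical identity for the demo
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two > file.txt
git commit -q -am "second"

# Raw commit object: a tree line, a parent line, author/committer lines, then the message.
git cat-file -p HEAD

kind=$(git cat-file -t HEAD)
recorded_parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
actual_parent=$(git rev-parse HEAD~1)
# The very first commit has no parent line at all.
root_parent_lines=$(git cat-file -p HEAD~1 | grep -c '^parent ' || true)
```

The single parent line in the second commit is what makes it an ordinary commit; the first commit has none.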

If we have exactly one branch name and we've only ever done totally linear work (we've never branched-and-merged), then we have a really simple repository, which is easy to draw. Let's say our branch is named main. You can use any name you like as long as it meets the constraints described in the git check-ref-format documentation, but this is one of the new standard names, so we'll use it here.

Your branch has a latest commit, which Git calls the tip commit. The branch name actually contains the hash ID of this tip commit, so that Git can find that commit quickly in its big objects database. (All the names, including branch names, tag names, and all other sorts of names, are stored in a secondary key-value database, with the names as the keys; each gets one hash ID as its value.) We say that the branch name points to the commit. Let's call this commit's hash ID H (for Hash), and draw the name main pointing to H:

            <-H   <-- main

What's this arrow sticking out of H? It represents the metadata in commit H, which stores the hash ID of H's parent commit. Storing the hash ID of a commit makes something point to the commit. Let's call that parent commit G, and add it to our drawing:

        <-G <-H   <-- main

But like H, G is an ordinary commit, so it points back to yet another, still-earlier commit:

... <-F <-G <-H   <-- main

and this goes on forev—well, no, not forever! It stops at some point:

A--B--C--D--E--F--G--H   <-- main

(assuming there are eight total commits). Commit A is the very first commit, and it doesn't have a left-pointing arrow sticking out of it. This makes it special. It is a root commit, in Git jargon.

Note that I've also gotten lazy about drawing the arrows between commits. That's in part because no part of any commit can ever change. It's only the arrow coming out of a branch name that can change. The names are stored separately, in that names database, with their values being hash IDs, and we can replace the stored hash ID at any time.
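As a rough illustration of "names move, commits don't", the sketch below re-points a branch name and checks that the set of commits is untouched. The branch name marker and the file name are hypothetical.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"     # hypothetical identity for the demo
git config user.email "user@example.com"

for n in 1 2 3; do
  echo "$n" > file.txt
  git add file.txt
  git commit -q -m "commit $n"
done

before=$(git rev-list main)             # every commit hash reachable from main
git branch marker HEAD~1                # a new name pointing at the second commit
first_target=$(git rev-parse marker)
git branch -f marker HEAD~2             # re-point the same name at the root commit
second_target=$(git rev-parse marker)
after=$(git rev-list main)              # the commits themselves are unchanged
root=$(git rev-list --max-parents=0 main)
```

Only the stored hash ID under the name marker changed; every commit kept its hash.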

Making new commits from existing commits

When we're using an existing Git repository, it has some existing commits, one of which is the latest on some branch. Git attaches a special name HEAD to the one branch name you're actually using (though if you have only one name, that's going to be the one name you're using), so I'll draw that in:

...--G--H   <-- main (HEAD)

If and when we choose to make a new commit—without worrying about how we update the files that are to go into the new commit; we'll come back to this—Git will create that commit by:

  • freezing a snapshot into a permanent archive, so that we can get it back;
  • wrapping that snapshot with commit metadata, including a parent; and
  • writing all of that stuff into the objects database.

This produces a new, random-looking (but not actually random), unique hash ID; we'll just call it I. Commit I stores commit H's raw hash ID in commit I's metadata, so that I points back to H. And then Git simply writes I's hash ID, whatever it turned out to be—we have no idea what it will be, until we have it, as one of the inputs to this is the exact second at which we make the commit—into the current branch name as represented by the name HEAD:

...--G--H--I   <-- main (HEAD)

We now have a new latest commit, and main automatically selects that latest commit. Git can use the latest commit to work backwards to find H, which Git can use to work backwards to find G, and so on, all the way back to A.

So we make a new ordinary commit and it has a single parent. That's the very definition of an ordinary commit. The git commit command does that for us, using the current commit as the parent of the new commit.

Note that it works the same way even if we have more than one branch name. Suppose that main points to H, as it did before we made I:

...--G--H   <-- main (HEAD)

Suppose we now create a second branch name, also pointing to H:

...--G--H   <-- develop, main (HEAD)

If we now switch to the name develop, we get:

...--G--H   <-- develop (HEAD), main

We are still using commit H. We're just using it via the name develop now. When we make our new commit I, we get:

...--G--H   <-- main
         \
          I   <-- develop (HEAD)
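The picture above can be reproduced in a scratch repository. This is only a sketch; the file names and messages are invented, and git checkout is used so that it also runs on older Git versions.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"     # hypothetical identity for the demo
git config user.email "user@example.com"

echo g > g.txt && git add g.txt && git commit -q -m G
echo h > h.txt && git add h.txt && git commit -q -m H
h=$(git rev-parse HEAD)

git branch develop                      # second name, also pointing to H
git checkout -q develop                 # HEAD is now attached to develop
echo i > i.txt && git add i.txt && git commit -q -m I

main_tip=$(git rev-parse main)          # still H: main did not move
develop_parent=$(git rev-parse develop~1)
git log --oneline --graph --all
```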

If we switch back to name main now, Git removes the commit-I files (which are archived forever in commit I) and puts back the commit-H files (which are archived in commit H). We can, if we want, create another branch name for yet more commits, or make commits that extend main. As we do more work with different branch names, we build up a commit graph that has obvious branches in it:

          I--J   <-- br1
         /
...--G--H
         \
          K--L   <-- br2

We can do all this with ordinary Git commands: branches grow just by adding commits, and if we grow different branches that start out sharing commit H and all earlier commits, those branches grow in "different directions", as in this case.

Note that commits up through H are on all the branches here. If there's a name main pointing to H, commit H is on three branches. Add another name pointing to H and now commits up through H are on four branches. Nothing changes in the commits when we do this. The branch names don't really matter! The point of the branch name is to find the last commit. Git uses the last one to work backwards, and if we remove all the names, so that we can't find the last commit somehow, we "lose" that commit.
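One way to check the "on three branches, then four" claim is git branch --contains. A small sketch, with the extra branch name being arbitrary:

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"     # hypothetical identity for the demo
git config user.email "user@example.com"

echo base > base.txt && git add base.txt && git commit -q -m H
h=$(git rev-parse HEAD)

git branch br1
git branch br2
on_three=$(git branch --contains "$h" | wc -l | tr -d ' ')   # main, br1, br2

git branch extra "$h"                   # one more name for the same commit
on_four=$(git branch --contains "$h" | wc -l | tr -d ' ')
h_after=$(git rev-parse HEAD)           # the commit itself did not change
```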

(Git may eventually remove totally-unused commits, and in fact, Git relies on this and generates "junk" or "trash" commits and other objects at various times, whenever it seems convenient. This is normally quite invisible and you don't have to care.)

History, in Git, is nothing but commits. We find history by using the branch names (and other names such as tags or remote-tracking names) to find "last" commits, then working backwards as much as we want to.

Merge commits are a bit special

Though this has nothing to do with the answer to your question, it's worth mentioning. Git makes a merge commit to tie two histories together. Consider this structure again:

          I--J
         /
...--G--H
         \
          K--L

Git can merge commits J and L to produce:

          I--J
         /    \
...--G--H      M
         \    /
          K--L

Commit M is a merge commit, which in Git is defined as any commit with two or more parents. Here, commit M points backwards to both J and L. "Two" is the usual number—three or more parents makes an "octopus merge" and this doesn't do anything you could not do with regular merges—and since we're not getting into merging here, I'll stop there except to note one more thing.
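Here is a sketch that builds the fork drawn above and merges it, then reads back both parents of M. It uses one commit per side for brevity, and the file names are hypothetical.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"     # hypothetical identity for the demo
git config user.email "user@example.com"

echo base > base.txt && git add base.txt && git commit -q -m H

git checkout -q -b br1
echo j > j.txt && git add j.txt && git commit -q -m J
j=$(git rev-parse HEAD)

git checkout -q -b br2 main             # start br2 at H, not at J
echo l > l.txt && git add l.txt && git commit -q -m L
l=$(git rev-parse HEAD)

git checkout -q br1
git merge -q --no-edit br2              # histories diverged, so this makes merge commit M

first_parent=$(git rev-parse HEAD^1)    # the commit we were on (J)
second_parent=$(git rev-parse HEAD^2)   # the commit we merged in (L)
git log --oneline --graph
```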

In the "before" picture, where there was no merge commit M, we needed two branch names, one to find J and one to find L. In the "after" picture, we can get by with just one branch name, to find M (or any new ordinary commits we add after M, as long as they don't fork off into more branch-y structures). This is why, after merging like this, we can delete one of the two branch names.

More about making commits: Git's index and your working tree

You'll note that, at this point, all the commits we've made—whether with git commit or git merge—have been normal or merge commits, except that first one in the new, totally-empty repository. An ordinary commit has the current commit—as found by HEAD until Git updates the branch name—as its parent, and a merge commit has the current commit as its first parent, and the tip of the merged-commit—e.g., commit L above, perhaps, assuming we were using J when we merged—as its other parent.

When Git does make a new commit, Git does so from the files that Git has in what Git calls its index, or the staging area, or (rarely these days) the cache. These three names all refer to the same thing.

When you check out a commit, with git checkout or git switch, Git fills in its index from the commit you've picked to switch-to. The index therefore holds all the files from the snapshot in the commit. Git also copies these files to an area where you can see and work on / with them. Remember how I emphasized earlier that all parts of every commit are read-only, and are compressed (Git-ized) and de-duplicated as well. That means the files in the commit are completely unusable for any purpose except to be un-archived like this, or more generally, used internally by Git for reading.

Most version control systems face this same kind of problem (immutable stored files) and have the same solution (extract those files). But most of them just extract the files to your working tree, so that you can see them and work on / with them, and that's it. This produces two copies of each file: the permanent archived one in the commit, and the usable one.

Git, however, stores three copies. There are those two, but in between, Git sticks a third copy (or "copy", because it's pre-de-duplicated) into what Git calls that index or staging area. The key difference between the index copy and the committed copy is that you can have Git replace the staging-area copy.

This is what git add does. When you git add an existing file, Git reads the file, compresses it, Git-izes it, and checks for duplicates. If there's a duplicate, Git uses the existing object (uses the old hash ID). If not, Git readies the contents for committing (they get a new, unique hash ID). Either way, the file is now ready to be committed—or re-used, if it's a duplicate—and Git updates the index slot that holds that file.

So, before git add, the entire set of files was ready to be committed. After git add, the entire set of files is still ready to be committed. You have merely changed one of those files (changed its contents, really).

If you git add a new file name, Git compresses its contents just as for an existing file, and checks for duplicate contents the same way as usual and either re-uses an old hash or gets a new one. Then Git writes a new index slot for the new file name, and now the staging area is ready to go into a new commit. So it was ready before git add, and it's still ready after git add: it's just acquired one new file name.
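The hashing and de-duplication that git add performs can be observed directly. In this sketch (file names invented), git hash-object computes the ID the content would get, and git rev-parse :path reads the ID stored in the index slot:

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q

echo "print('hi')" > foo.py
expected=$(git hash-object foo.py)      # hash only; nothing is stored yet
git add foo.py
indexed=$(git rev-parse :foo.py)        # hash ID recorded in the index slot for foo.py

cp foo.py copy.py                       # new name, identical bytes
git add copy.py
duplicate=$(git rev-parse :copy.py)     # re-uses the existing object

git ls-files --stage                    # one line per index slot: mode, hash, stage, name
```

Both index slots carry the same hash ID, which is the de-duplication described above.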

In other words, at all times, the index holds the next commit, ready to go. This skips over the problem of merge conflicts (in which case the staging area holds files that aren't ready to be committed: they're in the right form, but they're marked "conflicted", in ways we won't cover here). So thinking of the index / staging-area as "my proposed next snapshot" is not quite perfect, but it's good enough to get things done, and if git commit tells you you have unmerged files, that's a reminder that your job is to fix up the index / staging-area now.

Because the index is separate from your working tree, it's possible to have three different copies of some "active" file. For instance:

git switch somebranch
vim foo.py        # make some changes and write them out
git add foo.py    # update index copy of foo.py
vim foo.py        # make more changes and write them out

Now the committed (HEAD) foo.py still has the original content, the index / staging-area foo.py has the middle version, and the working tree foo.py has the most recent version. If you git add again, the index copy starts matching the working tree copy again.
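The three copies can be read back separately. A minimal sketch, replacing the editor sessions with echo and using made-up contents v1, v2, and v3:

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"     # hypothetical identity for the demo
git config user.email "user@example.com"

echo v1 > foo.py && git add foo.py && git commit -q -m "add foo.py"
echo v2 > foo.py
git add foo.py                          # index copy is now v2
echo v3 > foo.py                        # working tree copy is now v3

committed=$(git show HEAD:foo.py)       # copy frozen in the current commit
staged=$(git show :foo.py)              # copy in the index / staging area
working=$(cat foo.py)                   # copy in the working tree
echo "$committed $staged $working"
```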

Usually either all three copies match, or two copies match and one is different. You git add files to make the index copy match the working tree copy, so that now the two that match are index and working tree. You run git commit to make a commit from the index copy.

git commit --amend: a useful lie

Method 1, way at the top of this answer, uses git commit --amend. The --amend flag sounds like Git is fixing up a commit. This is a lie, but it's a useful lie. Let's suppose we have:

...--G--H   <-- main (HEAD)

and we work on and make a new commit:

...--G--H--I   <-- main (HEAD)

but then we notice a typo in the commit message. The snapshot—the files that were in the index, and still are in the index—is fine, but we want to fix our typo, so we run:

git commit --edit --amend

(there are --edit and --no-edit options as well as --amend for this case) and fix our typo, write out the updated commit message, and let Git get to it.

Commit I literally cannot be changed. But Git can "boot it off the end" of the branch and make a new, improved I', like this:

          I   [abandoned]
         /
...--G--H--I'  <-- main

Since there's no name by which to find commit I, we won't see it any more. If we don't pay close attention to hash IDs—and who does that?—it will look like Git changed commit I.
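A quick way to convince yourself that --amend replaces rather than edits: record the hash before and after. This sketch passes the message with -m instead of opening an editor; the messages and file names are invented.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"     # hypothetical identity for the demo
git config user.email "user@example.com"

echo h > h.txt && git add h.txt && git commit -q -m H
echo i > i.txt && git add i.txt && git commit -q -m "Add i.txt (tpyo)"

old=$(git rev-parse HEAD)               # commit I
parent_before=$(git rev-parse HEAD~1)   # commit H

git commit -q --amend -m "Add i.txt (typo fixed)"

new=$(git rev-parse HEAD)               # commit I'
parent_after=$(git rev-parse HEAD~1)    # still H
old_kind=$(git cat-file -t "$old")      # the abandoned I is still in the database for now
```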

The --amend option works by giving the new commit the same parent(s) as the current commit. This lets you "amend" a merge commit (which I won't draw for space reasons, plus it's hard to draw well). But it also means that if you have:

A--B--C--D--E--F--G--H   <-- main

and make a new name new point directly to commit A (and switch to it):

A   <-- new (HEAD)
 \
  B--C--D--E--F--G--H   <-- main

and run git commit --amend, our new commit will have the same parents that A has. Since A has no parent, our new A' will have no parent too:

A'  <-- new (HEAD)

A--B--C--D--E--F--G--H   <-- main

This is the trick that we'll use in method 1.

Now, the problem with just checking out commit A directly is that Git would rip out all our working tree and index copies of all files, and replace them with the files from commit A. So instead of using git checkout, we'll use git reset --soft.

git reset

The git reset command is very big and complicated. It has a couple of major modes though, covered by the --hard, --mixed, and --soft options, along with its many other modes that we won't cover here for space reasons. When used in these major modes, git reset does up to three things:

  1. It moves the current branch name to point to some commit. You pick any existing commit, and the name now points there.

    If you used --soft, git reset stops here. Otherwise it goes on to step 2.

  2. It resets Git's index, making it match the commit you selected in step 1.

    If you used --mixed (or no flag), git reset stops here. Otherwise (--hard), it goes on to step 3.

  3. It resets your working tree in the same way it reset Git's index, so that the files you have checked out are those from the commit you reset to.

Note that when Git wipes out index and/or working tree copies of files, those may be the only copies of those files. This is what makes a hard reset so particularly dangerous. However, sometimes that's just what we want: git reset --hard HEAD moves the current branch name, in step 1, to the selected commit, which is ... the current commit. So step 1 happens but the branch name continues to select the same commit as before. Then steps 2 and 3 wipe out the work we did, which is what we want from git reset --hard.
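The three stopping points can be compared side by side. This is a sketch on a two-commit scratch repository (file name and contents invented), restoring the original state with a hard reset between experiments:

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"     # hypothetical identity for the demo
git config user.email "user@example.com"

echo one > f.txt && git add f.txt && git commit -q -m one
echo two > f.txt && git commit -q -am two
tip=$(git rev-parse HEAD)
older=$(git rev-parse HEAD~1)

git reset -q --soft HEAD~1              # step 1 only: move the branch name
soft_head=$(git rev-parse HEAD)
soft_index=$(git show :f.txt)
soft_work=$(cat f.txt)
git reset -q --hard "$tip"              # put everything back

git reset -q --mixed HEAD~1             # steps 1 and 2: also reset the index
mixed_index=$(git show :f.txt)
mixed_work=$(cat f.txt)
git reset -q --hard "$tip"

git reset -q --hard HEAD~1              # steps 1, 2 and 3: also reset the working tree
hard_index=$(git show :f.txt)
hard_work=$(cat f.txt)
```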

In our case, though, we want git reset --soft. This will move the current branch name. Since the current branch name in our setup is main, we:

  1. create a new branch name, save, to remember commit H: git branch save;
  2. use git reset --soft to reset to the root commit: git rev-list finds that;
  3. git commit --amend makes our new A' commit.

We end up with:

A'  <-- main (HEAD)

A--B--C--D--E--F--G--H   <-- save

which is what method 1 does.
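Here is the whole of method 1 run end to end on a scratch repository. The messy history (a script.sh plus a .env that gets removed later) is invented to resemble the situation in the question, and the amended message is passed with -m so no editor opens.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"     # hypothetical identity for the demo
git config user.email "user@example.com"

# Hypothetical messy history: a secret file committed early, removed later.
echo "echo v1" > script.sh
echo "SECRET=example" > .env
git add script.sh .env
git commit -q -m "wip"
git rm -q .env
git commit -q -m "oops, remove .env"
echo "echo v2" > script.sh
git commit -q -am "finally works"

# Method 1.
git branch save                                          # remember the old tip
git reset --soft "$(git rev-list --max-parents=0 HEAD)"  # move main to the root commit
git commit -q --amend -m "Initial commit"                # new parentless commit from the index

main_count=$(git rev-list --count main)
save_count=$(git rev-list --count save)
main_tree=$(git rev-parse 'main^{tree}')
save_tree=$(git rev-parse 'save^{tree}')
main_files=$(git ls-tree -r --name-only main)
```

The new main has one commit whose snapshot equals the old tip's snapshot. The old commits, including the one holding .env, are still reachable from save, so only the clean branch should be pushed.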

The git rev-list command is a way of finding commit hash IDs. It walks through history the same way git log does—one commit at a time, backwards—and prints out selected commits; here, we have it print only those commits that have no parents. Only the root commit has no parents, so this prints out the hash ID of commit A.
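As a small check of what --max-parents=0 selects, this sketch compares it against the oldest commit found by walking the history in reverse. It assumes a single-rooted, linear history like the one in this answer; a repository that merged unrelated histories can have several roots, and the command would then print several hash IDs.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"     # hypothetical identity for the demo
git config user.email "user@example.com"

for n in A B C; do
  echo "$n" > file.txt
  git add file.txt
  git commit -q -m "$n"
done

root=$(git rev-list --max-parents=0 HEAD)           # commits with no parents
oldest=$(git rev-list --reverse HEAD | head -n 1)   # first commit in oldest-first order
root_subject=$(git log -1 --format=%s "$root")
```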

New empty repositories, and the --orphan flags

Suppose we run:

mkdir new-repo && cd new-repo && git init --initial-branch=main

This makes a shiny new repository: basically, two empty databases. There are no commits, and there are no branch or tag or other names at all.

If you run git status in this new empty repository, you'll see something odd. You are "on" branch main or master or whatever you choose for your initial branch name. And yet, git branch won't list any branch names, and git log can't show you any commits. There literally are no commits as the objects database is empty (except, perhaps, for the empty tree and any other tricks Git might have up its sleeves, if Git has sleeves).

A branch name in Git is required to point to some commit. You can't have the branch name if you don't have the commit for it. This is just a Rule of Git. Since there are no commits, there must be no branches. Yet you're still "on" some branch. Git does this by attaching the name HEAD to the branch name, even though the branch name doesn't exist. (Concretely, Git writes the branch name into the file .git/HEAD—but don't count on this, as added working trees are different.)

The way Git handles all this is to call this situation an orphan branch or an unborn branch. (Different bits of the Git source use the two terms inconsistently, much as they do with index and staging area.) When you make a commit while you are "on" an unborn branch, this makes a new root commit. So:

<create some files>
<add the files>
git commit -m initial

creates commit A, the root commit, and creates the current branch name at the same time, and now you have one commit and one branch name and we're out of this squirrelly mode where Git acts kind of weird.
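The unborn state, and the way the first commit ends it, can be probed with plumbing commands. A sketch, with script.sh as a stand-in for "some files":

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # same effect as --initial-branch=main, on any Git version
git config user.name "Example User"     # hypothetical identity for the demo
git config user.email "user@example.com"

head_target=$(git symbolic-ref HEAD)                              # HEAD names a branch...
branches_before=$(git for-each-ref refs/heads | wc -l | tr -d ' ') # ...that does not exist yet
if git rev-parse -q --verify HEAD >/dev/null; then born=yes; else born=no; fi

echo "echo hello" > script.sh
git add script.sh
git commit -q -m initial                # creates the root commit and the branch name together

branches_after=$(git for-each-ref refs/heads | wc -l | tr -d ' ')
# rev-list --parents prints the commit followed by its parents; a root commit yields one word.
words=$(git rev-list --parents -n 1 HEAD | wc -w | tr -d ' ')
```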

But Git offers the ability to go back into this mode, using git checkout --orphan or git switch --orphan. This is what we use for method 2.

If we want to use this mode, we want to use git checkout --orphan here, and there's a reason for that. Although git checkout and git switch mostly do the same thing, most of the time, for cases like these, the --orphan flag is very different in the two:

  • git checkout --orphan newname leaves the index and working tree alone, putting you on a new unborn branch;
  • git switch --orphan newname empties the index (and updates the working tree to match) while putting you on a new unborn branch.
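The difference between the two flags shows up in what is left in the index afterwards. This sketch builds two identical scratch repositories through a hypothetical helper, new_repo, and counts index entries after each command. It needs Git 2.23 or later, since that is when git switch was introduced.

```shell
set -eu

# Hypothetical helper: create a scratch repository with one commit holding two files.
new_repo() {
  dir=$(mktemp -d)
  (
    cd "$dir"
    git init -q
    git symbolic-ref HEAD refs/heads/main
    git config user.name "Example User"
    git config user.email "user@example.com"
    echo a > a.txt
    echo b > b.txt
    git add a.txt b.txt
    git commit -q -m "two files"
  ) >/dev/null
  echo "$dir"
}

r1=$(new_repo)
cd "$r1"
git checkout -q --orphan keep           # index and working tree left alone
kept=$(git ls-files | wc -l | tr -d ' ')

r2=$(new_repo)
cd "$r2"
git switch -q --orphan empty            # index emptied, tracked files removed
emptied=$(git ls-files | wc -l | tr -d ' ')
on_disk=$(ls | wc -l | tr -d ' ')
```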

We want our new root commit to hold the same snapshot as the final commit on the existing main branch. If we're "on" main, with everything all clean so that the HEAD commit, the index, and the working tree all match, and we git checkout --orphan new, we'll retain the desired snapshot. We can now simply git commit to create our new root commit.

That's our method 2 above.
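And here is method 2 run end to end, again on an invented three-commit history. The commit message is passed with -m so that no editor opens; the branch name new matches the recipe at the top.

```shell
set -eu
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"     # hypothetical identity for the demo
git config user.email "user@example.com"

# Hypothetical messy history on main.
for n in 1 2 3; do
  echo "echo v$n" > script.sh
  git add script.sh
  git commit -q -m "wip $n"
done

# Method 2.
git checkout -q --orphan new            # unborn branch; index and working tree untouched
git commit -q -m "Initial commit"       # becomes a new root commit on new

new_count=$(git rev-list --count new)
main_count=$(git rev-list --count main)
new_tree=$(git rev-parse 'new^{tree}')
main_tree=$(git rev-parse 'main^{tree}')
current=$(git symbolic-ref --short HEAD)
```

Here main keeps its full history (it could be the private dev branch from the question) while new holds a single commit with the same snapshot.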

(If you accidentally use git switch --orphan, all is not lost: you can git read-tree -m -u main to refill Git's index and your working tree; read-tree accepts -u only together with -m, --reset, or --prefix. But this command is even more obscure than the --orphan flag.)
