Does checking in a large binary and then deleting it affect repository size permanently?

Time:05-11

I had a merge request in which I checked in a build folder containing binaries by mistake. I deleted the build folder and made another commit to remove it from the merge request. That merge request was later merged to master, but I am having a debate with a colleague who believes that, even though I have deleted the binary folder, it will forever increase the size of the repository (master branch) just because it has been part of my commit history. I wonder if that is accurate. Thank you.

git checkout -b somebranch
# did some work
# created
git add .
git commit -m 'xxx'

# the remote name "origin" is assumed here; push needs a remote as well as a branch
git push --set-upstream origin somebranch

rm -rf build/
git add .
git commit -m 'yyy'
git push

somebranch was merged to master

Is the size of the repository increased by as much as the size of the removed binaries? Does this slow down subsequent cloning of the repo?

CodePudding user response:

Does checking in a large binary and then deleting it affect repository size permanently?

Yes it does. You can always go back to a commit where the file existed and check it out, so git needs to save it permanently.

Note, however, that you can make "shallow" clones using the git clone --depth parameter. This lets you grab only part of the history of a remote repo, which reduces the size on your local drive and speeds up the clone.
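
As a rough sketch of the difference, the following builds a throwaway repository that repeats the 'xxx' / 'yyy' commits from the question and then clones it twice. The temporary directories, the file names and the 1 MiB random file are made up for illustration, and the sizes printed will vary from system to system.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"

# Commit 'xxx': source plus an accidentally committed build folder.
mkdir build
echo 'int main(void) { return 0; }' > main.c
head -c 1048576 /dev/urandom > build/big.bin   # 1 MiB of incompressible data
git add .
git commit -q -m 'xxx'

# Commit 'yyy': remove the build folder again.
rm -rf build/
git add .
git commit -q -m 'yyy'

cd "$work"
# file:// is used because --depth is ignored when cloning from a plain local path.
git clone -q "file://$work/origin" full
git clone -q --depth 1 "file://$work/origin" shallow

full_commits=$(git -C "$work/full" rev-list --count HEAD)
shallow_commits=$(git -C "$work/shallow" rev-list --count HEAD)
full_kib=$(du -sk "$work/full/.git" | cut -f1)
shallow_kib=$(du -sk "$work/shallow/.git" | cut -f1)
echo "full clone:    $full_commits commits, $full_kib KiB"
echo "shallow clone: $shallow_commits commits, $shallow_kib KiB"
```

The shallow clone only fetches the latest commit, whose snapshot no longer lists build/big.bin, so the deleted binary never reaches the local disk.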

Depending on where in your history the file existed you could use git rebase to rewrite the history to have never contained that file. Remember that the actual disk space is only freed some time afterwards when git actually removes the now orphaned object during garbage collection.

CodePudding user response:

I have a debate with a colleague who believes that, even though I have deleted the binary folder, just because it has been part of my commit history, it will forever increase the size of the repository (master branch). I wonder if that is accurate.

Basically yes, it's totally accurate. The fundamental principle of Git is that commits cannot be lost or modified. Therefore if you made a commit that had the big files in it, that commit is still there and by definition so are the big files. Moreover, every later commit has the bad commit in its history, so any clone that contains a later commit also has to contain the big files, even though the later snapshots no longer list them.

So this is not a problem that you can solve just by making some new commits. The way you solve this kind of problem is to run the entire repository through some potentially dangerous tool (such as filter-repo) that effectively rewrites the whole repository in a tricky way, manipulating the existing commits in a deep under-the-hood way, ripping out the bad commits and making replacement commits that lack the problematic files.
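
For what it's worth, here is a minimal sketch of such a rewrite on a throwaway repository. It assumes git filter-repo is installed as a separate tool (it does not ship with Git), so the rewrite step is skipped when the command is not found. The repository layout and file names are invented for the example. Do not run this against a shared repository without agreeing on it first, because every rewritten commit gets a new hash ID.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

mkdir build
echo 'hello' > README
head -c 1048576 /dev/urandom > build/big.bin
git add .
git commit -q -m 'xxx'          # commits the build folder by mistake
rm -rf build/
git add .
git commit -q -m 'yyy'          # removes it again

# Count how often build/big.bin shows up among all objects reachable from any ref.
before=$(git rev-list --objects --all | grep -c 'build/big.bin' || true)
after=skipped

if command -v git-filter-repo >/dev/null 2>&1; then
    # Drop every path under build/ from every commit. --force is needed here
    # only because this throwaway repository is not a fresh clone.
    git filter-repo --force --invert-paths --path build/
    after=$(git rev-list --objects --all | grep -c 'build/big.bin' || true)
fi
echo "history entries for build/big.bin: before=$before after=$after"
```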

CodePudding user response:

You've tagged this with git, but you said:

that merge request was merged to master later ...

Git itself does not have anything called "merge requests" (nor "pull requests" for that matter). Git does have merges and the git merge command (and also the git request-pull command, but this merely makes text suitable for sending as part of an email message: it does no merging). GitLab offers something called Merge Requests, and other hosting sites offer things called Pull Requests, but we have to guess what you mean here since you have not mentioned GitLab.

What we can say, about Git in general, is this:

  • If the commit(s) with the large files in them exist, they need space for those files.
  • If those commit(s) no longer exist, they don't need space for those files.
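
To see the first bullet in action, here is a small sketch that repeats the two commits from the question in a scratch repository and then asks Git whether the big file is still stored. The file names and the 1 MiB size are arbitrary choices for the demonstration.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

mkdir build
echo 'hello' > README
head -c 1048576 /dev/urandom > build/big.bin
git add .
git commit -q -m 'xxx'
rm -rf build/
git add .
git commit -q -m 'yyy'

# The latest snapshot no longer lists anything under build/ ...
in_tip=$(git ls-tree -r --name-only HEAD | grep -c '^build/' || true)
# ... but the blob is still reachable through the older commit.
in_history=$(git rev-list --objects --all | grep -c 'build/big.bin' || true)
blob_size=$(git cat-file -s HEAD~1:build/big.bin)

echo "paths under build/ in HEAD:        $in_tip"
echo "history entries for build/big.bin: $in_history"
echo "stored blob size in bytes:         $blob_size"
git count-objects -vH            # overall size of the object database
```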

So the Git answer to whether these:

forever increase the size of the repository

depends on whether those commits still exist. Answering that question is where things get complicated, because MRs and PRs and hosting sites add wrinkles that aren't there in plain Git.

You added this to that last bit:

(master branch).

The branch name isn't really important. Git isn't about branches; it's about commits. We organize our commits into things we call "branches"—the word branch is badly overused in Git, almost to the point where it becomes meaningless, but the collections-of-commits form "branches"—and we use branch names to find specific important commits, by which we can find the earlier commits that make up these collections that we call "branches". But at the end of the day, it's really all about the commits.

Each Git commit:

  • Is numbered: it has some big ugly hash ID (e.g., a123456 when shortened). This hash ID—or more formally, object ID or OID—is literally how Git finds the commit. Git needs the OID to locate the commit inside its big database of all-of-its-Git-objects.

  • Is read-only: no part of any commit can ever be changed.

  • Stores two things: a full snapshot of every file, plus some metadata (information about the commit itself).

The files stored in the commit are stored in a special, read-only, Git-only, compressed and de-duplicated format, so that if there are a million commits but all of them just reuse one file over and over again, there's only one copy of that file in the repository. Commits can share files this way because the objects are all read-only: since it's impossible to overwrite an object, you can't change the stored file, so it's safe for all the commits to share it. But as long as some commit, somewhere in the objects database, has some big file, the repository has the file in it, and therefore that file takes (some) space.[1]
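
The de-duplication is easy to check for yourself. In this sketch (scratch repository, invented file names), four commits all contain shared.txt, yet every one of them refers to the same stored object:

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

echo 'shared content' > shared.txt
git add shared.txt
git commit -q -m 'first'

# Three more commits that never touch shared.txt.
for i in 1 2 3; do
    echo "change $i" > "other$i.txt"
    git add .
    git commit -q -m "commit $i"
done

# Object ID of shared.txt as recorded in the oldest and the newest snapshot.
first_blob=$(git rev-parse HEAD~3:shared.txt)
last_blob=$(git rev-parse HEAD:shared.txt)

# Number of distinct stored objects found under the path shared.txt in all history.
copies=$(git rev-list --objects --all | grep -c ' shared.txt$' || true)

echo "oldest snapshot: $first_blob"
echo "newest snapshot: $last_blob"
echo "stored copies:   $copies"
```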

The metadata in any given commit includes, for Git's own use, the raw hash ID of the parent (or parents, plural) of that commit. This parent linkage forms a backwards-looking chain of commits, which is how "branches" actually work: the branch name gives the raw hash ID of the latest commit, and from that commit, Git works backwards, one hop at a time, from a child commit to its parent. That parent is itself a child of another parent (the "grandparent" of the latest commit), and that parent has another parent, and so on. Traversing this list, backwards, one hop at a time, finds the history in the repository; the commits are the history in the repository, chained together (backwards) via this metadata.

The upshot of all of that is straightforward enough: if you have the latest commit in a Git repository, you also have every earlier commit. That's because a Git commit automatically brings with it its parent, which automatically brings with it another parent, and so on, all the way back to the start of history.[2] This is what perivesta means by this comment.
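
The parent linkage can be inspected directly. The sketch below makes three commits in a scratch repository (the names A, B and C are placeholders) and reads the parent hash ID out of the latest commit's metadata:

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

for name in A B C; do
    echo "$name" > "$name.txt"
    git add .
    git commit -q -m "$name"
done

# Raw contents of the latest commit: tree, parent, author, committer, message.
git cat-file -p HEAD

# The parent recorded in the metadata is the same commit that HEAD^ names.
parent_from_metadata=$(git cat-file -p HEAD | sed -n 's/^parent //p')
parent_from_syntax=$(git rev-parse HEAD^)

# Walking the chain backwards from the latest commit finds all of history.
count=$(git rev-list --count HEAD)
# The very first commit has no parent line at all.
root_parents=$(git cat-file -p HEAD~2 | grep -c '^parent ' || true)

echo "commits reachable from HEAD: $count"
echo "parent lines in first commit: $root_parents"
```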


[1] The amount of space needed for a file depends heavily on how well it compresses. Git has two different kinds of compression, too. As a general rule, these compression techniques work really well on source files made up of human-readable text, not very well on build products, and not at all on pre-compressed files like JPG images or zipped archives or whatever. But that's a separate issue.

[2] A special kind of clone, a so-called shallow clone, omits certain parents at "shallow graft points". Shallow clones have certain restrictions, so we normally use full (non-shallow) clones. The repository size you're asking about is that of a non-shallow clone in the first place, so we get to ignore this special case.


How we get rid of commits

A Git repository is normally built up over time by adding commits, one by one, like bricks making up a building. We never remove a commit: we just add a new one. Your "commit build products, remove build products, commit again" process made two commits, one of which holds the build products and one of which doesn't. The second of these commits requires the first one because of the way Git stores commits and metadata.

But sometimes we really do want to remove a commit. We can do that—sort of—using the fact that a branch name is defined as "contains the hash ID of the latest commit". When we add commits to a repository, Git works like this:

... <-F <-G <-H   <--somebranch

Here, the name somebranch points to (contains the raw hash ID of) the currently-latest commit H (the letter H stands in for some actual, random-looking, big ugly hash ID). Commit H points backwards to its parent G (the metadata for H contains G's hash ID), G points backwards to F, and so on.

When we make a new commit using commit H as the current commit, Git stores a new snapshot-and-metadata and sets it up so that the new commit points backwards to H. Then Git stores this new commit's hash ID into the branch name:

... <-F <-G <-H <-I   <--somebranch

where I is our new commit.

To get rid of commit I, and commit H too, we can tell Git that it should store the raw hash ID of commit G into the name somebranch. When we do that, we get:

          H--I   ???
         /
...--F--G   <-- somebranch

The name somebranch now points to G, which continues (forever!) to point backwards to F. Commits H-I are still in the database but nobody can find them.

Once we do this kind of thing, Git will eventually drop commits H-I entirely. Precisely when and why is complicated, but the key is that there must be no names by which we can find the hash IDs. As long as no name lets us find commit I, commit I will eventually get tossed out. As long as commit I is being tossed out and it's the only way to find commit H, commit H will eventually get tossed out too. Commit G, on the other hand, is easily found: the name somebranch finds it. Commit F is easily found by starting with the name somebranch, which finds G, and working backwards. So commits up through and including G won't be tossed out.

This means that we get rid of commits, in Git, by re-arranging our branch names (and other names as needed: tag names, remote-tracking names, and any other names that might exist in the repository) such that they don't find the commit. Because commits find earlier commits, though, it's absolutely necessary that we strip away all commits after some point, in order to strip away that particular commit. To get rid of commit H we must get rid of commit I too!

Now we can talk about the mechanics involved: what commands we use in Git to do this sort of thing.

git reset and git branch -f

If we just want to move some branch name, we can tell Git to move the branch name. There are two commands that do this:

  • git branch -f will let you take any branch name and make it point to any commit. You just supply the name and the hash ID. But there's one restriction: you can't have this branch checked out. If you do have the branch checked out, you need the other command.

  • git reset will let you move the current branch name to any commit. You just supply the hash ID. The current branch moves to point to that hash ID. You must, however, choose what Git should do with Git's index and your working tree. (This gets complicated but I'll skip all the details.)

The negative part of doing this, of course, is that we lose all the commits after the one we want to drop. If commit H is bad, but commit I is good, and we reset or git branch -f away commit H, we lose commit I too.
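
Here is a sketch of the whole sequence in a scratch repository. The second branch name, keep, is made up so that somebranch is not checked out when git branch -f runs. The last two commands force the cleanup that Git would otherwise do on its own schedule, which is only sensible in a throwaway repository like this one.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

git checkout -q -b keep
for name in F G; do
    echo "$name" > "$name.txt"
    git add .
    git commit -q -m "$name"
done
commit_g=$(git rev-parse HEAD)

git checkout -q -b somebranch
for name in H I; do
    echo "$name" > "$name.txt"
    git add .
    git commit -q -m "$name"
done
commit_i=$(git rev-parse HEAD)

# git branch -f refuses to move the checked-out branch, so switch away first.
# (While on somebranch, `git reset --hard "$commit_g"` would move it instead.)
git checkout -q keep
git branch -f somebranch "$commit_g"
tip_after_move=$(git rev-parse somebranch)

# H and I are no longer found by any branch name, but they still exist for now.
still_there_before=no
if git cat-file -e "$commit_i" 2>/dev/null; then still_there_before=yes; fi

# Drop the reflog entries that still remember H and I, then prune right away.
git reflog expire --expire=now --all
git gc -q --prune=now

gone_after=no
if ! git cat-file -e "$commit_i" 2>/dev/null; then gone_after=yes; fi
echo "I present before cleanup: $still_there_before, gone after cleanup: $gone_after"
```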

Rebase

The git rebase command, especially in its interactive rebase form, lets us re-arrange commits however we like. Now, we already know that it's impossible to change any commit, so that's not what rebase does. Instead, rebase works by copying commits. A rebase operation essentially runs multiple git cherry-pick commands, with each cherry-pick being a single-commit-copy step. So rebase automates these copies. But it also adds a first and last step.

Suppose we have these commits:

...--F--G--H--I   <-- somebranch

and we'd like to drop commit G entirely, but keep H and I. To do that, we need to copy H and I to new-and-improved commits, which we'll call H' and I'. These new commits will have new snapshots and new metadata: new commit H' will use commit F as its parent, and new commit I' will use H'—the copy of H—as its parent, like this:

       H'-I'
      /
...--F--G--H--I   <-- somebranch

Once we've made these two copies, git branch -f (or an equivalent git reset) can move the name somebranch to point to I':

       H'-I'  <-- somebranch
      /
...--F--G--H--I   ???

which we can draw this way:

...--F--H'-I'   <-- somebranch
      \
       G--H--I   ???

Git will now—well, eventually—discard the entire G-H-I chain, since there's no way to find it. When we look at history, we see commits that look exactly like the original H and I—they make the same changes as H and I did, and have the same log messages—but they have different hash IDs.
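
One non-interactive way to get this result is git rebase --onto, sketched here in a scratch repository. The file names are invented, and G is made the commit that adds the unwanted build output so that the effect is visible.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git checkout -q -b somebranch

echo 'f' > f.txt
git add .
git commit -q -m 'F'
commit_f=$(git rev-parse HEAD)

# G is the commit to drop: it adds the unwanted build output.
mkdir build
head -c 65536 /dev/urandom > build/big.bin
git add .
git commit -q -m 'G'
commit_g=$(git rev-parse HEAD)

echo 'h' > h.txt
git add .
git commit -q -m 'H'
echo 'i' > i.txt
git add .
git commit -q -m 'I'
old_i=$(git rev-parse HEAD)

# Copy everything after G (that is, H and I) onto F, then move somebranch
# to point at the last copy.
git rebase -q --onto "$commit_f" "$commit_g" somebranch

new_i=$(git rev-parse somebranch)
new_base=$(git rev-parse somebranch~2)
new_count=$(git rev-list --count somebranch)
tip_subject=$(git log -1 --format=%s somebranch)
build_in_history=$(git rev-list --objects somebranch | grep -c 'build/big.bin' || true)

git log --oneline somebranch
echo "old I: $old_i"
echo "new I: $new_i"
```

The log messages are the same as before, but the hash IDs of the copies differ from the originals, and build/big.bin can no longer be found by starting from somebranch.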

Merge vs squash-merge

Besides rebasing commits away, there's one other trick we may often use called a squash merge. This one is workflow-dependent: some people love squash merges, some people hate them, some are kind of indifferent, and a very small number of people actually know what they're doing and use them exactly when they're appropriate.
