Why would fetching specific git commits use more disk space than fetching all?

Time:03-26

If I run git fetch origin and then git checkout <revision> on a series of consecutive commits, I get a relatively small repo directory.

But if I run git fetch origin <revision> and then git checkout FETCH_HEAD on the same series of commits, the directory is relatively bloated. Specifically, there seem to be a bunch of large packfiles.

The behavior appears the same whether the commits are all in place at the time of the first fetch or if they are committed immediately before each fetch.

The following examples use a public repo, so you can reproduce the behavior.

Why is the directory size of example 2 so much larger?

Example 1 (small):

mkdir argo-cd
cd argo-cd/
git init
git remote add origin https://github.com/argoproj/argo-cd.git
git fetch origin
git checkout 497e53b0203638409e3083fa2ffac7d8fb3cce14
git fetch origin
git checkout 32be020af0f8bf6438201ee79b4d2b8037c57154
git fetch origin
git checkout 32d33dedcc70d94177384b235891b99d89497273
git fetch origin
git checkout 2e65b42f05bcc1401d1489e751993ec197f6942c
git fetch origin
git checkout b1ff9dbe1e3e3b2520e94eefc77d0322c765cd75
ls .git/objects/pack  # shows two files
du -sh .  # current directory is 96M

Example 2 (large):

cd ..
mkdir argo-cd-fetch
cd argo-cd-fetch/
git init
git remote add origin https://github.com/argoproj/argo-cd.git
git fetch origin 497e53b0203638409e3083fa2ffac7d8fb3cce14
git checkout FETCH_HEAD
git fetch origin 32be020af0f8bf6438201ee79b4d2b8037c57154
git checkout FETCH_HEAD
git fetch origin 32d33dedcc70d94177384b235891b99d89497273
git checkout FETCH_HEAD
git fetch origin 2e65b42f05bcc1401d1489e751993ec197f6942c
git checkout FETCH_HEAD
git fetch origin b1ff9dbe1e3e3b2520e94eefc77d0322c765cd75
git checkout FETCH_HEAD
ls .git/objects/pack  # shows ten files
du -sh .  # current directory is 244M

Note: I'm using git 2.32.0.

Note: The question is inspired by an apparent bug in Argo CD (https://github.com/argoproj/argo-cd/pull/8897). That's why I don't just git gc to clean up the waste.

Update / Clarification:

Below are the full logs of each example. But in this case, I pushed each commit to my fork immediately before running the next git fetch. So in this case we know that the initial fetch isn't "fetching everything," leaving the subsequent steps with basically nothing left to do.

Example 1 (small):

$ mkdir argo-cd-fork
~ $ cd argo-cd-fork/
~/argo-cd-fork $ git init
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint:
hint:   git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint:   git branch -m <name>
Initialized empty Git repository in /Users/mcrenshaw/argo-cd-fork/.git/
~/argo-cd-fork (master|✔) $ git remote add origin https://github.com/crenshaw-dev/argo-cd.git

# Fetch 1

~/argo-cd-fork (master|✔) $ git fetch origin
remote: Enumerating objects: 83781, done.
remote: Counting objects: 100% (89/89), done.
remote: Compressing objects: 100% (62/62), done.
remote: Total 83781 (delta 60), reused 45 (delta 25), pack-reused 83692
Receiving objects: 100% (83781/83781), 60.99 MiB | 22.12 MiB/s, done.
Resolving deltas: 100% (52061/52061), done.
From https://github.com/crenshaw-dev/argo-cd
 * [new branch]          add-chart-field-to-application-yaml              -> origin/add-chart-field-to-application-yaml
... removed a bunch of branches and tags for brevity ...
 * [new tag]             v2.1.4                                           -> v2.1.4
~/argo-cd-fork (master|✔) $ du -sh .
 65M    .
~/argo-cd-fork (master|✔) $ git checkout afb1fe635ff7f5c435c5780ba665c72d5bc3c557
Note: switching to 'afb1fe635ff7f5c435c5780ba665c72d5bc3c557'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at afb1fe635 chore: fix unit test

# Fetch 2

~/argo-cd-fork ((afb1fe63…)|✔) $ git fetch origin
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 161 bytes | 161.00 KiB/s, done.
From https://github.com/crenshaw-dev/argo-cd
   afb1fe635..f8fe71ab8  master     -> origin/master
~/argo-cd-fork ((afb1fe63…)|✔) $ git checkout f8fe71ab8f38095e296932b73f929bfbaf24f110
Previous HEAD position was afb1fe635 chore: fix unit test
HEAD is now at f8fe71ab8 test

# Fetch 3

~/argo-cd-fork ((f8fe71ab…)|✔) $ git fetch origin
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 162 bytes | 81.00 KiB/s, done.
From https://github.com/crenshaw-dev/argo-cd
   f8fe71ab8..0363d622c  master     -> origin/master
~/argo-cd-fork ((f8fe71ab…)|✔) $ git checkout 0363d622c391947349689904f6b40209ff3123cd
Previous HEAD position was f8fe71ab8 test
HEAD is now at 0363d622c test

# Fetch 4

~/argo-cd-fork ((0363d622…)|✔) $ git fetch origin
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 161 bytes | 161.00 KiB/s, done.
From https://github.com/crenshaw-dev/argo-cd
   0363d622c..4115a8c12  master     -> origin/master
~/argo-cd-fork ((0363d622…)|✔) $ git checkout 4115a8c1221751b1586caaf9871a0be12b5ce891
Previous HEAD position was 0363d622c test
HEAD is now at 4115a8c12 test

# Fetch 5

~/argo-cd-fork ((4115a8c1…)|✔) $ git fetch origin
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 161 bytes | 161.00 KiB/s, done.
From https://github.com/crenshaw-dev/argo-cd
   4115a8c12..8f01aaddb  master     -> origin/master
~/argo-cd-fork ((4115a8c1…)|✔) $ git checkout 8f01aaddbaf4350217dcc84866275493b19308eb
Previous HEAD position was 4115a8c12 test
HEAD is now at 8f01aaddb test

~/argo-cd-fork ((8f01aadd…)|✔) $ du -sh .
 96M    .

Example 2 (large):

 ~/argo-cd-fork ((8f01aadd…)|✔) $ cd ..
 ~ $ mkdir argo-cd-fork-2
 ~ $ cd argo-cd-fork-2
 ~/argo-cd-fork-2 [128]$ git init
 hint: Using 'master' as the name for the initial branch. This default branch name
 hint: is subject to change. To configure the initial branch name to use in all
 hint: of your new repositories, which will suppress this warning, call:
 hint:
 hint:   git config --global init.defaultBranch <name>
 hint:
 hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
 hint: 'development'. The just-created branch can be renamed via this command:
 hint:
 hint:   git branch -m <name>
 Initialized empty Git repository in /Users/mcrenshaw/argo-cd-fork-2/.git/
 ~/argo-cd-fork-2 (master|✔) $ git remote add origin https://github.com/crenshaw-dev/argo-cd.git

# Fetch 1

 ~/argo-cd-fork-2 (master|✔) $ git fetch origin 8f01aaddbaf4350217dcc84866275493b19308eb
 remote: Enumerating objects: 47713, done.
 remote: Counting objects: 100% (4/4), done.
 remote: Compressing objects: 100% (4/4), done.
 remote: Total 47713 (delta 3), reused 1 (delta 0), pack-reused 47709
 Receiving objects: 100% (47713/47713), 40.90 MiB | 26.40 MiB/s, done.
 Resolving deltas: 100% (31970/31970), done.
 From https://github.com/crenshaw-dev/argo-cd
  * branch              8f01aaddbaf4350217dcc84866275493b19308eb -> FETCH_HEAD
 ~/argo-cd-fork-2 (master|✔) $ git checkout FETCH_HEAD
 Note: switching to 'FETCH_HEAD'.

 You are in 'detached HEAD' state. You can look around, make experimental
 changes and commit them, and you can discard any commits you make in this
 state without impacting any branches by switching back to a branch.

 If you want to create a new branch to retain commits you create, you may
 do so (now or later) by using -c with the switch command. Example:

   git switch -c <new-branch-name>

 Or undo this operation with:

   git switch -

 Turn off this advice by setting config variable advice.detachedHead to false

 HEAD is now at 8f01aadd test

# Fetch 2

 ~/argo-cd-fork-2 ((8f01aadd…)|✔) $ git fetch origin 3fad137f5dcd8ebdb504a8b8de0138fb92d76458
 remote: Enumerating objects: 47714, done.
 remote: Counting objects: 100% (5/5), done.
 remote: Compressing objects: 100% (5/5), done.
 remote: Total 47714 (delta 4), reused 1 (delta 0), pack-reused 47709
 Receiving objects: 100% (47714/47714), 40.90 MiB | 19.89 MiB/s, done.
 Resolving deltas: 100% (31971/31971), done.
 From https://github.com/crenshaw-dev/argo-cd
  * branch                3fad137f5dcd8ebdb504a8b8de0138fb92d76458 -> FETCH_HEAD
 ~/argo-cd-fork-2 ((8f01aadd…)|✔) $ git checkout FETCH_HEAD
 Previous HEAD position was 8f01aaddb test
 HEAD is now at 3fad137f5 test

# Fetch 3

 ~/argo-cd-fork-2 ((3fad137f…)|✔) $ git fetch origin a94ab16b0964c2b583f8b923ad5a84b2a6b2b716
 remote: Enumerating objects: 47715, done.
 remote: Counting objects: 100% (6/6), done.
 remote: Compressing objects: 100% (6/6), done.
 remote: Total 47715 (delta 5), reused 1 (delta 0), pack-reused 47709
 Receiving objects: 100% (47715/47715), 40.90 MiB | 5.89 MiB/s, done.
 Resolving deltas: 100% (31972/31972), done.
 From https://github.com/crenshaw-dev/argo-cd
  * branch                a94ab16b0964c2b583f8b923ad5a84b2a6b2b716 -> FETCH_HEAD
 ~/argo-cd-fork-2 ((3fad137f…)|✔) $ git checkout FETCH_HEAD
 Previous HEAD position was 3fad137f5 test
 HEAD is now at a94ab16b0 test

# Fetch 4

 ~/argo-cd-fork-2 ((a94ab16b…)|✔) $ git fetch origin bf651bfc6653b6cf13a522d590a8779fc3b66a77
 remote: Enumerating objects: 47716, done.
 remote: Counting objects: 100% (7/7), done.
 remote: Compressing objects: 100% (7/7), done.
 remote: Total 47716 (delta 6), reused 1 (delta 0), pack-reused 47709
 Receiving objects: 100% (47716/47716), 40.90 MiB | 7.31 MiB/s, done.
 Resolving deltas: 100% (31973/31973), done.
 From https://github.com/crenshaw-dev/argo-cd
  * branch                bf651bfc6653b6cf13a522d590a8779fc3b66a77 -> FETCH_HEAD
 ~/argo-cd-fork-2 ((a94ab16b…)|✔) $ git checkout FETCH_HEAD
 Previous HEAD position was a94ab16b0 test
 HEAD is now at bf651bfc6 test

# Fetch 5

 ~/argo-cd-fork-2 ((bf651bfc…)|✔) $ git fetch origin 81895cf2a3f6e030aef7ddadc390b7a7743af03d
 remote: Enumerating objects: 47717, done.
 remote: Counting objects: 100% (8/8), done.
 remote: Compressing objects: 100% (8/8), done.
 remote: Total 47717 (delta 7), reused 1 (delta 0), pack-reused 47709
 Receiving objects: 100% (47717/47717), 41.00 MiB | 9.17 MiB/s, done.
 Resolving deltas: 100% (32005/32005), done.
 From https://github.com/crenshaw-dev/argo-cd
  * branch                81895cf2a3f6e030aef7ddadc390b7a7743af03d -> FETCH_HEAD
 ~/argo-cd-fork-2 ((bf651bfc…)|✔) $ git checkout FETCH_HEAD
 Previous HEAD position was bf651bfc6 test
 HEAD is now at 81895cf2a test

 ~/argo-cd-fork-2 ((81895cf2…)|✔) $ du -sh .
 242M    .

CodePudding user response:

Because each fetch produces its own packfile, and one packfile is far more efficient than several packfiles. How?

First, the checkouts are a red herring. They don't affect the size of the .git/ directory.
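A quick way to check this yourself is to count the files in the object store before and after a checkout. The following is only a minimal sketch on a throwaway repository; the directory name, file name, and identity values are made up for the illustration.

```shell
#!/bin/sh
# Sketch: a checkout changes the working tree, not the object store.
set -e
work=$(mktemp -d)
cd "$work"

# Hypothetical throwaway repository with two commits.
git init -q repo
cd repo
echo one > a.txt
git add a.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m first
echo two > a.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -am second

# Count every file in the object store (loose objects and packs alike).
objects_before=$(find .git/objects -type f | wc -l | tr -d ' ')
git checkout -q HEAD~1
objects_after=$(find .git/objects -type f | wc -l | tr -d ' ')

echo "object files before checkout: $objects_before"
echo "object files after checkout:  $objects_after"
```

The two counts should match. In the question's logs the directory grows from 65M to 96M, which is consistent with the checked-out working tree files being added outside .git/objects while the object store stays the same.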

Second, in the first example only the first git fetch origin does anything. The rest will fetch nothing (unless something changed on origin).

Why are multiple packfiles less efficient?

Compression works by finding common long sequences within the data and reducing them to very short sequences. If <div>long block of legal mumbo jumbo</div> appears dozens of times, each occurrence can be replaced with a few bytes. But the original long string must still be stored. If there is a single packfile, it needs to be stored only once. If there are multiple packfiles, it must be stored once in each of them. You are, effectively, storing the whole history of changes up to that point in each packfile.
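The effect can be illustrated with plain gzip. This is an analogy only, not Git's actual pack format: the file name and the test data are made up, and gzip merely stands in for the compression that happens inside a packfile.

```shell
#!/bin/sh
# Analogy only: gzip stands in for the compression inside a packfile.
set -e
work=$(mktemp -d)
cd "$work"

# A block of data that two "packfiles" would both need to contain.
seq 1 500 > block.txt

# Stored once per archive, in two separate archives.
one_copy=$(gzip -c block.txt | wc -c | tr -d ' ')
separate=$((one_copy * 2))

# Stored in one archive: the second copy compresses against the first.
together=$(cat block.txt block.txt | gzip -c | wc -c | tr -d ' ')

echo "two separate archives: $separate bytes"
echo "one combined archive:  $together bytes"
```

The combined archive comes out smaller than the two separate ones because the repeated data is stored in full only once.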

We can see in the example below that the first packfile is 113M, the second is 161M, the third is 177M, and the packfile from the final fetch is 209M. The final packfile is roughly the same size as the single garbage-collected packfile.

Why do multiple fetches result in multiple packfiles?

git fetch is very efficient. It will only fetch objects you do not already have. Sending individual object files is inefficient, so a smart Git server will send them as a single packfile.

When you do a single git fetch on a fresh repository, Git asks the server for every object. The remote sends it a packfile of every object.

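When you do git fetch ABC and then git fetch DEF, Git ideally tells the server "I already have everything up to ABC, give me all the objects up to DEF", and the server makes a new packfile of everything from ABC to DEF and sends it. Either way, a fetch that receives more than a few objects stores them as a new packfile. In the question's logs each git fetch origin <revision> received roughly 47,700 objects, so there every packfile holds nearly the whole history again.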

Eventually your repository will do an automatic garbage collection and repack these into a single packfile.
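The build-up and the clean-up can be reproduced locally without a network. The sketch below is an illustration under stated assumptions, not a copy of the setups above: the "server" and "client" repositories, the tags v1 to v3, and the branch name keep are all hypothetical. It sets fetch.unpackLimit to 1 so that even a tiny fetch is kept as a packfile (by default Git unpacks small fetches into loose objects), and it creates a branch before git gc so the fetched history is reachable from a ref.

```shell
#!/bin/sh
# Sketch: three ref-specific fetches leave three packfiles; git gc merges them.
set -e
work=$(mktemp -d)
cd "$work"

# Build a tiny hypothetical "server" repository with three tagged commits.
git init -q server
for n in 1 2 3; do
    echo "content $n" > "server/file$n.txt"
    git -C server add .
    git -C server -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "commit $n"
    git -C server tag "v$n"
done

# Client: keep every received pack as a pack, even a tiny one.
git init -q client
git -C client config fetch.unpackLimit 1
git -C client remote add origin "$work/server"

for n in 1 2 3; do
    git -C client fetch -q origin "v$n"
done
packs_before=$(find client/.git/objects/pack -name '*.pack' | wc -l | tr -d ' ')

# Make the fetched history reachable from a ref, then repack.
git -C client update-ref refs/heads/keep FETCH_HEAD
git -C client gc -q
packs_after=$(find client/.git/objects/pack -name '*.pack' | wc -l | tr -d ' ')

echo "packs before gc: $packs_before"
echo "packs after gc:  $packs_after"
```

Each fetch leaves one .pack file behind, and git gc consolidates them into one.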


We can reduce the examples. I'm going to use Rails to illustrate because it has clearly defined tags to fetch.

git init
git remote add origin https://github.com/rails/rails.git
git fetch origin
du -sh .git/objects/pack/*
22M .git/objects/pack/pack-ef0a91833c4774a28a21c814a26e04043621512d.idx
209M    .git/objects/pack/pack-ef0a91833c4774a28a21c814a26e04043621512d.pack

and:

git init
git remote add origin https://github.com/rails/rails.git

git fetch origin v5.0.0
du -sh .git/objects/pack/*
13M .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.idx
113M    .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.pack

git fetch origin v6.0.0
du -sh .git/objects/pack/*
13M .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.idx
113M    .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.pack
16M .git/objects/pack/pack-c81c5343636211ffcc9ffdfeeb3bb65b9cba75df.idx
161M    .git/objects/pack/pack-c81c5343636211ffcc9ffdfeeb3bb65b9cba75df.pack

git fetch origin v7.0.0
du -sh .git/objects/pack/*
18M .git/objects/pack/pack-2d2066f04670f137265fed0f382ad0d6f0dd9f3e.idx
177M    .git/objects/pack/pack-2d2066f04670f137265fed0f382ad0d6f0dd9f3e.pack
13M .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.idx
113M    .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.pack
16M .git/objects/pack/pack-c81c5343636211ffcc9ffdfeeb3bb65b9cba75df.idx
161M    .git/objects/pack/pack-c81c5343636211ffcc9ffdfeeb3bb65b9cba75df.pack

git fetch origin
du -sh .git/objects/pack/*
18M .git/objects/pack/pack-2d2066f04670f137265fed0f382ad0d6f0dd9f3e.idx
177M    .git/objects/pack/pack-2d2066f04670f137265fed0f382ad0d6f0dd9f3e.pack
13M .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.idx
113M    .git/objects/pack/pack-7be7f8792d634f63a623e50165a11983e7cdaeef.pack
22M .git/objects/pack/pack-b28e1368cf8e1ee0152e7dd7b328760c5b589c40.idx
209M    .git/objects/pack/pack-b28e1368cf8e1ee0152e7dd7b328760c5b589c40.pack
16M .git/objects/pack/pack-c81c5343636211ffcc9ffdfeeb3bb65b9cba75df.idx
161M    .git/objects/pack/pack-c81c5343636211ffcc9ffdfeeb3bb65b9cba75df.pack

And after garbage collection this is all collected into a single packfile roughly the same size as the single fetch.

git gc
du -sh .git/objects/pack/*
22M .git/objects/pack/pack-7f1d7066fb6c5bd6a47749b215c020fab5ca416b.idx
212M    .git/objects/pack/pack-7f1d7066fb6c5bd6a47749b215c020fab5ca416b.pack
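
If you would rather not list the pack directory by hand, git count-objects -v reports the number of packs and their total size. Below is a small sketch of a helper around it; the function name pack_stat and the throwaway repository are made up for the illustration.

```shell
#!/bin/sh
# Sketch: report pack count and packed size for a repository.
set -e

# pack_stat REPO KEY prints one field of `git count-objects -v`,
# for example "packs" or "size-pack" (size-pack is in KiB).
pack_stat() {
    git -C "$1" count-objects -v | awk -v key="$2:" '$1 == key { print $2 }'
}

# Hypothetical throwaway repository to try it on.
work=$(mktemp -d)
git init -q "$work/repo"
echo hello > "$work/repo/hello.txt"
git -C "$work/repo" add hello.txt
git -C "$work/repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial"

packs_loose=$(pack_stat "$work/repo" packs)   # objects are still loose
git -C "$work/repo" gc -q
packs_gc=$(pack_stat "$work/repo" packs)      # objects are now packed

echo "packs before gc: $packs_loose, after gc: $packs_gc"
```

Running pack_stat with packs and size-pack on a long-lived checkout is a cheap way to notice when packfiles are piling up between garbage collections.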