Clear 'hidden' large files in GitLab repo

Time:11-23

I know that the files I back up to GitLab are Python scripts and Jupyter notebooks. However, my GitLab repo says I'm currently using 9.8 GB (shocking!).

I never intended to commit large files (e.g. data files) to the repo. Visual inspection doesn't show me any such files that I could remove; all I see are the Python script files.

How do I clean those large files out of my repo?

CodePudding user response:

The large files are still stored in the repository's commit history on GitLab, even though you deleted them from the current tree. You can list them with the following script (taken from another answer):

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

First save the script to a file and make that file executable:

vim history.sh       # paste the script above into this file
chmod +x history.sh  # make the file executable

./history.sh         # run it

This lists every blob in the history with its size, smallest first (so the largest files are at the bottom), like so:

....
192e100aaf93  2.8MiB SMF/Checking/models/Model_0.h5
1b808a1a25ba  2.8MiB SMF/Checking/models/Model_2.h5
80168dc7ffb54 1.3GiB SMF/data/segments_instances_final.csv
775b60418498  1.5GiB Revised_KerasData_NoSmoothing.pickle
2341792d8c9b  4.2GiB geolife.sql
......
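If the full listing is too long to scan, it can be narrowed to blobs above a size threshold. The following is only a sketch: filter_large_blobs is a hypothetical helper name (not part of git), and it assumes input in the same "type sha size path" format that the git cat-file --batch-check call above produces, with the threshold given in bytes.

```shell
#!/bin/sh
# Hypothetical helper (not a git command): keep only blobs whose size is
# at least the given number of bytes.
# Expects lines of the form "<objecttype> <objectname> <objectsize> <path>" on stdin.
filter_large_blobs() {
  awk -v min="$1" '
    $1 == "blob" && $3 + 0 >= min + 0 {
      sha = substr($2, 1, 12)    # abbreviated object id, as in the listing above
      size = $3                  # size in bytes
      $1 = ""; $2 = ""; $3 = ""  # drop the first three fields, leaving the path
      sub(/^ +/, "")             # trim the leading separators left behind
      print sha, size, $0
    }'
}

# Usage, inside a repository (104857600 bytes = 100 MiB):
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     filter_large_blobs 104857600
```

Sizes are printed in raw bytes here rather than through numfmt, so the output can be sorted or compared further.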

To Delete large files

Use the BFG Repo-Cleaner to remove those files from the history:

Note: assuming you already have Java installed, download the bfg.jar file from the above repo and copy it to your current directory.

  1. Clone your git repository (and make a backup of it):
$ git clone --mirror git://example.com/my-large-repo.git
  2. Run the BFG to clean your repository up (e.g. to strip files larger than 100 MB):
$ java -jar bfg.jar --strip-blobs-bigger-than 100M my-large-repo.git

....
                            Before     After   
    -------------------------------------------
    First modified commit | fc7cf2f9 | a772ae4a
    Last dirty commit     | d4a1a3d4 | 9b345832

Deleted files
-------------

    Filename                                                    Git id                                                       
    -------------------------------------------------------------------------------------------------------------------------
    3Class_Instances.pkl                                      | ceebb395 (558.1 MB)                                          
    Beijing_KerasData.pkl                                     | 8681a270 (133.4 MB)                                          
    Filtered_Trajectory.pkl                                   | bfe06d09 (137.8 MB)    
      ....
  3. Expire the reflog and garbage-collect so the unwanted data is physically removed:
$ cd my-large-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive

Enumerating objects: 1306, done.
Counting objects: 100% (1306/1306), done.
Delta compression using up to 8 threads
Compressing objects:  78% (973/1238)
...
  4. Finally, push your cleaned repo back:
$ git push
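
To check locally that the cleanup actually shrank the repository, you can compare the packed size before and after the gc step. This is just a sketch: packed_kib is a hypothetical helper name, and it only parses the size-pack line that git count-objects -v prints (a value in KiB). It says nothing about when the storage figure shown by GitLab is refreshed.

```shell
#!/bin/sh
# Hypothetical helper: print the packed size in KiB, read from the
# "size-pack: <n>" line of `git count-objects -v` output on stdin.
packed_kib() {
  awk -F': ' '$1 == "size-pack" { print $2 }'
}

# Usage, inside my-large-repo.git:
#   git count-objects -v | packed_kib
```

Running it once before the BFG step and once after git gc gives two numbers to compare.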

Source: here
