How do I create and populate a gitignore file for a 15.5gb machine learning project?

Time:12-02

I'm working on a university project with ML, and the project has grown quite big. I don't usually use GitHub, but I need to format my PC and I don't trust the Google Drive backup I have, so I want a second backup to make sure I don't lose the code.

I'm using Git with GitHub Desktop. I'm not very knowledgeable about Git, so I'm having a hard time uploading this project: it disconnects every time I try to upload it. I'm pretty sure that's because of the size. Any help with that?

The IDE I'm using is PyCharm and the Python version is 3.7. I already have a requirements.txt created.

I tried searching for pre-made .gitignore files, but that didn't work.

CodePudding user response:

A .gitignore file alone will not help you there, because it only affects files that Git is not tracking yet. You need to remove the dependencies from your project's history. There are two ways to do that:

The traditional way involves git filter-branch. I've done that once in the past. It works, but it's easy to get wrong.

The alternative is to use BFG. I have no personal experience, but it seems to be easier to use, and claims to be faster. So if I were you, I'd give BFG a try.

Whichever way you try, make a local backup!

When you're done rewriting history, you can use a .gitignore to prevent yourself from re-adding the unwanted files.

CodePudding user response:

Welcome to Stack Overflow!

As you already sensed by yourself, Git is not really made to work with volumes of data that are as large as you say (15.5GB). The most important thing you have to do right now is identify which files you want to keep track of, and which files are just "binary files" that don't have to be versioned. You don't have to use any other tool than your brain for this (but looking around with any type of file explorer will teach you a lot).

Deciding what to keep

It is important to be quite severe here. As a general approach (there can be exceptions), try to keep out the following files:

  • Any file that is >1MB. There will surely be exceptions, but in general this is a good rule of thumb.
  • Anything that is binary/non text based. Git is made to work with diffs on files and this is not user-friendly with non-text based files. Examples: images, videos, powerpoints, ...
  • Anything that is generated by code (for example results of compilations, or data processing, ...)
  • Anything that is generated by a tool you use (for example folders created by your IDE)
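If looking around by hand gets tedious, the first two rules can be roughly automated. The following is only a sketch: the 1 MB limit comes from the list above, while the function names and the "contains a NUL byte" test for binary files are my own assumptions, and the heuristic will misjudge some files.

```python
import os

SIZE_LIMIT = 1_000_000  # about 1 MB, the rule of thumb from the list above


def looks_binary(path, sample_size=8192):
    """Heuristic (an assumption): a NUL byte in the first bytes means binary."""
    with open(path, "rb") as handle:
        return b"\0" in handle.read(sample_size)


def find_candidates(root, size_limit=SIZE_LIMIT):
    """Return (relative_path, size_in_bytes, reason) for files worth ignoring."""
    candidates = []
    for folder, subfolders, files in os.walk(root):
        # Do not descend into Git's own data.
        if ".git" in subfolders:
            subfolders.remove(".git")
        for name in files:
            path = os.path.join(folder, name)
            try:
                size = os.path.getsize(path)
                if size > size_limit:
                    reason = "large"
                elif looks_binary(path):
                    reason = "binary"
                else:
                    continue
            except OSError:
                continue  # unreadable file or broken link: skip it
            candidates.append((os.path.relpath(path, root), size, reason))
    # Biggest files first, since they matter most for the upload.
    return sorted(candidates, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for rel_path, size, reason in find_candidates("."):
        print(f"{size / 1_000_000:10.1f} MB  {reason:6}  {rel_path}")
```

Run it from the root directory of the project. The output is only a list of suggestions, so you still decide which entries really belong in the .gitignore.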

Creating a git repository

It seems like you have made a git repository already, but unless you have very important history you want to keep I suggest starting anew from where you are now. If it's for a university project I can imagine it being fine that you lose your history until now. If it's not fine for you to lose your history, you will have to change your history and delete large files from your repo (a risky operation I would not recommend to a new Git user. More info can be found in this SO post).

I'm suggesting to start a fresh repository because I feel you will learn more in this way, but if you prefer to change your history go ahead!

To start off a fresh repository, go to the root directory of your project and copy the .git folder to some place as a backup. This is often a hidden folder, and it contains all of your history!

Then, delete this .git folder (making sure that you have kept your backup .git folder somewhere).

After that, execute the git init command. You have a fresh git repository to work with! Typing git status will show a bunch of untracked files.
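Since you work in Python anyway, the backup and delete steps can also be scripted. This is a minimal sketch, not a required part of the procedure: the function name and the backup folder name git-history-backup are made up for illustration, and it moves the .git folder instead of copying and then deleting it, which has the same end result.

```python
import shutil
from pathlib import Path


def set_aside_git_folder(project_root, backup_dir):
    """Move <project_root>/.git into backup_dir and return the new location."""
    git_folder = Path(project_root) / ".git"
    if not git_folder.is_dir():
        raise FileNotFoundError(f"no .git folder found in {project_root}")

    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / "git-history-backup"  # hypothetical folder name
    if target.exists():
        raise FileExistsError(f"{target} already exists, pick another backup_dir")

    # One move takes the place of the separate copy and delete steps.
    shutil.move(str(git_folder), str(target))
    return target
```

For example, set_aside_git_folder(".", "../backups") (where ../backups is just a placeholder location) leaves the project without a .git folder, after which git init gives you the fresh repository.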

Populating your gitignore

The first thing we will do now is make our .gitignore file, before committing anything else. Let's say that you decided in your first step to ignore the following:

  • all *.xlsx files
  • everything inside of the build/ directory
  • all *.log files

In that case, you should create a text file (with any text editor: your IDE or notepad or anything) called .gitignore. Open this up with your text editor of choice and add the following text in there:

*.xlsx
build/*
*.log

Now save the file. You have made your .gitignore file! Now add and commit the file (using a good commit message) and type git status. You should see none of the unwanted files appearing! Now you can commit all the rest of your files (properly check git status to see that no unwanted files are tracked by git before committing them!) and you have a clean lightweight repo.
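To get a feeling for what the three example patterns catch before committing anything, here is a rough Python preview. It is only an approximation that I wrote for these simple patterns (the function names are hypothetical) and it does not reproduce all of Git's matching rules, such as negation with ! or ignoring whole directories by name. Git's own answer, for instance from git status --ignored, is the one that counts.

```python
import fnmatch
import os

PATTERNS = ["*.xlsx", "build/*", "*.log"]  # the three example patterns above


def roughly_ignored(rel_path, patterns=PATTERNS):
    """Approximate check: would one of the simple patterns exclude this file?"""
    rel_path = rel_path.replace(os.sep, "/")
    name = rel_path.rsplit("/", 1)[-1]
    for pattern in patterns:
        # A pattern without a slash is compared with the file name at any depth,
        # a pattern with a slash is compared with the path from the project root.
        subject = rel_path if "/" in pattern else name
        if fnmatch.fnmatchcase(subject, pattern):
            return True
    return False


def preview(root, patterns=PATTERNS):
    """Split the files under root into (kept, ignored) lists of relative paths."""
    kept, ignored = [], []
    for folder, subfolders, files in os.walk(root):
        if ".git" in subfolders:
            subfolders.remove(".git")  # Git's own data is never part of the listing
        for name in files:
            rel_path = os.path.relpath(os.path.join(folder, name), root)
            rel_path = rel_path.replace(os.sep, "/")
            if roughly_ignored(rel_path, patterns):
                ignored.append(rel_path)
            else:
                kept.append(rel_path)
    return sorted(kept), sorted(ignored)
```

Calling preview(".") in the project root returns the files that would stay visible to Git and the ones the patterns would hide, so a surprise in either list is a hint to adjust the .gitignore.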

Maintaining your gitignore

It's normal for the gitignore file to evolve during the project. Don't hesitate to add new lines in there if a new file type/folder enters the project that is actually unwanted in the repository.
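Adding lines by hand in a text editor is perfectly fine. If you prefer to do it from a script, a small helper along these lines could append new patterns without creating duplicates. The function name is hypothetical, and the patterns in the usage note below are only guesses at what a PyCharm/Python project might need.

```python
from pathlib import Path


def add_patterns(gitignore_path, new_patterns):
    """Append patterns that are not yet listed; return the ones that were added."""
    path = Path(gitignore_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    existing = {line.strip() for line in text.splitlines()}
    added = []
    for pattern in new_patterns:
        if pattern not in existing:
            added.append(pattern)
            existing.add(pattern)
    if added:
        # Make sure the old content ends with a newline before appending.
        if text and not text.endswith("\n"):
            text += "\n"
        path.write_text(text + "\n".join(added) + "\n", encoding="utf-8")
    return added
```

For instance, add_patterns(".gitignore", [".idea/", "__pycache__/"]) would add the PyCharm settings folder and Python's bytecode cache folder if they are not listed yet. Remember to commit the changed .gitignore afterwards.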

Hope this helps you a bit!
