recover git files from merged master branches

Time:04-16

I believe I have accidentally deleted all history and files from my local repository, and I want to check whether that is in fact the case (and whether any file recovery is possible). I initially cloned a repo from a remote repo and never created a new branch locally. I created several files in one of the local directories. I then accidentally merged the master branch from the remote repo back into my local one, which caused those files to be deleted. I have no other branch to revert to (unlike in the other SO questions I have seen). Is there any way to get back to the 'pre-merge' local version?

edit: I also should note that the files in my local repo were never committed at any point. They were just saved within the folder that git init was run on.

CodePudding user response:

"edit: I also should note that the files in my local repo were never committed at any point. They were just saved within the folder that git init was run on."

That's the key information, and it means that Git cannot help you get the files back, once they're gone. If you have some sort of OS-level file backup (such as macOS Time Machine), that's the way to recover them.

What to know before using Git

When using Git, it's important to understand the underlying model by which Git works. The details get very complicated, but an overall description is pretty simple: Git stores commits, and that's basically it. If it's committed, it's in Git. If it's not committed, it's not in Git.[1]
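As a rough illustration of that rule, here is a minimal sketch in a throwaway repository. The directory, file names, and identity settings are made up for the example; the point is only that one file is committed and the other merely sits on disk.

```shell
set -e
repo=$(mktemp -d)                      # throwaway directory, for illustration only
cd "$repo"
git init -q
git config user.name "Example"         # local identity so the commit works anywhere
git config user.email "example@example.com"

echo "saved in a commit" > committed.txt
git add committed.txt
git commit -q -m "first commit"

echo "only on disk" > untracked.txt    # never added, never committed

tracked=$(git ls-files)                                  # paths Git knows about
untracked=$(git ls-files --others --exclude-standard)    # paths Git has never stored
echo "tracked:   $tracked"
echo "untracked: $untracked"
```

Only the first file is something Git could later hand back to you.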

The repository proper consists mainly of two databases. One database holds Git's objects: commits and other supporting objects. The other holds names such as branch and tag names. Both databases are simple key-value stores, with the names database storing hash ID values using the names as the keys, and the objects database storing the objects using hash IDs as the keys. There are also a bunch of auxiliary files in the .git directory, but it's the object database that is copied, more or less wholesale, by git clone. The names database is used to seed a new, independent names database in the new clone, with the clone holding different names but the same hash IDs. The hash IDs are universal (shared among all Git repositories everywhere via a hashing algorithm), but the names are unique to each Git repository.[2]
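To make the two key-value stores a little more concrete, here is a small sketch in a scratch repository (the names and contents are placeholders). git for-each-ref reads the names database, and git cat-file looks an object up by its hash ID.

```shell
set -e
repo=$(mktemp -d)                      # scratch repository, for illustration only
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo "hello" > file.txt
git add file.txt
git commit -q -m "first commit"

# Names database: each name is a key, and the value is a hash ID.
git for-each-ref --format='%(refname) -> %(objectname)'

# Objects database: each hash ID is a key, and the value is an object.
commit_id=$(git rev-parse HEAD)
obj_type=$(git cat-file -t "$commit_id")
echo "$commit_id is a $obj_type"
git cat-file -p "$commit_id"           # shows the tree, author, committer, and message
```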

So, the repository consists of these two databases, as stored inside the hidden .git directory. Git also stores a lot of extra data in the .git directory, which has varying amounts of importance to Git itself while you're using Git, but relatively less importance to you as a user of Git. That means that you can either say that the repository "is" the .git directory and its contents, or that the repository "is" the two databases: either claim is okay, especially if you qualify it as necessary.

But what about your working tree files? Well, before we get there, let's note one other feature of the objects database. The hash ID system that Git uses requires that objects never change. The hash ID of some object is simply a cryptographic checksum of the object's content: that's how Git manages, algorithmically, the trick of having the same ID everywhere. But it depends on this "never change" property.
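You can see the content-to-ID relationship with git hash-object, which computes the ID Git would assign to some content. This is only a sketch; the sample strings are arbitrary.

```shell
set -e
# The same bytes always produce the same hash ID, on any machine.
id_one=$(printf 'hello world\n' | git hash-object --stdin)
id_two=$(printf 'hello world\n' | git hash-object --stdin)

# Changing even one character produces a different ID, which is why
# a stored object can never be modified in place.
id_other=$(printf 'hello world!\n' | git hash-object --stdin)

echo "$id_one"
echo "$id_two"
echo "$id_other"
```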

What this means is that the committed copies of files literally can't change. Git therefore stores, in each commit, a full snapshot of every file, in a compressed, Git-ified, de-duplicated, and read-only fashion. The de-duplication handles the fact that most commits mostly have the same files as some previous commits. Since they are the same, they are de-duplicated and use no additional space. The read-only nature of the objects makes this possible, and the full-snapshot nature of each commit makes it necessary, in a sort of Ouroboros fashion.
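One way to observe the de-duplication is to compare the hash IDs of a file's stored content across two commits. This is a sketch with invented file names: the unchanged file resolves to the same object in both commits, while the edited file gets a new one.

```shell
set -e
repo=$(mktemp -d)                      # scratch repository, for illustration only
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "unchanged content" > stable.txt
echo "version 1" > changing.txt
git add stable.txt changing.txt
git commit -q -m "first"

echo "version 2, edited" > changing.txt
git commit -q -am "second"

# <commit>:<path> names the stored file content inside that commit's snapshot.
stable_old=$(git rev-parse HEAD~1:stable.txt)
stable_new=$(git rev-parse HEAD:stable.txt)
changing_old=$(git rev-parse HEAD~1:changing.txt)
changing_new=$(git rev-parse HEAD:changing.txt)

echo "stable.txt:   $stable_old / $stable_new"
echo "changing.txt: $changing_old / $changing_new"
```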

But this means you literally cannot use the committed files at all. Only Git can read them, and literally nothing, not even Git itself, can write to them. So what good are they? Well, like any archive, the snapshot in any given commit can be extracted. This is where your working tree comes in. When you select some particular commit, Git will extract the committed files. With git checkout or git switch, for instance, you pick out some branch name, and the branch name picks out the latest commit that is currently "on" that branch. The files come out of the archive, and go into your working tree.
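Here is a small sketch of that extraction step, again in a scratch repository with a made-up file. Selecting an older commit replaces the working tree copy with that commit's snapshot, and switching back replaces it again.

```shell
set -e
repo=$(mktemp -d)                      # scratch repository, for illustration only
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "version 1" > notes.txt
git add notes.txt
git commit -q -m "first"
echo "version 2, edited" > notes.txt
git commit -q -am "second"

git checkout -q HEAD~1                 # select the older commit (detached HEAD)
old_content=$(cat notes.txt)           # working tree now holds that snapshot
git checkout -q -                      # return to the branch we were on
new_content=$(cat notes.txt)

echo "older commit: $old_content"
echo "branch tip:   $new_content"
```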

The working tree in a standard Git repository is the directory in which you ran git init, if you created the repository that way, or the directory where git clone put the new clone, if you created the repository that way. The hidden .git folder is stored inside that working tree, in the top level. The repository is in the .git directory but the files you use are in the working tree. This means that the files you use are not in Git! They're only in the working tree. Until you save them in a commit, these files aren't in Git. Some of them came out of Git, if you have existing commits and are using one of those; you can get those back because they are in a commit, and came out of Git. But any file you've modified in the working tree is just a file on your computer. It's not in Git. Git can't get it back for you, if you destroy it.

The bottom line is that files that are not committed are not in Git. That's why you should commit early and often. Git can get committed files back for you; it can't get back the ones you never committed.
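The difference shows up clearly if you delete both kinds of file and then ask Git to restore the working tree. This is a sketch with hypothetical file names; the deletion stands in for whatever accident removed the files.

```shell
set -e
repo=$(mktemp -d)                      # scratch repository, for illustration only
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "safe" > committed.txt
git add committed.txt
git commit -q -m "first"
echo "at risk" > scratch.txt           # saved on disk, never added or committed

rm committed.txt scratch.txt           # simulate the accident
git checkout -q -- .                   # re-extract everything Git knows about

if [ -f committed.txt ]; then committed_back=yes; else committed_back=no; fi
if [ -f scratch.txt ]; then scratch_back=yes; else scratch_back=no; fi
echo "committed.txt restored: $committed_back"
echo "scratch.txt restored:   $scratch_back"
```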

(Note that it's also crucial that you not destroy the .git directory, through action or inaction. Don't put the .git folder in a cloud-shared / cloud-managed location as cloud-sharing software tends to damage Git's internal files. Cloud-syncers assume that humans are dealing with each file, and that humans know what to do with files named README.txt.cloudified-version-2 and so on. Git has no idea what to do if the cloud software renames its precious database files, and will think—correctly, at this point—that the repository is damaged.)


[1] There's a technical niggle here. If you have used git add on a file, this causes the file's content (but not the file's name) to be stored in the Git objects database. So sometimes this kind of content is recoverable. The name gets stored in Git's index, which is in an important sense less solid or more ephemeral than an object-database entry. Entities in the objects database persist for at least 14 days by default, regardless of other actions. Meanwhile, the index is constantly being updated every time you update the proposed next commit, or run git merge, or run git checkout or git switch or whatever. Entities in the object database are read-only; but the index is regularly overwritten, and an index entry that's lost this way is lost forever.

In any case, aside from the occasional ability to use git fsck --lost-found to recover the content of a file that was git add-ed but never committed, there's not a lot Git can do about files that did not make it into a commit.
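For what it's worth, here is a sketch of that one recoverable case, under the assumption that the file was staged with git add before it disappeared. The file names are invented, and the awk filter is just one way to pick the hash ID out of the fsck report.

```shell
set -e
repo=$(mktemp -d)                      # scratch repository, for illustration only
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"
echo "base" > base.txt
git add base.txt
git commit -q -m "first"

echo "staged but never committed" > draft.txt
git add draft.txt                      # content is now a blob in the objects database
git rm -q --cached draft.txt           # the index entry (the file's name) is gone
rm draft.txt                           # the working tree copy is gone too

# fsck reports the unreferenced blob and writes a copy under .git/lost-found/other/
blob_id=$(git fsck --lost-found 2>/dev/null |
  awk '$1 == "dangling" && $2 == "blob" { print $3 }')
recovered=$(git cat-file -p "$blob_id")
echo "recovered content: $recovered"
ls .git/lost-found/other/
```

Note that only the content comes back; the original file name has to be worked out by inspecting it.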

[2] There's a sort of standard mapping, where if I clone your Git repository, my clone's names database stores your branch names as my remote-tracking names. My branch names, which are independent of your branch names, then become a third person's remote-tracking names when that person clones my clone of your clone. Tag names, however, are by default copied as-is, so that the tag names are universal across all these clones. You, as the person running git clone and then additional Git commands, are in charge of this kind of mapping, so this is just a default standard. The fact that the hash IDs are universal is not under your control, but the way the names map is.
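The default mapping can be seen by cloning a small local repository. In this sketch the branch name main, the tag v1, and the directory layout are all assumptions chosen for the example.

```shell
set -e
origin=$(mktemp -d)                    # stands in for "your" repository
clone=$(mktemp -d)/clone               # stands in for "my" clone of it

git init -q "$origin"
git -C "$origin" symbolic-ref HEAD refs/heads/main   # fix the branch name for the demo
git -C "$origin" config user.name "Example"
git -C "$origin" config user.email "example@example.com"
echo "hello" > "$origin/file.txt"
git -C "$origin" add file.txt
git -C "$origin" commit -q -m "first"
git -C "$origin" tag v1

git clone -q "$origin" "$clone"

# The branch name arrives as a remote-tracking name; the tag arrives unchanged.
remote_names=$(git -C "$clone" for-each-ref --format='%(refname)' refs/remotes)
tag_names=$(git -C "$clone" for-each-ref --format='%(refname)' refs/tags)
origin_id=$(git -C "$origin" rev-parse main)
clone_id=$(git -C "$clone" rev-parse origin/main)

echo "$remote_names"
echo "$tag_names"
echo "same commit in both: $origin_id / $clone_id"
```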
