Remove few files from old git commit

Time:04-07

I'm trying to remove files that I accidentally included in an old commit (not the most recent one), but I'm running into large conflicts with other files I want to keep. I just want to remove the unwanted files from that particular commit and keep the ones I want. I'm using VS2022.

Say my local feature branch MyBranch has the commits: A -> B -> C -> D -> E. All the commits are pushed to the remote MyBranch branch as well.

Commit C has file1, file2, file3 and file4. I want to remove the unwanted file2, file3 and file4 and keep only file1 in C, on both the local and remote branches. MyBranch is my private feature branch and no one but me is using it. If I revert commit C, file1 gets lots of merge conflicts. Is there a way to rewrite the history locally and update the remote as if MyBranch never included the unwanted file2, file3 and file4? Thanks

CodePudding user response:

TL;DR

Use git rebase -i and git push --force-with-lease or similar.

Long

Nothing, not even Git itself, can change any existing commit. But all is not lost here; it just means your job is more complicated.

You drew a set of commits I'll rephrase like this:

...--o--●   <-- main
         \
          A -> B -> C -> D -> E   <-- MyBranch, origin/MyBranch

It's important to realize that the arrows that connect commits (like A to B in your drawing) all go backwards and are part of the later commit. That's necessary, because once commit A is made, it cannot be changed. It contains an arrow pointing backwards to the last commit on the branch from which you started your branch MyBranch (the one I drew as ● above), and it will forever point backwards to that commit. So a more accurate drawing looks like this:

...--o--●   <-- main
         \
          A <-B <-C <-D <-E   <-- MyBranch, origin/MyBranch

(where we're lazy and don't draw the earlier commits' arrows correctly, in part because up-and-left arrows like ↖︎ just look kind of crappy). Besides these backwards-pointing "arrows" coming out of each commit, every commit contains a full snapshot of every file,¹ so it's likely that the files you added in commit C, that you don't want, also exist in commits D and E.

In any case, the ultimate problem here is that you literally cannot change commit C. It will always have those files and always point back to B. Commit D will always have whatever it has and will always point back to C, and commit E will always have whatever it has, and will always point back to D. But ... what if you extracted commit C, fixed things up, and made a new and improved commit? Let's call this new-and-improved commit C', and draw it in:

...--o--●   <-- main
         \
          A--B--C--D--E   <-- MyBranch, origin/MyBranch
              \
               C'  <-- improved-branch

Now we want to take existing commit D and improve it very similarly: new commit D', our copy of existing D, should do to C' whatever D did to C, and should point backwards to C':

...--o--●   <-- main
         \
          A--B--C--D--E   <-- MyBranch, origin/MyBranch
              \
               C'-D'  <-- improved-branch

We repeat once more for commit E to get:

...--o--●   <-- main
         \
          A--B--C--D--E   <-- MyBranch, origin/MyBranch
              \
               C'-D'-E'  <-- improved-branch

and then we get rid of the name improved-branch and make the name MyBranch find commit E' instead of finding commit E:

...--o--●   <-- main
         \
          A--B--C--D--E   <-- origin/MyBranch
              \
               C'-D'-E'  <-- MyBranch
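
The copy-by-hand procedure drawn above can be tried out in a throwaway repository. What follows is only a sketch under stated assumptions: the scratch directory, the placeholder identity, and the file names base.txt, A.txt, B.txt, D.txt and E.txt are invented for the demo; only MyBranch, improved-branch and file1 through file4 come from the discussion. It uses git cherry-pick to copy each commit, which is the long way around compared with the rebase described later.

```shell
#!/bin/sh
# Sketch only: build a scratch repo shaped like base--A--B--C--D--E,
# then recreate C, D, E by hand as C', D', E'.
set -e
repo=$(mktemp -d)                      # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Demo User"       # placeholder identity (assumption)
git config user.email "demo@example.com"

echo base > base.txt; git add base.txt; git commit -q -m "base"
git checkout -q -b MyBranch
for c in A B; do
    echo "$c" > "$c.txt"; git add "$c.txt"; git commit -q -m "$c"
done
for n in 1 2 3 4; do echo "$n" > "file$n"; done
git add file1 file2 file3 file4; git commit -q -m "C"
for c in D E; do
    echo "$c" > "$c.txt"; git add "$c.txt"; git commit -q -m "$c"
done

old_c=$(git rev-parse MyBranch~2)      # hash of the original C

# Start improved-branch at B, then apply C's changes without committing yet.
git checkout -q -b improved-branch MyBranch~3
git cherry-pick --no-commit "$old_c"
git rm -q -f file2 file3 file4         # -f: the files are staged but not in B
git commit -q -C "$old_c"              # C': reuse C's message and author

# D' and E': copy everything after C, in order.
git cherry-pick "$old_c"..MyBranch >/dev/null

# Make the name MyBranch find E', then drop the temporary name.
git branch -f MyBranch improved-branch
git checkout -q MyBranch
git branch -q -D improved-branch

new_c=$(git rev-parse MyBranch~2)
remaining=$(git ls-tree --name-only MyBranch file1 file2 file3 file4 | tr '\n' ' ')
echo "files from C still on MyBranch: $remaining"
```

The original C, D and E are still in the scratch repository after this runs; nothing was changed in place, only the name MyBranch was moved.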

¹ Each commit:

  • is numbered, with a big, ugly, random-looking (but entirely not random), cryptographic-checksum hash ID;
  • is immutable;
  • contains two things: a full snapshot of every file, and some metadata.

The metadata give stuff like the name and email address of the author of the commit and the date-and-time for when they made it. It includes the log message they (you) wrote at that time. And, to make those "arrows", each commit has a list of previous commit hash IDs. Most commits have exactly one entry in this list, and that's our "arrow" coming out of the commit: the hash ID lets Git use this commit to find the previous commit.

Since each commit holds a full snapshot of every file—with the file contents de-duplicated within and across commits—Git can simply compare the snapshots in A and B, for instance, to see which files are the same and which are different. Git then shows you only the different files, and does that by computing a git diff to show you, rather than showing you the entirety of each file in each commit. But that diff is not what the commit stores. It actually has a full copy of every file (with the de-duplication taking care of the obvious objection, that this would fill up your disk way too fast).
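
To see the snapshot, the metadata, and the backwards-pointing "arrow" for yourself, you can dump a raw commit object with git cat-file -p. This is a small sketch in a scratch repository; the directory, the placeholder identity and the commit messages are assumptions made for the demo.

```shell
#!/bin/sh
# Sketch only: a commit object stores a tree (the snapshot), parent hash(es), and metadata.
set -e
repo=$(mktemp -d)                      # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Demo User"       # placeholder identity (assumption)
git config user.email "demo@example.com"

echo one > file1
git add file1
git commit -q -m "A"

echo two > file2
git add file2
git commit -q -m "B"

# Raw commit object for B: tree, parent, author, committer, then the log message.
git cat-file -p HEAD

# The single "parent" line is the arrow from B back to A.
parent_in_b=$(git cat-file -p HEAD | sed -n 's/^parent //p')
hash_of_a=$(git rev-parse HEAD~1)

# B's snapshot lists every file, including file1, which B did not touch.
files_in_b=$(git ls-tree -r --name-only HEAD | tr '\n' ' ')
echo "files in B: $files_in_b"
```

Running git diff HEAD~1 HEAD in the same repository shows only file2, which is the computed comparison of the two snapshots rather than something stored in the commit.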


Using git rebase to get this far

The command that actually copies commits (e.g., to turn E into E') is git cherry-pick, but we have to use it multiple times—three, in this case. We'd like Git's power tool here, and that's Git's interactive rebase. We run:

git switch MyBranch     # or git checkout MyBranch, if/as needed

and then:

git rebase -i HEAD~3    # 3 here means "count back 3 times from E"

This brings up an instruction sheet that has three pick commands. These correspond to running git cherry-pick, which is Git's built-in command for making a copy of some commit. We don't want a full copy of C though, so we have to change that first pick command to edit, then write out this set of instructions and exit the editor,² which makes git rebase start the whole process and do the first cherry-pick, but then stop for amending. We can now run:

git rm file2 file3 file4

and then:

git commit --amend

(which is a bit of a lie,³ but gets us C') and then:

git rebase --continue

which finishes the job—the remaining two commits are still marked pick—and gets us the desired result:

...--o--●   <-- main
         \
          A--B--C--D--E   <-- origin/MyBranch
              \
               C'-D'-E'  <-- MyBranch
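
If you would like to rehearse the whole sequence before touching your real branch, it can be scripted in a scratch repository. The sketch below makes several assumptions: the directory, placeholder identity and extra file names are invented, and instead of editing the instruction sheet by hand it sets GIT_SEQUENCE_EDITOR to a sed command (GNU sed's -i option is assumed) that turns the first pick into edit. On your real branch you would simply make that change in your editor.

```shell
#!/bin/sh
# Sketch only: rehearse "edit C, drop three files, continue" in a scratch repo.
set -e
repo=$(mktemp -d)                      # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Demo User"       # placeholder identity (assumption)
git config user.email "demo@example.com"

echo base > base.txt; git add base.txt; git commit -q -m "base"
git checkout -q -b MyBranch
for c in A B; do
    echo "$c" > "$c.txt"; git add "$c.txt"; git commit -q -m "$c"
done
for n in 1 2 3 4; do echo "$n" > "file$n"; done
git add file1 file2 file3 file4; git commit -q -m "C"
for c in D E; do
    echo "$c" > "$c.txt"; git add "$c.txt"; git commit -q -m "$c"
done

old_e=$(git rev-parse HEAD)

# Stand-in for the manual edit: change the first "pick" (commit C) to "edit".
# Assumes GNU sed; the rebase then stops with C checked out.
GIT_SEQUENCE_EDITOR="sed -i -e '1s/^pick/edit/'" git rebase -i HEAD~3

git rm -q file2 file3 file4
git commit -q --amend --no-edit        # makes C'; --no-edit keeps C's message
GIT_EDITOR=true git rebase --continue  # copies D and E on top of C'

new_e=$(git rev-parse HEAD)
current=$(git rev-parse --abbrev-ref HEAD)
remaining=$(git ls-tree --name-only HEAD file1 file2 file3 file4 | tr '\n' ' ')
echo "now on $current; files from C still present: $remaining"
```

In this demo D and E only add their own files, so the two remaining picks apply cleanly. On a real branch, a later commit that modifies one of the removed files would stop the rebase with a conflict for you to resolve.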

² With some editors, you don't actually exit the editor, you just have the editor signal back to Git that the instruction sheet is done. The details depend on your editor. The git commit command, which brings up an editor for you to write your commit log message, works the same way, so whatever you are using for that (as long as it is not -m or something) will work here as well.

³ Git cannot change any existing commit, and git commit --amend is no different. That's why --amend is a lie. What --amend does is:

  • make a new commit, wherever we are now, but
  • instead of having the new commit point backwards to the current commit, have it point backwards to the current commit's parent(s).
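
A quick way to convince yourself of this is to compare hash IDs before and after an amend. The sketch below uses a scratch repository with invented file names and a placeholder identity (all assumptions), and checks that the amended commit is a different commit with the same parent, while the original still exists.

```shell
#!/bin/sh
# Sketch only: --amend makes a new commit with the same parent; the old one survives.
set -e
repo=$(mktemp -d)                      # throwaway directory (assumption)
cd "$repo"
git init -q
git config user.name "Demo User"       # placeholder identity (assumption)
git config user.email "demo@example.com"

echo one > file1; git add file1; git commit -q -m "B"
echo two > file2; echo three > file3
git add file2 file3; git commit -q -m "C"

before=$(git rev-parse HEAD)
parent_before=$(git rev-parse HEAD~1)

git rm -q file3                        # updates the index (staging area)
git commit -q --amend --no-edit        # builds C' from the index

after=$(git rev-parse HEAD)
parent_after=$(git rev-parse HEAD~1)
old_type=$(git cat-file -t "$before")  # the original C is still in the repository

echo "C  = $before"
echo "C' = $after"
echo "parent unchanged: $parent_after"
```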

Also, git rebase -i will "cheat" if it can and not actually copy a commit, if that's possible. So when we changed pick to edit and wrote out the instructions and exited, Git didn't actually bother to copy C. It just got us into "detached HEAD" mode with C being the current commit, like this:

...--o--●   <-- main
         \
          A--B   D--E   <-- MyBranch, origin/MyBranch
              \ /
               C   <-- HEAD

Our git commit --amend uses whatever is in Git's index aka staging area, so git rm file2 file3 file4 updates that and then we run the git commit --amend command. This makes new C' with the same parent that C has—i.e., B—and points HEAD to C':

...--o--●   <-- main
         \
          A--B--C--D--E   <-- MyBranch, origin/MyBranch
              \
               C'  <-- HEAD

When we run git rebase --continue, Git picks up from wherever it left off in the instructions. That has two more pick commands, for D and E, so the rebase now does normal cherry-picks for these: no shortcut is allowed here. So at the end of the cherry-picking sequence, rebase has:

...--o--●   <-- main
         \
          A--B--C--D--E   <-- MyBranch, origin/MyBranch
              \
               C'-D'-E'  <-- HEAD

Now that rebase has reached the end of the instructions successfully, it yanks the branch name MyBranch off of wherever it was before (commit E) and pastes it on here (at E'), then "re-attaches" Git's HEAD:

...--o--●   <-- main
         \
          A--B--C--D--E   <-- origin/MyBranch
              \
               C'-D'-E'  <-- MyBranch (HEAD)

which is how I usually draw these things.


Your own repository is now repaired; GitHub's is not, yet

You now have the commits you want (plus some you don't want, but you cannot do anything about that) in your repository. You now need to send the new commits to GitHub, so that they can put them in their repository (the one you control over there). They don't exist there yet.

Normally you would just run:

git push origin MyBranch

which would have your Git call up their Git, enumerate any commits you have that they don't (in this case C', D', and E'), send those commits, and ask them to set their name MyBranch, which your Git remembers as origin/MyBranch, to point to E'.

If you do this now, though, you'll see the commits do get sent, but then GitHub rejects the request to update the name MyBranch:

 ! [rejected]    MyBranch -> MyBranch (non-fast-forward)

This is Git's way of saying "they complained that if they obeyed your polite request to update their MyBranch, they'd lose some commits off the end". The commits they would lose are, of course, commits C-D-E: exactly the ones you want them to lose.

To make them give up those commits, you need to use one of the "force" variants of git push, so that instead of sending a "please, if it's OK, update your name MyBranch" request, you send an "update your name MyBranch! Do it now, dammit!" command. That's git push --force.

To be more careful (which in this case you won't need, but it's usually wise to be careful with sharp saws like --force), you can use --force-with-lease. This sends, instead of either a polite request or an overriding command, a compromise: "I think your branch name MyBranch identifies commit _______ (fill in the blank with the hash ID for E). If I'm right, change it to _______ (fill in the blank with another hash ID, this time for E'), even if that loses commits off the end of the branch. Let me know whether you did it." They will now make this check. Note that your Git supplies the hash ID for E based on your origin/MyBranch name, and supplies the hash for E' based on the fact that you ran:

git push --force-with-lease origin MyBranch

That is, the name MyBranch here supplied both hash IDs: one directly, and one via your Git looking up your origin/ variant of that name.

Using --force-with-lease takes care of the problem that occurs with a shared GitHub (or other site) repository to which multiple people might push commits. If someone else added on a commit F while you were fixing C-D-E to become C'-D'-E', your git push --force-with-lease origin MyBranch would fail, because your Git would send the hash ID of E when they actually now hold the hash ID of F. You can then run git fetch to obtain the new commit(s) and git cherry-pick them to your updated branch and try the --force-with-lease again.

Since nobody else is writing to this GitHub repository, you don't need the --force-with-lease, but it's good to know about.
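
The rejected push and the forced update can both be rehearsed without involving GitHub at all. In the sketch below a local bare repository stands in for the remote; that stand-in, the scratch directory, the placeholder identity and the file names are all assumptions made for the demo.

```shell
#!/bin/sh
# Sketch only: a local bare repo plays the part of the hosting site.
set -e
work=$(mktemp -d)                      # throwaway directory (assumption)
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git config user.name "Demo User"       # placeholder identity (assumption)
git config user.email "demo@example.com"
git remote add origin "$work/remote.git"

echo one > file1; git add file1; git commit -q -m "B"
git checkout -q -b MyBranch
echo two > file2; echo three > file3
git add file2 file3; git commit -q -m "C"
git push -q origin MyBranch            # remote MyBranch now points at C

old_tip=$(git rev-parse HEAD)
git rm -q file3
git commit -q --amend --no-edit        # local MyBranch now points at C'

# A plain push is refused: accepting it would drop C from the remote branch.
if git push -q origin MyBranch 2>/dev/null; then
    plain_push=accepted
else
    plain_push=rejected
fi
echo "plain push: $plain_push"

# The lease uses origin/MyBranch (still C) as the expected remote value.
git push -q --force-with-lease origin MyBranch

new_tip=$(git rev-parse HEAD)
remote_tip=$(git --git-dir="$work/remote.git" rev-parse refs/heads/MyBranch)
echo "remote MyBranch: $remote_tip"
```

If you prefer to spell out the expected value yourself, git push also accepts the form --force-with-lease=MyBranch:<hash>, where <hash> is the commit you believe the remote branch currently identifies.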
