Git: how do I set a folder in my project equal to that in a remote branch?

Time:10-11

I have a feature/mine branch that I want to merge into feature/other, however somewhere during development there were changes made in specific folders that are still in feature/mine and should not be in feature/other. So, the diff in my PR shows changes in src/some-folder-I-dont-want-to-change, but these changes were made deep down the commit tree. I can't simply revert some commits.

What I want is to simply set my src/some-folder-I-dont-want-to-change to feature/other's src/some-folder-I-dont-want-to-change. Is there a way to do this?

I tried git checkout feature/other -- src/some-folder-I-dont-want-to-change but that just adds files from feature/other, it doesn't remove files that are on my branch but shouldn't be.

CodePudding user response:

Sounds like you're looking for git restore, see e.g. https://stackoverflow.com/a/15404733/6060876.

I believe for your use case, you would need to write something like the following:

git restore --source=feature/other --staged --worktree -- src/some-folder-I-dont-want-to-change
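
Here is a minimal sketch of that command in a throwaway repository, in case it helps to try it before touching real work. The file names (keep.txt, unwanted.txt) and the user identity are made up for the demo, and git restore needs Git 2.23 or later. As far as I know, git restore defaults to no-overlay mode, which is why the extra file gets removed:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main          # fixed branch name for the demo
git config user.name "Example User"            # hypothetical identity
git config user.email "user@example.com"

folder=src/some-folder-I-dont-want-to-change
mkdir -p "$folder"
echo base > "$folder/keep.txt"
git add .
git commit -q -m "base"
git branch feature/other                       # stands in for the target branch

# On feature/mine, the folder picks up changes that should not be merged.
git checkout -q -b feature/mine
echo changed > "$folder/keep.txt"
echo extra > "$folder/unwanted.txt"
git add .
git commit -q -m "unwanted folder changes"

# Make index and working tree match feature/other for this folder only.
git restore --source=feature/other --staged --worktree -- "$folder"
git commit -q -m "reset folder to feature/other"

ls "$folder"
```

After the last commit, git diff feature/other feature/mine -- "$folder" should print nothing.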

CodePudding user response:

TL;DR

You may just want to check out some files from a particular commit, and commit again. Then the difference from the merge base to your branch tip will not include any changes to those files, or will have the same changes to those files, depending on what exactly you need. This may set up future problems, so you need to be aware of this.

You may, instead, want to use git rebase to replace your existing commits with some new-and-improved commits. This is more complex, and may set up some immediate problems, but is likely to solve the future problems.

Long

You're thinking of your project as if it were a simple collection of files, but you are using Git. Git is not about files; Git is about commits. A commit contains files (in fact, each commit has a full snapshot of every file), but it is a commit, not some files. Some people then think that Git is about branches, but that's not right either. Branches help us (and Git) find commits. But Git is all about commits. So you need to think about commits here, not files or folders.

This doesn't solve your problem, but you'll need to know all about it in order to solve your problem. So read on.

What to know about commits

A commit, in Git, is a numbered entity. Each commit gets a unique number, which Git calls a hash ID or sometimes an object ID. These aren't simple counting numbers, though: they don't go commit #1, then #2, #3, and so on. Instead, each commit's unique number is enormous (between 1 and 2^160-1) and apparently random, and is normally expressed in hexadecimal, such as cefe983a320c03d7843ac78e73bd513a27806845.

These numbers are quite useless to humans (so we mostly don't use them—you'll occasionally use one with cut-and-paste if needed), but they are how Git actually finds the commits. So Git will need them. We'll see in a moment how you avoid typing them in.

Each of these numbered entities—these commits—holds two things:

  • Each commit has a full snapshot of every file. This snapshot is in a special, read-only, Git-only, compressed and de-duplicated form. The files inside the commit are not usable by anything but Git itself.

  • Separate from the snapshot, each commit has some metadata, or information about the commit itself: who made it (name and email) and when, for instance.

Crucially for Git itself, each commit holds the raw hash ID of some earlier commits. Most commits—the ones Git calls ordinary commits—hold the hash ID of the immediate preceding commit.

What all this means for you is that you only have to tell Git the hash ID of the latest commit. Suppose we draw this, using uppercase letters to stand in for the actual commit hash IDs, which are too ugly to bother with. Let's use H to stand in for the latest Hash ID:

              H

In H's metadata, Git has stored the raw hash ID of some earlier commit. Let's call that commit G:

          G <-H

By reading H, Git will be able to get the hash ID for G, and thus be able to read G. So we say that H points to G.

But G itself is a commit too, so it has the hash ID of some earlier commit F, which in turn has the hash ID of yet another even-earlier commit:

... <-F <-G <-H

Hence, given the hash ID of the latest commit, Git can easily find every previous commit, all the way back in time. Git simply follows the backwards-pointing internal arrows, one step at a time.
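
If you want to see these backwards-pointing links for yourself, the following sketch builds three commits standing in for F, G, and H in a scratch repository. The file name and identity are invented for the demo:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"            # hypothetical identity
git config user.email "user@example.com"

# Three commits standing in for F, G, and H.
for name in F G H; do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "commit $name"
done

# The raw commit object: a tree (the snapshot), a parent, and the metadata.
git cat-file -p HEAD

# The parent hash stored inside H is the hash ID of G.
parent_in_H=$(git cat-file -p HEAD | sed -n 's/^parent //p')
hash_of_G=$(git rev-parse HEAD~1)

# Following the parent links from H visits H, then G, then F.
chain_length=$(git rev-list HEAD | wc -l | tr -d ' ')
echo "H points to $parent_in_H (chain length: $chain_length)"
```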

Since H and G both have complete snapshots—in Git's special de-duplicated format, no less—Git can easily retrieve both G and H and compare them, to see what files changed, if any. That lets Git figure out what you did, if you made H, and show you changes. Git didn't store the changes, it just computes them when you ask it to show H, on the theory that showing the changes since G is more interesting than showing the raw contents of H.
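
A small sketch of this, again in a scratch repository with made-up file names: commit H changes one of two files, Git computes the difference on demand, and the unchanged file is the very same stored object in both snapshots.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"            # hypothetical identity
git config user.email "user@example.com"

echo one > changed.txt
echo stable > same.txt
git add .
git commit -q -m "G"

echo two > changed.txt
git commit -q -am "H"

# Git compares the two snapshots when asked; no diff is stored in H.
changed=$(git diff --name-only HEAD~1 HEAD)

# De-duplication: the untouched file is one shared blob in G and H.
blob_in_G=$(git rev-parse HEAD~1:same.txt)
blob_in_H=$(git rev-parse HEAD:same.txt)

git --no-pager show --stat HEAD
echo "changed between G and H: $changed"
```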

In fact, it's not just the snapshot that's read-only: all parts of any commit are completely read-only. So these internal arrows, once a commit is made, are set that way forever.

Branch and other names

To do all this, though, Git needs to know the hash ID of that latest commit. This is where branch names enter the picture. We get a little lazy about drawing the arrows from commit to commit, and draw them like this:

...--F--G--H   <-- main

Here the name main points to commit H, the way H points to G. Unlike the arrows inside commits, though, the arrows coming from a branch name can change: we can make main point to G if we really want to. (That's a lot of what git reset is about, though I'm not going to cover it here.)

We can make more branch names. The only limit Git makes on our branch names is that they must point to actual commits. So we can have two names that both point to H, like this:

...--G--H   <-- develop, main

Now that we have more than one name, we need a way to remember which name we're using. Right now both names find commit H, so in some sense it doesn't matter which name we're using, but we are about to change that. So, to remember the name we're using, we'll attach the special name HEAD, written in all uppercase like this, to one branch name:

...--G--H   <-- develop, main (HEAD)

If we now use git switch develop or git checkout develop, we get:

...--G--H   <-- develop (HEAD), main

We are still using commit H, we're just doing that through the name develop now.
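
The same picture as a sketch you can run in a scratch repository (git switch needs Git 2.23 or later; the identity and file name are placeholders):

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"            # hypothetical identity
git config user.email "user@example.com"

echo hello > file.txt
git add file.txt
git commit -q -m "H"

# A second name pointing at the same commit.
git branch develop
main_hash=$(git rev-parse main)
develop_hash=$(git rev-parse develop)

# HEAD is attached to one branch name at a time.
before=$(git symbolic-ref --short HEAD)
git switch -q develop
after=$(git symbolic-ref --short HEAD)
current_commit=$(git rev-parse HEAD)

echo "HEAD was attached to $before, is now attached to $after"
```

Only the attachment of HEAD changes here; the commit in use stays the same.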

Besides branch names, Git lets us have tag names and remote-tracking names. These will generally also point to commits (though tag names often do so through an extra Git object, so as to be able to store annotations). I'll come back to this in a while.

Adding new commits: Git's index or staging area

When we make a new commit, we:

  1. use git checkout or git switch to pick a starting commit;
  2. modify some files and run git add on them;
  3. run git commit.

You might have wondered why we have to run git add over and over again: didn't Git get the memo about those files the first time?

The answer here has to do with Git's index. This part of Git is quite central and important, and Git forces you to know about it. Other version control systems may have something like the index, but keep it well hidden so that you don't have to care, but Git is very insistent here. (It's so important that it actually has three names: it's not just the index or staging area: Git sometimes calls it the cache. This is probably the worst of the three names, and is now mostly just found in flags like git rm --cached.)

Now, we already know that the snapshot inside a commit is read-only, and in fact only Git can read these files. So Git is going to have to copy the files from the commit into a more usable form. These usable copies live in what Git calls your working tree: the place you do your work.

That's pretty straightforward: "checking out" some commit means extracting its files. What's not straightforward is that the way Git does this is in two steps:

  • First, Git "copies" the files to its index, removing from its index the files that are there from the previously checked out commit.

  • Then Git copies the files from the index to your working tree.

The word "copies" in the first step is in quotes because what Git puts in the index is in Git's compressed and de-duplicated format. Since these files all came out of some commit, they're automatically duplicates. That means they take no space.[1] But still, they act like copies.

The copies in your working tree are ordinary files, in your computer's ordinary format, de-compressed, so they really are copies. And that's why you have to run git add so often.

What git add does is read the working tree version, compress and Git-ify it, and check to see if it's a duplicate. If it is a duplicate, there's already a copy of the data in the repository, and Git re-uses that copy. If it's not a duplicate, the compressed data are now ready to be the first copy, and Git uses that copy. Either way, Git now updates the index entry for the file and the file is ready to be committed.

So git add really means update my proposed next commit, and the index is really your proposed next commit. It starts out matching the current commit and as you git add files to it, you update the proposed commit.
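
To watch the index entry change, here is a rough sketch using git ls-files -s, which prints each index entry's mode, hash ID, and stage number. The file name and identity are invented for the demo:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"            # hypothetical identity
git config user.email "user@example.com"

echo "version 1" > file.txt
git add file.txt
git commit -q -m "first"

# Edit the working tree copy; the index still holds the committed version.
echo "version 2" > file.txt
committed=$(git rev-parse HEAD:file.txt)
index_before=$(git ls-files -s file.txt | cut -d' ' -f2)

# git add compresses the new content and updates the index entry.
git add file.txt
index_after=$(git ls-files -s file.txt | cut -d' ' -f2)
worktree_hash=$(git hash-object file.txt)

echo "index before add: $index_before"
echo "index after add:  $index_after"
```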

This makes git commit's job faster and easier. When you run git commit:

  1. Git gathers any needed metadata, such as your name and email address and a log message, for the new commit. Git uses the current commit, as found by reading HEAD to see which branch name is the current branch, and reading the branch name to find out the commit's hash ID, as the parent for the new commit.

  2. Git turns the proposed commit, in the index, into an actual commit, using the metadata from step 1. Git writes all this out as a new commit object, into the big commit-objects-database. Since this is a new commit, it gets a new, unique, random-looking (but not random) hash ID: this is the first time Git actually finds the hash ID. Since the hash ID depends on all of the data—including not just your name and email address, but also the exact second at which you make the commit—there's no way to know in advance what the hash ID would be.

  3. As its last trick, git commit writes the new commit's hash ID into the current branch name.

The result is that we go from:

...--G--H   <-- develop (HEAD), main

to:

...--G--H   <-- main
         \
          I   <-- develop (HEAD)

where commit I is our new commit.

If we now run git checkout main, Git will remove, from its index and our working tree, all the files that are in commit I (they're safely saved forever[2] in that commit) and fill in its index and our working tree from the files saved in commit H.[3]
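
The whole sequence can be sketched in a scratch repository like this (file names and identity are placeholders): the new commit moves develop, main stays on H, and switching back to main removes the new file from the working tree while it stays saved in commit I.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"            # hypothetical identity
git config user.email "user@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "H"
hash_of_H=$(git rev-parse HEAD)

# Create develop at H, attach HEAD to it, and make commit I.
git switch -q -c develop
echo new > new.txt
git add new.txt
git commit -q -m "I"

hash_of_I=$(git rev-parse develop)
parent_of_I=$(git rev-parse develop~1)
main_now=$(git rev-parse main)

# Back on main, new.txt leaves the working tree but remains in commit I.
git switch -q main
still_saved=$(git show develop:new.txt)
echo "main: $main_now, develop: $hash_of_I"
```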


[1] There's some space per index entry for the file's name, mode, cache data, and hash ID. The actual amount varies, but averages a bit under 100 bytes per file in a lot of cases.

[2] Forever, or as long as the commit itself exists, that is.

[3] Git plays a bunch of speed and cleverness tricks here: it can tell, easily, which files are different in H and I, and can skip the remove-and-replace for any files that aren't different. When taken to the extreme of switching between branch names that all point to the same commit, this makes git checkout or git switch really fast, as nothing actually changes except the binding of HEAD. It also means that we can switch branches with uncommitted work lying around, since Git won't have to swap out the files.


How branches grow; merging

This shows us how branches grow. We might start with:

...--G--H   <-- main

From here, we create two new branch names, feature1 and feature2 or feature/tall and feature/short or whatever. I'll just use br1 and br2:

...--G--H   <-- main, br1, br2

We pick br1 to be "on", check it out, and make two new commits:

          I--J   <-- br1 (HEAD)
         /
...--G--H   <-- main, br2

Now we check out br2 and make two other new commits:

          I--J   <-- br1
         /
...--G--H   <-- main
         \
          K--L   <-- br2 (HEAD)

We'll stop bothering to draw in the name main in a moment: it's not important, since what matters to Git are the commits. Note that commits up through H are on all three branches, while commits I-J are only on br1 and commits K-L are only on br2.

Let's now check out br1 and run git merge br2, and look very quickly at how git merge works:

          I--J   <-- br1 (HEAD)
         /
...--G--H
         \
          K--L   <-- br2

The git merge command needs to combine work. To do so, it has to find a common starting point: the best commit that's on both branches. Git will do the usual work-backwards-from-the-end thing to do this. (Technically Git uses the Lowest Common Ancestor algorithm for this, using the extension to DAGs.) In this case the best shared commit is obvious by eyeball though: it's commit H.

In order to combine work, Git has to figure out changes. To do that, Git will run git diff on two commits at a time. We start with:

git diff --find-renames <hash-of-H> <hash-of-J>   # what we changed

By comparing the snapshots in H and J, Git can figure out what files we modified, and what we did to those files.

Next, Git repeats this diff, this time from H to L:

git diff --find-renames <hash-of-H> <hash-of-L>   # what they changed

Again Git gets a list of files that were changed, and what happened to them.
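
You can reproduce these two diffs by hand. The sketch below builds the br1/br2 picture in a scratch repository, asks git merge-base for the common starting point, and lists what each side changed. The file names are made up, and git merge itself does this internally rather than by literally running these commands:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"            # hypothetical identity
git config user.email "user@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "H"
git branch br1
git branch br2

# Commits I and J on br1.
git switch -q br1
echo "ours 1" > ours.txt
git add ours.txt
git commit -q -m "I"
echo "ours 2" >> ours.txt
git commit -q -am "J"

# Commits K and L on br2.
git switch -q br2
echo "theirs 1" > theirs.txt
git add theirs.txt
git commit -q -m "K"
echo "theirs 2" >> theirs.txt
git commit -q -am "L"

git switch -q br1
base=$(git merge-base br1 br2)                                   # commit H
we_changed=$(git diff --find-renames --name-only "$base" br1)    # H vs J
they_changed=$(git diff --find-renames --name-only "$base" br2)  # H vs L
echo "merge base: $base"
echo "we changed: $we_changed; they changed: $they_changed"
```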

The job of git merge is now to combine these changes. Git should then apply the combined changes to the base version.

For a file that we changed, and they didn't, that's easy: take our changes—or our version of that file from our commit. In other words, "file from H, plus our changes, equals file from J" so Git can just take the file from J. This applies even if we create a new file from whole cloth, for instance: if we added a file, and they didn't, Git can just take ours.

For a file that nobody changed, Git can take any version, as they're all the same.

For a file that they changed and we didn't, Git can just take their version. This also applies to files they deleted, that we didn't, for instance: Git can just delete the file.

Only when both of us changed the same file does it get a little tricky. Git now really does have to combine the two sets of changes. This can produce a merge conflict, if we both touched the same original lines. (I'm skipping over a lot of finicky details here too.)

Assuming all goes well, though, Git simply combines all of our and their changes, applies those combined changes to the files from H, and makes a new merge commit from the result. This new merge commit has a single snapshot—just like any commit—but instead of having a single parent, Git adds, to the usual single parent, a second parent. New merge commit M points back to commit J, like a regular commit would, but then also points to L:

          I--J
         /    \
...--G--H      M   <-- br1 (HEAD)
         \    /
          K--L   <-- br2

Then, having written out merge commit M, Git writes its hash ID into the current branch name (br1, where HEAD is attached) as usual, and our merge is done.
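
To check the two-parent structure, here is a sketch that performs such a merge in a scratch repository and reads back both parents of M (HEAD^1 and HEAD^2). File names and identity are invented, and each side touches a different file so that no conflict occurs:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"            # hypothetical identity
git config user.email "user@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "H"
git branch br2

# Our side: one commit on br1 (standing in for I--J).
git switch -q -c br1
echo ours > ours.txt
git add ours.txt
git commit -q -m "J"
hash_of_J=$(git rev-parse HEAD)

# Their side: one commit on br2 (standing in for K--L).
git switch -q br2
echo theirs > theirs.txt
git add theirs.txt
git commit -q -m "L"
hash_of_L=$(git rev-parse HEAD)

# Merge br2 into br1; Git makes merge commit M with two parents.
git switch -q br1
git merge -q --no-edit br2
first_parent=$(git rev-parse HEAD^1)
second_parent=$(git rev-parse HEAD^2)

git --no-pager log --graph --oneline
```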

Your situation

I have a feature/mine branch that I want to merge into feature/other, however somewhere during development there were changes made in specific folders that are still in feature/mine and should not be in feature/other. So, the diff in my PR shows changes in src/some-folder-I-dont-want-to-change, but these changes were made deep down the commit tree. I can't simply revert some commits.

At this point you're talking about a GitHub pull request, rather than a Git merge. GitHub PRs are specific to GitHub. Bitbucket also have Pull Requests, and I'm the one who added the GitHub tag to your post, so perhaps you're actually talking about a Bitbucket pull request. Fortunately, while some details are different, the overall setup is the same here. (GitLab call theirs merge requests, and again, there are some differences in detail, but overall the idea is the same.)

To make a GitHub PR, you:

  • send your new commits to some repository on GitHub: either a GitHub fork, or a shared repository (it doesn't really matter which); then
  • use a GitHub web page clicky button, or the gh CLI script, to create the pull request: this generates a test merge on GitHub to see if the merge will work, and if not, they tell you about merge conflicts.

To make that test merge, GitHub:

  • found a merge base commit;
  • did a test merge, which presumably worked;
  • is now showing that you've changed some files in src/some-folder-I-dont-want-to-change

Assuming that you didn't change those files yourself, what that means is that the merge base that GitHub are using here is such that the common starting point comes before someone else changed those files.

That is, suppose you have:

         J   <-- feature/mine (HEAD)
        /
       I   <-- feature/third
      /
...--H
      \
       K--L   <-- origin/feature/other

where you made one commit J that changes files that aren't in this src/some-folder-I-dont-want-to-change location. The comparison between H and J, however, includes commit I.

Meanwhile, the comparison from H to L doesn't include any changes to src/some-folder-I-dont-want-to-change.

Your PR therefore shows changes to src/some-folder-I-dont-want-to-change.
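
Here is a sketch that reproduces this graph locally so you can check whether it matches your case. A local branch feature/other stands in for origin/feature/other, and the file names (config.txt, app.txt, other.txt) are invented:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"            # hypothetical identity
git config user.email "user@example.com"

folder=src/some-folder-I-dont-want-to-change
mkdir -p "$folder"
echo base > "$folder/config.txt"
echo base > src/app.txt
git add .
git commit -q -m "H"
git branch feature/other

# Commit I: someone else's change to the folder, on feature/third.
git switch -q -c feature/third
echo "third-party change" > "$folder/config.txt"
git commit -q -am "I"

# Commit J: your own change, made on top of I.
git switch -q -c feature/mine
echo "my change" > src/app.txt
git commit -q -am "J"

# Commits K and L on the target branch.
git switch -q feature/other
echo k > other.txt
git add other.txt
git commit -q -m "K"
echo l >> other.txt
git commit -q -am "L"
git switch -q feature/mine

# What the PR compares: merge base (H) against your tip (J).
base=$(git merge-base feature/other feature/mine)
pr_diff=$(git diff --name-only "$base" feature/mine)

# What you changed yourself: I against J.
my_commit_only=$(git diff --name-only feature/third feature/mine)

echo "PR diff:"
echo "$pr_diff"
echo "Changed by commit J alone: $my_commit_only"
```

The PR diff lists the folder's file even though commit J never touched it, because the comparison starts at H and so picks up commit I.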

You can, if you wish, obtain commit L—that's their most recent commit—and extract from it all the files that are in src/some-folder-I-dont-want-to-change, in that form, and commit the result:

         J--M   <-- feature/mine (HEAD)
        /
       I   <-- feature/third
      /
...--H
      \
       K--L   <-- origin/feature/other

Now, a comparison from H to M shows no changes to src/some-folder-I-dont-want-to-change. The problem is, your branch name feature/mine now means commit M, which—by going back one hop at a time—includes commit I, which means you have now backed out the changes from feature/third. That is, your commit M might as well be a revert.

If they (whoever they are) accept your updated PR, with its proposed merge of M into L, the result will include commit I, provided they use the MERGE button on GitHub. GitHub have three different clicky web buttons here:

  • MERGE: this does an actual Git merge. You've now set things up so that Git will believe that commit I is properly incorporated. This means whoever did make commit I has to re-do their work. That's the future time-bomb.

  • REBASE AND MERGE: this makes the Git software on GitHub copy each commit in the PR to a new-and-supposedly-improved commit. This will have a similar effect, though it changes how whoever did make commit I has to handle things.

  • SQUASH AND MERGE: this doesn't create a merge at all. This prevents this particular time-bomb. They replace all of "your" commits—including the commit I that you inherited from someone else—with a single commit. The effect is that whoever did make commit I doesn't have to re-make it, because nobody will ever see your commits J and M in the end (and you'll have to discard your branch, as with any squash).

Understanding how and why this works is the key here. If you're going to set up a future time-bomb, whoever made the commit that you are in effect reverting needs to know about this. If you're using the squash-and-merge method (which has other drawbacks), that has the beneficial side effect of defusing this entirely.

Side note: origin/feature/other

The name origin/feature/other is a remote-tracking name. This has the same effect as a branch name, in terms of finding commits. The key difference between a branch name and a remote-tracking name is that the latter is something your Git creates to remember some other Git repository's branch name. You cannot switch to a remote-tracking name (git checkout produces a detached HEAD, and git switch produces an error).

When you run git fetch to obtain commits from some other Git repository, your Git reads that other Git's branch names. To keep your branch names yours—to avoid overwriting yours with theirs—your Git software renames their branch names. Your Git then creates or updates these remote-tracking names instead. The remote-tracking name, such as origin/main or origin/develop or whatever, has the name of the remote origin stuck in front of it: hence the term remote-tracking name. (The Git documentation calls these remote-tracking branch names but I find the word branch here to have negative value: removing it produces an improved term.)
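
A sketch of how a remote-tracking name appears, using two scratch repositories on local disk (the directory names upstream and clone are just for the demo, and git switch needs Git 2.23 or later):

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# "upstream" plays the role of the other Git repository.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"            # hypothetical identity
git config user.email "user@example.com"
echo base > file.txt
git add file.txt
git commit -q -m "base"
cd ..

git clone -q upstream clone

# Someone creates feature/other in the other repository.
cd upstream
git checkout -q -b feature/other
echo more >> file.txt
git commit -q -am "work on feature/other"
upstream_tip=$(git rev-parse feature/other)

# git fetch records their branch under a remote-tracking name.
cd ../clone
git fetch -q origin
tracked_tip=$(git rev-parse origin/feature/other)

# No local branch of that name was created by the fetch.
if git show-ref --verify --quiet refs/heads/feature/other; then
    has_local=yes
else
    has_local=no
fi

# git switch refuses to attach HEAD to a remote-tracking name.
if git switch -q origin/feature/other 2>/dev/null; then
    switched=yes
else
    switched=no
fi
echo "origin/feature/other is at $tracked_tip (local branch: $has_local)"
```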

Rebasing

Assuming this diagram, or a similar one:

         J--M--N   <-- feature/mine (HEAD)
        /
       I   <-- feature/third
      /
...--H
      \
       K--L   <-- origin/feature/other

captures the real problem, you have another alternative. You can replace your existing commits J-M-N with new-and-improved commits that start from H, rather than from I. The git rebase command assists with this.

To do this, you would want to run:

git switch feature/mine
git rebase --onto <hash-of-H> feature/third

If the name main points to commit H, you could use:

git switch feature/mine
git rebase --onto main feature/third

You could use the raw hash ID of commit I rather than the name feature/third here as well. The general idea is that we want to tell git rebase two things:

  1. Put the copies after commit H: that's why we need the hash of commit H, or the name main. Anything that locates the right commit will work here.

  2. Using the current branch, copy commits, but don't copy commit I or anything earlier than I. Anything that locates commit I suffices here: the name feature/third does it, but so does the raw hash ID of commit I itself.

The rebase command will start at the current commit, here N, and work backwards until it reaches commits that are to be excluded from the copying: in this case, commit I and earlier commits. These are the candidates for copying.[4] It then puts the to-be-copied commits into the right order, saving their actual hash IDs in a work-list. (An interactive rebase, git rebase -i, allows you to edit the work list, but you probably don't want that here.)

Once the list is ready, git rebase uses Git's detached HEAD mode to copy each to-be-copied commit, one at a time. Rather than detail how this works—though I will say here that it uses git cherry-pick or equivalent—I'll just draw the final result that you get if all goes well:

         J--M--N   [abandoned]
        /
       I   <-- feature/third
      /
...--H--J'-M'-N'   <-- feature/mine (HEAD)
      \
       K--L   <-- origin/feature/other

This final result has copies of your commits, but omits the not-copied I commit. So you no longer have any changes to the files you didn't change: your three—or however many—commits start from the same base commit as origin/feature/other. So if you now use git push --force to update your GitHub PR, the apparent changes to these other files will vanish.
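
Before doing this on a real branch, you may want to rehearse it in a scratch repository. The sketch below assumes that main points to commit H and that your three commits touch only a file outside the folder (app.txt, an invented name), so the copies apply without conflicts:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"            # hypothetical identity
git config user.email "user@example.com"

folder=src/some-folder-I-dont-want-to-change
mkdir -p "$folder"
echo base > "$folder/config.txt"
echo base > app.txt
git add .
git commit -q -m "H"

# Commit I on feature/third changes the folder.
git switch -q -c feature/third
echo "someone else's change" > "$folder/config.txt"
git commit -q -am "I"

# Commits J, M, N on feature/mine, built on top of I.
git switch -q -c feature/mine
for name in J M N; do
    echo "mine $name" >> app.txt
    git commit -q -am "$name"
done

# Copy everything after feature/third (J, M, N) so it lands after main (H).
git rebase -q --onto main feature/third

copied=$(git rev-list --count main..feature/mine)
folder_diff=$(git diff --name-only main feature/mine -- "$folder")
if git merge-base --is-ancestor feature/third feature/mine; then
    includes_I=yes
else
    includes_I=no
fi
echo "copied commits: $copied, still includes I: $includes_I"
```

In a real repository, the rebased branch then has to be pushed with git push --force (or --force-with-lease), since the commit hash IDs have changed.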

No matter what merge strategy someone uses, the merge won't include any changes to these other files, and there is no time-bomb set up. So this method is often superior. This is what git rebase is for: to set up new-and-improved commits that do only what you want done.


[4] This candidate list is normally further winnowed by:

  • removing any merge commits;
  • removing from the list commits whose patch IDs match those in the upstream; and
  • using the fork-point trick if --fork-point is enabled.

In this case, none of these should have much effect. If you have merge commits in the list, though, things get much more complicated.
