Git merge: possible to duplicate directory? trying to manage multiple versions in one repo

Time:06-08

I'll start by saying that I'm pretty new to git, and have mainly been using it with Atom's built-in plug-in (as well as Git-plus).

When I deploy to production, I'm going to need multiple versions of my code each in its own directory. My initial thought was to only manage one version of the code (in Dev branch) in a directory called dev, and when a version is ready: tag it, then merge into Release branch. In here I would run a script to change the namespaces, directory name (dev --> v1), etc. Then merge this into Main.

DEV                    RELEASE (pre-script)   RELEASE (post-script)  MAIN (PRODUCTION)
======================================================================================
app/                   app/                   app/                   app/
|--src/                |--src/                |--src/                |--src/                
|----globals/          |----globals/          |----globals/          |----globals/
|----api/              |----api/              |----api/              |----api/
|------dev/            |------dev/            |------v1/             |------v1/
|--------file1-dev     |--------file1-dev     |--------file1-v1      |--------file1-v1
|--------file2-dev     |--------file2-dev     |--------file2-v1      |--------file2-v1             

Then I go back to Dev and begin work on v2 (say I modify file1 and file2, and add a new file3) finish it, tag it, merge into Release.

Here's the issue with this Dev-Release merge!

Git is too smart and adds my v2 changes to Release's v1 folder instead of adding Dev's dev folder, as I was mistakenly expecting (I would normally run the script on that folder to turn it into v2). So this is what it looks like:

DEV                    RELEASE (pre-script)   RELEASE (what I need)
===================================================================
app/                   app/                   app/
|--src/                |--src/                |--src/                
|----globals/          |----globals/          |----globals/
|----api/              |----api/              |----api/
|------dev/            |------v1/             |------v1/
|--------file1-dev     |--------file1-v1(mod) |--------file1-v1
|--------file2-dev     |--------file2-v1(mod) |--------file2-v1             
|--------file3-dev     |--------file3-dev     |------dev/
                                              |--------file1-dev      
                                              |--------file2-dev      
                                              |--------file3-dev      

I was expecting the Release branch to keep the v1 folder untouched and bring in the "new" dev folder, but that wasn't the case. Git recognized that dev and v1 were one and the same, so it modified the files in v1 instead of creating new ones in dev.

The only alternative that I can think of is to manually manage the "production-ready" directory, and do so outside of the repo. In this case I would no longer need the Release and Main branches, so once I'm done in Dev and tagged the version, I would just duplicate my repo into a temp directory (in which I run the script to turn dev -> v1), then move this v1 into the separate production directory. This seems a little silly and error-prone, but doable.

Any thoughts, comments, solutions, or even feedback? I basically thought it was possible to have a dedicated branch that contained all the versions of the code.

Thanks,

CodePudding user response:

TL;DR

You can probably make use of -X find-renames= with a high threshold value (e.g., 100), or -X no-renames to disable rename detection entirely.

Long

Personally, I think your best bet is probably a deployment process that happens outside Git, or at least, that does its work outside Git and then just makes a commit from the result. However, it's easy to get what you want from Git directly (though this has other consequences). The trick here is that for Git, each commit is literally a snapshot of all files, but, well:

For git merge as a command you run—"merge as a verb" as it were—Git locates three snapshots, runs two git diff operations, and then combines the two diffs, including detecting renames. The combined diffs are then applied to one of the three snapshots. The goal here is to combine work. This brings up what should be an obvious question, except that a lot of people seem to miss it:

If a commit is just a snapshot of files, what does "work" even mean?

The trick here is that a commit isn't just a snapshot of files. A commit is, instead, a snapshot plus metadata. The metadata in a commit include things like the name and email address of the author and committer of the commit in question, and any log message they cared to provide to help future readers of the code figure out why they made that particular commit. But, crucially for Git's internal operation, each commit also includes a list of raw commit hash IDs.

(Let's pause for a moment here to note that every commit has a unique hash ID. When I say unique, I mean unique: no other commit anywhere, in any Git repository, is allowed to use that same hash ID. Every commit you make gets a new hash ID, never used before, never to be used again. In theory, this can't work, but in practice it does; the hash ID space is big enough to make collisions unlikely, though it's possible to make them on purpose: see How does the newly found SHA-1 collision affect Git? There's a lot more one can explore here, but for practical purposes, each hash ID is unique, which means that the hash ID alone is sufficient to know exactly which commit is which.)
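As a rough illustration of where these hash IDs come from, here is a minimal Python sketch of the classic SHA-1 object-ID scheme, in which the ID is a hash over a short header plus the object's content. The helper names (git_object_id, commit_body) and the author line are made up for this example, and the tree and parent IDs in the commit body are placeholders rather than real objects.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 over '<kind> <size>' + NUL + body (classic Git object IDs)."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(parent: str) -> bytes:
    """Build a hypothetical commit body; every value here is a placeholder."""
    return (
        "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        f"parent {parent}\n"
        "author A U Thor <author@example.com> 0 +0000\n"
        "committer A U Thor <author@example.com> 0 +0000\n"
        "\n"
        "example\n"
    ).encode()


# An empty file always gets the same blob ID, in every repository.
EMPTY_BLOB = git_object_id("blob", b"")

# Two commits that differ only in their parent get different IDs.
ID_ONE = git_object_id("commit", commit_body("1" * 40))
ID_TWO = git_object_id("commit", commit_body("2" * 40))

print(EMPTY_BLOB)
print(ID_ONE != ID_TWO)
```

Because the parent hash IDs are part of what gets hashed, changing anything in a commit's ancestry changes that commit's own ID too.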

So, given some arbitrary commit whose hash we'll just call H for hash, we can have Git look up the commit from the hash ID. This includes a list of previous or parent commit hash IDs. Most commits have exactly one parent here; if H is such a commit, it contains some hash ID that, for convenience, let's just call G. We then say that commit H points to its parent commit G, and we can draw that:

        <-G <-H

Of course G is also a commit and therefore points to its parent F, which in turn is a commit that points to its parent:

... <-F <-G <-H

Induction says that the result of this is a backwards-looking chain of commits: we start at the end, with the latest commit, and we can find every earlier commit, all the way back to the very first commit, whose list of previous-commit hash IDs is empty so that we can stop going backwards.

This gives us a simple definition for "the work done in a commit" like commit H: we look up the snapshot for H, but also, using the metadata in H, we look up the snapshot for its parent G. Then we use git diff—a comparison or file differencing engine—to compare the two snapshots. Whatever spills out of this diff is the change made by commit H.
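The idea of "snapshot plus parent pointer" can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions: the Commit class, the history and changes helpers, and the file contents are all invented here, and real Git compares file contents line by line rather than just flagging whole files.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    name: str                                    # stand-in for the hash ID
    snapshot: dict                               # path -> full file content
    parents: list = field(default_factory=list)  # list of parent Commit objects


def history(tip):
    """Walk backwards from tip through first parents until a root commit."""
    chain = []
    commit = tip
    while commit is not None:
        chain.append(commit.name)
        commit = commit.parents[0] if commit.parents else None
    return chain


def changes(commit):
    """Compare a commit's snapshot with its first parent's snapshot."""
    old = commit.parents[0].snapshot if commit.parents else {}
    new = commit.snapshot
    result = {}
    for path in sorted(set(old) | set(new)):
        if path not in old:
            result[path] = "added"
        elif path not in new:
            result[path] = "deleted"
        elif old[path] != new[path]:
            result[path] = "modified"
    return result


# F is treated as the root here only to keep the example short.
F = Commit("F", {"a.txt": "one\n"})
G = Commit("G", {"a.txt": "one\n", "b.txt": "two\n"}, [F])
H = Commit("H", {"a.txt": "ONE\n"}, [G])

print(history(H))   # ['H', 'G', 'F']
print(changes(H))   # {'a.txt': 'modified', 'b.txt': 'deleted'}
```

Note that commit H stores nothing about "modifying" or "deleting": the change list only appears once the two snapshots are compared.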

Branching

Where this becomes particularly interesting is when we get branch-and-merge sequences. Let's start with the branching part:

          I--J   <-- br1
         /
...--G--H
         \
          K--L   <-- br2

Here, we have a simple tree-like data structure formed by commits where the work diverged after commit H. We made two commits I-J on branch br1 (so that the name br1 points to the last such commit), and we made two commits K-L on branch br2 (so that the name br2 points to the last such commit.)

(Here we can pause for another moment to note that internally, in Git, a branch name just points to one single commit. We say that this commit is the last commit on that branch: it's the tip commit of the branch. Checking out the commit involves extracting the saved snapshot in that tip commit, and then that branch name is the current branch and that commit is the current commit. Then, if we make a new commit without changing branches, the new commit gets a new unique hash ID and points to the current commit as its parent, and then Git simply stuffs the new hash ID into the branch name so that the name now points to the new commit. That's how we got our divergent situation: we had two names both pointing to H. We advanced one, making I and J, then went back to the other name to get H again and then advanced the other name to make K-L.)
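The "branch name is just a pointer to one commit" behavior can be modeled in a few lines of Python. This is a toy sketch, not Git's implementation: the Repo class and its method names are invented, commits are plain letters, and G is treated as the first commit only for brevity.

```python
class Repo:
    """Toy model: a branch is a name that maps to exactly one commit."""

    def __init__(self):
        self.parents = {}    # commit name -> list of parent commit names
        self.branches = {}   # branch name -> tip commit name
        self.head = None     # name of the current branch

    def switch(self, branch_name):
        self.head = branch_name

    def branch(self, new_name):
        # A new branch starts out pointing at the current tip commit.
        self.branches[new_name] = self.branches[self.head]

    def commit(self, name):
        tip = self.branches.get(self.head)
        self.parents[name] = [tip] if tip else []
        self.branches[self.head] = name   # the branch name moves forward


repo = Repo()
repo.switch("br1")
for name in ("G", "H"):
    repo.commit(name)
repo.branch("br2")          # br1 and br2 now both point to H
for name in ("I", "J"):
    repo.commit(name)       # advances br1 only
repo.switch("br2")
for name in ("K", "L"):
    repo.commit(name)       # advances br2 only

print(repo.branches)                          # {'br1': 'J', 'br2': 'L'}
print(repo.parents["I"], repo.parents["K"])   # ['H'] ['H']
```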

Merging

Now that we have this branch-y structure in the repository, we can see how git merge will do its "combine changes" work. We check out one of br1 or br2 (or use git switch to make it current and check out its tip commit, which means the same thing but was a new command in Git 2.23 because git checkout is overly complicated and does too many things). Let's say we switch to br1 here, and annotate it by attaching the special name HEAD to the branch name (which is in fact what Git does):

          I--J   <-- br1 (HEAD)
         /
...--G--H
         \
          K--L   <-- br2

We then run git merge br2. Git locates three commits:

  • commit J (--ours) because it is our current commit;
  • commit L (--theirs) because it is the commit we named on the command line, using the branch name, which translates to the tip commit of that branch; and
  • commit H because it is the best commit that's on both branches (or technically, the lowest common ancestor[1]).

So Git now runs git diff using the snapshot in H as the merge base commit, against the snapshot in J, to see "what we changed". Then Git runs git diff using the same merge-base snapshot against the snapshot in L to see "what they changed". The resulting line-by-line diffs can now be combined by a simple and stupid text-combination algorithm. Where the diffs don't overlap or abut, Git just takes both sets of changes. Where the diffs do overlap, Git takes one copy of the change if and only if they overlap exactly. Where the diffs overlap and don't match, or simply abut,[2] Git will declare a merge conflict, stop the merge in the middle, and make you fix up the mess. We'll skip over all these details for space reasons here, though.
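The combining rules can be sketched in Python at whole-file granularity. This is a deliberate simplification: Git combines changes line by line within each file, while this toy treats each file as a single unit. The function name merge_snapshots and the sample file contents are assumptions made for the example.

```python
def merge_snapshots(base, ours, theirs):
    """Whole-file three-way merge: returns (merged snapshot, conflicted paths)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (this includes both deleting)
            result = o
        elif o == b:      # only they changed it: take theirs
            result = t
        elif t == b:      # only we changed it: take ours
            result = o
        else:             # both changed it, in different ways
            conflicts.append(path)
            result = o    # keep ours as a placeholder for manual fixing
        if result is not None:
            merged[path] = result
    return merged, conflicts


base = {"a.txt": "1", "b.txt": "2", "c.txt": "3"}                    # like commit H
ours = {"a.txt": "1x", "b.txt": "2", "c.txt": "3o"}                  # like commit J
theirs = {"a.txt": "1", "b.txt": "2y", "c.txt": "3t", "d.txt": "4"}  # like commit L

merged, conflicts = merge_snapshots(base, ours, theirs)
print(merged)      # {'a.txt': '1x', 'b.txt': '2y', 'c.txt': '3o', 'd.txt': '4'}
print(conflicts)   # ['c.txt']
```

Swapping ours and theirs gives the same result for every path that does not conflict.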

Then, having combined the two sets of diffs, Git applies the combined diffs to the snapshot in the merge base commit, i.e., commit H. This "keeps our changes and adds theirs", or equivalently, "keeps their changes and adds ours": the result is symmetric.[3] However, the merge commit is not quite symmetric. Git now makes a new commit of type "merge commit", which we can draw this way:

          I--J
         /    \₁
...--G--H      M   <-- br1 (HEAD)
         \    /²
          K--L   <-- br2

New commit M goes on the current branch, like any new commit, so the name br1 now points to M. New commit M has a snapshot, like any commit: that's the snapshot that the merge-as-a-verb, combine-work process generated. But this merge commit differs from other commits in exactly one way: it has a list of two parents, rather than just one. It points backwards to commit J, the previous branch-tip commit, as its first parent (hence the tiny 1), and points backwards to commit L as its second parent (hence the tiny 2).[4]

Most of the time, we don't care about first-vs-non-first merge parents, but we can run git log --first-parent to follow only the first parent at merge commits, which lets us view a "main line of development" without having to see every commit on every feature: we see only the merge commits that bring in the feature. If you like this idea of being able to use --first-parent, you may wish to avoid git pull, at least if used without --rebase: see How can I prevent foxtrot merges in my 'master' branch? But that's another matter; let's press on.
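To see what following only first parents means, here is a small Python sketch over the merge graph drawn earlier. The PARENTS table and both helper names are made up for the illustration, and G is treated as a root commit because the diagram elides everything before it.

```python
# Parent lists for the graph with merge commit M; the first parent comes first.
PARENTS = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
    "M": ["J", "L"],
}


def first_parent_log(tip):
    """Follow only the first parent at each commit."""
    log = []
    while tip is not None:
        log.append(tip)
        parents = PARENTS[tip]
        tip = parents[0] if parents else None
    return log


def reachable(tip):
    """Every commit reachable from tip through any parent."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(PARENTS[commit])
    return seen


print(first_parent_log("M"))    # ['M', 'J', 'I', 'H', 'G']
print(sorted(reachable("M")))   # ['G', 'H', 'I', 'J', 'K', 'L', 'M']
```

The first-parent walk never visits K or L, even though both are part of the history of M.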


[1] As we're using the full Directed Acyclic Graph version of LCA here, there can be more than one LCA. See one of my longer answers about recursive merge to see how Git handles this.

[2] Some merge algorithms allow abutting changes without complaint; Git doesn't, and there's no selector knob to turn to enable them. That's just a choice Git made at some point: it's not really fundamental to anything.

[3] In the case of conflicts, you can use flags that make the result asymmetric; we're only really concerned with the simple conflict-free case here.

[4] Git supports something it calls an octopus merge, in which a merge commit has three or more parents. We won't cover these here either, but they don't really do anything special; they just make the Directed Acyclic Graph harder to think about.


Merges provide new merge bases

If we look at the way main-lines and feature branches work in a simplified world, we can see how each feature can be provided as a sort of "merge bubble" that joins into the main line:

                    o--o--o   <-- feature2
                   /       \
...--o--B---------M---------N   <-- main
         \       /
          o--o--o   <-- feature1

The merge base at M, back when we make M, is commit B; the merge base at N, once we make N, is commit M. These are much simpler than real-world merges though, which may get messy like this:

...--o--B----o----M--P--o--S--o   <-- main
         \       /    \     \
          o--o--L--N---R--o--T   <-- feature

Here we made a branch for a feature, branching off from commit B, which was the merge base when we made an early merge M back into the mainline:

...--o--B----o----M   <-- main
         \       /
          o--o--L   <-- feature

But we were not yet done with the feature, so we went on to make a new commit N on feature, then had an important fix to go directly to main which became commit P (I skipped the letter O because it's too similar to the lowercase o for the more boring commits here):

...--o--B----o----M--P   <-- main
         \       /
          o--o--L--N   <-- feature

We decided to merge the fix P into feature too, resulting in merge commit R (skipping Q as it also is too similar to O or o):

...--o--B----o----M--P   <-- main
         \       /    \
          o--o--L--N---R   <-- feature

But when we made R, what was the merge base? The merge base is the best shared commit, i.e., the best commit that's on both branches. That's no longer commit B, because if we start with N and work backwards, we see commit N and then L, and if we start with P and work backwards, we encounter commits M and then both o and L at the same time. So the merge base is now commit L!

Later, after making a few more commits, we have:

...--o--B----o----M--P--o--S   <-- main
         \       /    \
          o--o--L--N---R--o   <-- feature

Working our way back from the tip commit of feature, we pass through R and arrive at P; we arrive at P working backwards from S too, so P is our merge base here, when we make new commit T, and then add another commit to main:

...--o--B----o----M--P--o--S--o   <-- main
         \       /    \     \
          o--o--L--N---R--o--T   <-- feature

At this point commit S is now the merge base between main and feature. So you can see how the merge base, which is kind of the main control point for git merge, moves over time because we made earlier merge commits.
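The way the merge base moves can be checked mechanically with a short Python sketch over the messy graph. The labels a0, m1, m2, m3, f1, f2 and f3 are invented names for the unnamed "o" commits, a0 is treated as a root only because the diagram elides earlier history, and the functions are a naive model rather than Git's actual algorithm. On a real repository, git merge-base would answer the same question.

```python
# Parent lists for the messy main/feature graph; first parent listed first.
PARENTS = {
    "a0": [], "B": ["a0"],
    "m1": ["B"], "M": ["m1", "L"], "P": ["M"],
    "m2": ["P"], "S": ["m2"], "m3": ["S"],
    "f1": ["B"], "f2": ["f1"], "L": ["f2"], "N": ["L"],
    "R": ["N", "P"], "f3": ["R"], "T": ["f3", "S"],
}


def ancestors(commit):
    """The commit itself plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def merge_bases(a, b):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(a) & ancestors(b)
    return sorted(
        c for c in common
        if not any(c != d and c in ancestors(d) for d in common)
    )


print(merge_bases("m1", "L"))   # ['B']  (when making M)
print(merge_bases("N", "P"))    # ['L']  (when making R)
print(merge_bases("f3", "S"))   # ['P']  (when making T)
print(merge_bases("m3", "T"))   # ['S']  (after T exists)
```

The function returns a list because, in a general graph, there can be more than one best common ancestor.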

In essence, we have two rules:

  • the commits in the repository are the history in the repository;
  • merge commits in the repository provide for future merge bases.

That means that when we do make a merge commit, its snapshot and parentage are crucial to a future merge. It will help determine a future merge base, and the merge commit itself, or some derivative of it, will provide one of the snapshots that goes into the combining process.

Rename detection

When git merge runs the two git diff commands that it runs, it's git diff that decides whether some file was renamed. Renames in Git are detected rather than recorded: each commit holds a full snapshot of all files, but there is nothing in the data or metadata that says that in commit I, file foo.py is "the same file" as file bar.py in commit H, or not.

Other version control systems provide some sort of file identifier, e.g., an "inode number" or "file key" or something, so that we know that bar.py got renamed to foo.py. Git does not do this. Instead, if, looking at commit H as a whole, there's a file named bar.py, and looking at commit L as a whole, there's no file named bar.py but there's a "new" file named foo.py, Git effectively says to itself: Hey, maybe these are really the same file, just renamed!

This actually works surprisingly well in a lot of cases. Git notices: "hey, files X and Y existed on the left and are gone on the right, but new files F1, F2, F3, F4 appeared, maybe X is one of those and Y is one of those". Git computes a similarity index for all such file pairings. Done naively this is at least O(mn) with large constants, so Git applies some tweaks and clever optimizations, but even then the computation is pretty expensive. If the similarity index meets some minimum threshold, Git pairs up the candidates with the highest index values and calls those "the same file, but renamed".
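A rough Python sketch of this pairing step follows. It is an illustration only: difflib's ratio stands in for Git's similarity index (Git uses its own metric), the function names are invented, and the file contents are made up to resemble the dev to v1 namespace change from the question. The default threshold of 50 mirrors Git's documented default of 50%.

```python
from difflib import SequenceMatcher


def similarity(old_text, new_text):
    """Similarity as a percentage; a stand-in, not Git's own metric."""
    return round(100 * SequenceMatcher(None, old_text, new_text).ratio())


def detect_renames(old, new, threshold=50):
    """Pair vanished paths with new paths, best-scoring pairs first."""
    deleted = [path for path in old if path not in new]
    added = [path for path in new if path not in old]
    scored = sorted(
        ((similarity(old[d], new[a]), d, a) for d in deleted for a in added),
        reverse=True,
    )
    renames, used_old, used_new = {}, set(), set()
    for score, d, a in scored:
        if score >= threshold and d not in used_old and a not in used_new:
            renames[d] = a
            used_old.add(d)
            used_new.add(a)
    return renames


old = {
    "api/dev/file1": "namespace dev\ndef load_users(): return query('users')\n",
    "api/dev/file2": "namespace dev\nTIMEOUT = 30; RETRIES = 5; BACKOFF = 2\n",
}
new = {
    "api/v1/file1": "namespace v1\ndef load_users(): return query('users')\n",
    "api/v1/file2": "namespace v1\nTIMEOUT = 30; RETRIES = 5; BACKOFF = 2\n",
}

print(detect_renames(old, new))                  # both files pair up as renames
print(detect_renames(old, new, threshold=100))   # {} since no pair is identical
```

With the threshold at 100, only byte-for-byte identical files count as renames, which is why the script's namespace edits are enough to break the pairing in this toy.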

You can always run git diff manually before a merge: Use git merge-base to find the merge base commit yourself, and then run your own git diff commands with -M and whatever rename threshold you like. You may wish to add --name-status as well to suppress the actual diff and just get a list of files, including renamed files. (You might sometimes want --diff-filter as well; see the documentation.) When the list of renamed files matches what you want, you have found a suitable threshold value.

If you just want to disable rename detection entirely, git diff offers --no-renames as well. (Also, --find-renames= is a synonym for -M here.)

Given the appropriate -X options (I like to call these eXtended strategy options to help remember the flag letter[5]), git merge will pass either --no-renames or --find-renames=number on to both git diff commands. This will determine which, if any, renames the git merge command "sees". When git merge is combining work, if it sees a rename on one side, and nothing on the other, it takes the rename; if it sees two identical renames, it also takes the rename; and if it sees two different renames, it declares a merge conflict. These high level or tree level conflicts at git merge time are not subject to resolution through -X extended-strategy-options.
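Those three rename rules can be written down as a small Python sketch. It models only the rename bookkeeping, under the assumption that each side's detected renames arrive as a mapping from the merge-base path to the new path; the function name and the sample paths are invented for the example.

```python
def merge_renames(ours, theirs):
    """Combine per-side rename maps (base path -> new path).

    Returns (accepted renames, conflicts), where each conflict is a tuple
    of (base path, our new path, their new path).
    """
    accepted, conflicts = {}, []
    for path in sorted(set(ours) | set(theirs)):
        o, t = ours.get(path), theirs.get(path)
        if o is None or t is None or o == t:
            accepted[path] = o or t            # one side only, or both agree
        else:
            conflicts.append((path, o, t))     # two different renames
    return accepted, conflicts


ours = {
    "api/dev/file1": "api/v1/file1",   # renamed on our side only
    "docs/a.md": "docs/b.md",          # renamed identically on both sides
    "util.py": "helpers.py",           # renamed differently on each side
}
theirs = {
    "docs/a.md": "docs/b.md",
    "util.py": "tools.py",
}

accepted, conflicts = merge_renames(ours, theirs)
print(accepted)    # {'api/dev/file1': 'api/v1/file1', 'docs/a.md': 'docs/b.md'}
print(conflicts)   # [('util.py', 'helpers.py', 'tools.py')]
```

With rename detection disabled, both maps would simply be empty, so the merge would treat every old path and every new path as unrelated files.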

Simply disabling rename detection entirely may work well for your particular use case. Note though that Git does not record the merge options used here, and if you ever run git rebase -r aka git rebase --rebase-merges, Git will re-perform any merges, but because it didn't save the options you used, will perform those merges as default merges, with the default rename detection.


[5] Git documentation calls the -s option the strategy option, and the -X options the strategy option options. So we have -s, strategy, and -X, strategy-option, which is just confusing. But if we think of these as -s = strategy, -X = eXtended-(strategy-)option, then it all makes sense.
