merge files from branch A in branch B that only exist in branch B

Time:05-13

I have the following situation:

We use a software package on our trusted AS400. We have modified some of its sources ourselves and keep those in their own library (A). The rest of the unmodified sources are in a different library (B). There is no version control present yet.

Now the company we bought the software from regularly makes updates to the sources. We receive those updated sources in a different library (C).

Right now we go through the updates manually to see whether any sources in library C are also in library A, and if there are, we merge them by hand.

Now I wanted to automate this with git. I know how to convert the libraries into branches, but I don't know what the best way of merging them is.

I would need to merge branch C into branch A, but only the sources from branch C that are also present in branch A, as the rest will be merged into branch B.

Example:

Branch A: file1, file2, file4

Branch C: file1, file2, file3

(Branch B: file1, file2, file3, file4)

I would need to merge file1 and file2 from branch C into branch A, ignoring file3 and file4. Any way to see that file1 and file2 are the ones that are present in both branches would be a start too.

I hope somebody has a solution or can at least point me in the right direction! I've looked for some solutions, but sadly nothing fits my specific case.

EDIT: I have since figured out how to get the sources present in both libraries:

SELECT TABLE_PARTITION FROM syspstat WHERE TABLE_NAME ='QRPGLESRC' and 
TABLE_SCHEMA = 'LIBRARY-A'
and TABLE_PARTITION in (
SELECT TABLE_PARTITION FROM syspstat WHERE TABLE_NAME ='QRPGLESRC' and 
TABLE_SCHEMA = 'LIBRARY-B'); 

The condition TABLE_NAME = 'QRPGLESRC' can be changed to name whichever source file you want to look into, or left out entirely if you want to compare all of them.

CodePudding user response:

Git does not do what you are looking to have it do.

Once you understand what Git does do, however, you'll see how to accomplish your goal. You need at least one extra step (for which you may use Git, but you won't be doing a "Git thing", you'll just be using some commands).

First, you need to avoid too much thought about "branches". Git is not about "branches"; Git is about commits. The word branch in Git is badly overused, to the point of becoming almost meaningless, though there are two or three (or four or six or so) things—unfortunately, different things—that humans mean when they say the word "branch" (see also What exactly do we mean by "branch"?). Initially, though, it's best to just avoid the word "branch" entirely.

So: Git is about commits. But what, exactly, is a commit? Git's answer is to declare a few things about commits here:

  • Every commit has a unique number, normally expressed as a big ugly hexadecimal hash ID or OID (Object ID). This number appears random (though it isn't actually random), and is entirely unpredictable. Whenever you, or anyone, make a new commit in any Git repository, that commit gets a new number: one that has never been used before. It must never be used again, by any Git, for any other commit, because that number now means the commit you just made.¹

  • To make the numbering scheme work, all parts of any commit are completely read-only. No part of any commit can ever be changed, not even by Git itself, once it's made.

  • Every commit stores a full snapshot of every file, frozen in time, as of the form (and existence) that that file has at the time you, or whoever, make the commit.

  • Every commit also stores some metadata, or information about this particular commit. The metadata include things like your name and email address. They include the date-and-time stamp for when you made the commit (this is part of what makes future hash IDs unpredictable: you have to know the exact second at which you'll make each future commit, for instance). And, importantly for Git's own operation, each commit's metadata stores a list of previous commit hash IDs.

Because the hash ID is the commit, in an important sense—every commit has a unique one and Git needs the hash ID to retrieve the commit—keeping a list of previous hash IDs means that commits are the history in a repository. A repository could consist of nothing but commits, in a big database of all-commits-in-this-repository, plus any supporting objects required by the commit objects (there are a bunch of those).

Making humans memorize hash IDs, though, turns out to be a bad idea. Humans are terrible at hash IDs. Humans want, indeed need, names. Fortunately, computers are good at storing files full of name-to-hash-ID mappings. So Git provides a second, separate database, in which Git stores these names. These are branch names, tag names, remote-tracking names, and numerous other kinds of names, and each name stores exactly one hash ID.

For a branch name, the one stored hash ID is defined as the latest commit "in" or "on" that branch. Git calls this the tip commit. Since each commit stores a list of previous commits—and most commits store just one entry in this list—that gives Git a way to work backwards.
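To make the "names store hash IDs, commits store snapshot plus metadata" idea concrete, here is a minimal sketch in a throwaway repository. The user name, email address, and file name are placeholders chosen for illustration, not anything from the question.

```shell
#!/bin/sh
set -eu

# Throwaway repository; the user name, email, and file name are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # make sure the first branch is called "main"
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello" > file1
git add file1
git commit -q -m "first commit"

# A branch name stores exactly one hash ID: that of the tip commit.
tip=$(git rev-parse main)
echo "main -> $tip"

# The commit object: a tree line (the snapshot) plus the metadata.
git cat-file -p "$tip"

# rev-list --parents prints the commit's hash ID followed by its parents' IDs,
# so a root commit yields exactly one word and a parent count of zero.
parent_count=$(( $(git rev-list --parents -n 1 "$tip" | wc -w) - 1 ))
echo "number of parents: $parent_count"
```

The git cat-file -p output should show a tree line, author and committer lines, and the message, with no parent line because this is the very first commit.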


¹ Simple theory (in particular the pigeonhole principle) will tell you instantly that this scheme is doomed to fail someday, no matter how big the hash IDs are. The birthday paradox or birthday problem tells us that doomsday is actually closer than you might think. But for practical purposes, it's probably sufficiently far in the future that we'll all be dead and not care about the ultimate Git Doomsday. Still, Git is moving from SHA-1 to the larger SHA-256, which puts it off by more billions of years, probably.


How branch names find commits

Let's draw a very simple repository with just three commits and one branch name in it. The very first commit we ever made has some big ugly hash ID, but we'll just call it "commit A". Its list of previous commit hash IDs is necessarily empty, because there are no other commits at the time we make A. So it sits alone:

A

Then, using commit A as a starting point, we'll make a new commit (snapshot plus metadata) B. Git will arrange for commit B to remember commit A's hash ID as B's (single) parent. We say that B points to A, and draw it in like this:

A <-B

Using commit B, we make a new commit C. Git will store in C's metadata the hash ID of commit B, so that C points backwards to B:

A <-B <-C

All along, each time we make a commit, Git will stuff that commit's hash ID into the current branch name main. So initially we have:

A   <-- main

Then, once we make B, we have:

A--B   <-- main

(where I've gotten lazy about drawing the backwards-pointing arrows between commits: you'll see why in a moment). Note that the name main holds the hash ID of B now, so that main points to B.

Then we make commit C and get:

A--B--C   <-- main

At this time, let's make a new branch name. This new name might be develop. As we saw earlier, every branch name holds exactly one hash ID, and there are only three commits, so the new name develop must point to exactly one of these three commits. We can pick any of the three, but typically (and most easily) we pick the commit we're actually using right now: commit C. We get:

A--B--C   <-- develop, main

We now need a way to remember which name we are using to find commit C. At the moment, it doesn't matter, but as soon as we make one new commit, it will matter. So we will have Git attach the special name HEAD, written in all uppercase like this, to exactly one branch name:

A--B--C   <-- develop, main (HEAD)

This indicates that the branch name we are using is main. That branch name points to commit C, so we are using commit C.

We now run git switch develop or git checkout develop (both do the same thing) and get:

A--B--C   <-- develop (HEAD), main

We're still using commit C, but now we are doing so through the name develop.

When we make a new commit, Git will make the snapshot and metadata such that the metadata point back to existing commit C, so our new commit—which we will call D—looks like this:

A--B--C
       \
        D

(note: I don't have a good arrow font to draw D pointing to C so that it shows up well in all StackOverflow views). As always, Git updates our current branch name to point to the new commit. That name is the name to which the special name HEAD is attached, i.e., develop, so now we have:

A--B--C   <-- main
       \
        D   <-- develop (HEAD)
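The drawings above can be reproduced in a scratch repository. This is only a sketch: the commits are empty (made with --allow-empty) to keep it short, and the identity settings are placeholders.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# Commits A, B, C on main; each new commit records the previous one as its parent.
for name in A B C; do
    git commit -q --allow-empty -m "$name"
done

git branch develop          # a second name for the same commit C
git checkout -q develop     # attach HEAD to develop (git switch develop also works)
git commit -q --allow-empty -m D

# Only the name HEAD is attached to has moved.
main_tip=$(git log -1 --format=%s main)
develop_tip=$(git log -1 --format=%s develop)
current=$(git symbolic-ref --short HEAD)
echo "main -> $main_tip, develop -> $develop_tip, HEAD attached to $current"

git log --graph --oneline --all
```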

Now, if we git switch main or git checkout main, we get to see how the working tree works.

A brief look at your working tree

For space reasons, I won't go into much detail here (and will skip entirely over the index / staging-area, which is crucial to using Git), but remember that we mentioned that all parts of every commit are completely read-only. Not only is that true, but the files in each commit are stored in a special, Git-only format that only Git can read. The files' contents are de-duplicated within and across commits, and are compressed, sometimes highly compressed, in ways that make it too hard for most programs to use these files.

Of course, we use Git to store files in the first place, and if we could not get our files back out of Git, that would make Git useless. So when we check out a commit—with git switch (switch is the new verb, since Git 2.23) or git checkout (the old one)—Git will extract, from that commit, all of the frozen-for-all-time files.

Git needs a place to put these files, and that place is your working tree. Git will first remove, from the working tree, files that are there because of some other commit. Then it will un-archive the files from the selected commit. That gives you ordinary files—not some special Git-only thing—that all programs can read and also write/overwrite. These files are not in Git though they may well just have come out of Git. It's important to remember that your working tree files are not Git's files: they're yours. But Git will remove and replace them when you tell it to do that.

(Git's index becomes very important at this point, as it determines which files are tracked and controls the details of this whole remove-and-replace thing that Git does with your files. But again, we're skipping it.)

Merging

As we work with Git, making commit after commit, we accumulate more and more commits in a commit graph. For instance, we might have a main-line "branch" (this word is difficult to avoid) consisting of a series of commits ending at one with hash ID H, like this:

...--G--H   <-- main

But there might be some more "branches" that go on from H, e.g., like this:

          I--J   <-- br1 (HEAD)
         /
...--G--H
         \
          K--L   <-- br2

It's important to realize that commits up through H are on all the branches. Commits I-J here are drawn such that they are only on br1, and K-L are only on br2, but commits up through H are on both branches (and on main if the name main exists and points to H).

Note that the commits exist in their own right, whether or not there is a name pointing to them. Having a name like main pointing to H merely gives us a quick way to find that commit without knowing its hash ID. Having a name like main pointing to H gives us a way to find earlier commits, too, because Git can follow the backwards-pointing arrows from H to G, then G to F and so on. Git cannot go forwards from H to I or J, even though both I and J point backwards to H, because H does not know the hash IDs of either I or J. Those commits were created after commit H was created, and commit H was frozen for all time once it was created.

So, in the end, we need both names br1 and br2 at this point just to find the four commits that are "after" commit H. That will change in a moment, as you'll see. We don't need the name main at the moment, though, because either br1 or br2 suffices to find H. We only need the name main if we want to have a direct way to find commit H specifically by the name main.

In any case, our diagram shows us that we are "on" branch br1 right now, and hence using commit J. Those are the files we'll see in our working tree: the files that are in commit J.² That's not so important to Git; what matters to Git is that our current branch name is br1 and our current commit is commit J.

We now run:

git merge br2

Git uses the name br2 to locate the tip commit of that branch (note ambiguous word; tip commit is not ambiguous though and means commit L). So Git now has hold of two commits: J and L.

The merge program now reads backwards through history, commit by commit, to find the best shared commit. This is some commit that is on both branches (the ambiguous word again: note how the meaning keeps shifting from "branch name" to "tip commit" to "set of commits"), and that is "better" than any other commit also on both branches.³ The merge command calls this best-shared-commit the merge base. In our example, the merge base is obvious based on how we drew the graph: it's commit H, the last commit that is on both branches.
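You can ask Git which commit it would pick without starting a merge. The sketch below rebuilds the G-H / I-J / K-L graph with empty commits in a scratch repository (identity settings are placeholders) and prints the merge base.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

commit() { git commit -q --allow-empty -m "$1"; }

commit G; commit H                        # shared history, main points to H
git checkout -q -b br1;      commit I; commit J
git checkout -q -b br2 main; commit K; commit L
git checkout -q br1

# The best commit that is reachable from both tips.
base=$(git merge-base br1 br2)
base_subject=$(git log -1 --format=%s "$base")
echo "merge base of br1 and br2: $base_subject"
```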

The way git merge operates at this point is to run two git diff commands. We have not talked about git diff, so let's just very briefly say that in this case, it will compare every file in the merge base commit to every file in one of the two branch tip commits. For some files, the file exists in both commits and is identical in both commits. For some files, it exists in both commits but is different. For some files, the file might exist in only one of the two commits, the base or the tip. (If the file exists in neither commit, there's nothing to compare, of course.)

Note that Git skips right over all the intermediate commits. It does not compare H, the merge base, to I. It compares H to J only. So whatever files are in both H and J, those files get compared. Some match and some don't, and for those that don't match, the diff finds what's different in those files.

For files that were in H but are not in J, Git calls that file deleted. For files that are not in H but do exist in J, Git calls that file added (newly created). We'll ignore the tricky case of renamed files.⁴

Meanwhile, as a separate step, Git compares what's in H to what's in L, skipping right over K entirely. Again, some files will exist in both, and some files will match in both, and some won't. Some files may be added or deleted.

The merge operation consists of taking these two separate diffs—these comparisons that turn snapshots into changes—and combining the changes. To combine "left side, H-vs-J, didn't touch file F, and right side did touch file F" Git gets to just keep the right-side version of the file. To combine "left side changed a file and right side didn't", Git gets to just keep the left side version. If both sides changed a file, Git has to combine the individual changes and apply both (combined) changes to the merge base version of the file.

For what I call high-level operations ("left side deleted file3" for instance, or "right side created all-new file4"), Git takes that into account, and if the other side didn't touch the file, or does not have the file, Git keeps the deletion or all-new file. However, if one side deletes a file and the other changes the same-named file, Git declares a high-level conflict here.

When both sides start with a common merge base file and make different changes to it, and the two changes overlap in incompatible ways, Git declares a low level conflict on that file.
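The two comparisons, and the way their results combine, can be watched directly. In this sketch (file contents and identity settings are made up for illustration) the left side changes file1 and deletes file3, while the right side changes file2 and adds file4; no two changes touch the same file, so the merge is clean.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# Merge base H: three files.
printf 'one\n' > file1; printf 'two\n' > file2; printf 'three\n' > file3
git add .; git commit -q -m H

# Left side (br1): change file1, delete file3.
git checkout -q -b br1
printf 'one, changed on br1\n' > file1
git rm -q file3
git commit -q -am J

# Right side (br2): change file2, add file4.
git checkout -q -b br2 main
printf 'two, changed on br2\n' > file2
printf 'four\n' > file4
git add file4
git commit -q -am L

git checkout -q br1
base=$(git merge-base br1 br2)
left=$(git diff --name-status "$base" br1)     # base vs. our tip
right=$(git diff --name-status "$base" br2)    # base vs. their tip
printf 'base vs br1:\n%s\n' "$left"
printf 'base vs br2:\n%s\n' "$right"

# Combine both sets of changes into a new merge commit.
git merge -q --no-edit br2
merged=$(git ls-tree -r --name-only HEAD | paste -sd' ' -)
echo "files in the merge result: $merged"
```

The merge result keeps the deletion of file3 from one side and the new file4 from the other, which is exactly the "high-level operation" behavior described above.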

Conflicts, if they occur, cause the merge operation to stop in the middle. You, the programmer, are required to clean up the mess. For low level conflicts, Git will write, to the working tree, its own best-effort at putting both sets of changes into the file, and will mark up the conflicted areas with "conflict markers". Your job is to come up with the correct combined file, using any way you care to do it.

For high-level conflicts, Git will tell you what it did ("I left you version whatever of file F" for instance) and, once again, your job is to come up with the correct file name (and whether the file should exist at all and if so what its contents should be). You then use git add, git rm, git mv, and any or all other Git-index-affecting operations to adjust Git's index to store the correct merge result, whatever you decide that is.

If there are no conflicts, or after you fix all the conflicts and run git merge --continue, Git will now make one new commit. The new commit is made like any other commit is made: new commit M will have as a parent existing commit J, because br1 points to J before Git makes the new commit. Remember, we started with this:

          I--J   <-- br1 (HEAD)
         /
...--G--H
         \
          K--L   <-- br2

What's special about the new commit M is that it is a merge commit, so it has a second parent. It links back not only to existing commit J, but also to existing commit L. That's the commit we named on the git merge command line when we said git merge br2. So M points to both J and L, and then Git stuffs M's new hash ID, whatever that is, into the current branch name as usual:

          I--J
         /    \
...--G--H      M   <-- br1 (HEAD)
         \    /
          K--L   <-- br2

We now have a merge commit on br1 as the new tip commit of br1. Commit M allows Git to work backwards to both J and L, so it is now safe to delete the name br2, if we have no reason to have a name br2 to find commit L directly:

          I--J
         /    \
...--G--H      M   <-- br1 (HEAD)
         \    /
          K--L

Branch br2 is now gone, but the commits remain, and we can still find them. Since Git is all about the commits—the branch names are just there to make it possible for us to find the commits—all is well here.

Note that the snapshot in M is just like any other snapshot. It says that if and when you check out commit M, the files that Git should install into your working tree (and Git's index) are those in the snapshot in M. The only thing special about M is that it has two parents.
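A short sketch of the "two parents" property, again with empty commits in a scratch repository (identity settings are placeholders): after the merge, M^1 is J, M^2 is L, and commit L stays reachable after the name br2 is deleted.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

commit() { git commit -q --allow-empty -m "$1"; }

commit H
git checkout -q -b br1;      commit J
git checkout -q -b br2 main; commit L
l_hash=$(git rev-parse br2)           # remember L before the name goes away

git checkout -q br1
git merge -q -m M br2                 # histories diverged, so this makes merge commit M

first_parent=$(git log -1 --format=%s 'HEAD^1')
second_parent=$(git log -1 --format=%s 'HEAD^2')
echo "M has parents $first_parent and $second_parent"

# Deleting the name is safe: L is still reachable through M's second parent.
git branch -q -d br2
git merge-base --is-ancestor "$l_hash" br1 && echo "L is still reachable from br1"
```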


² Since our working tree is an ordinary folder on an ordinary file system on our computer, we can create files that Git has never heard of, and/or that did not come out of the current commit. These files will also be in our working tree, but Git won't be using or touching them. Git calls these untracked files. To be precise, these files are not in Git's index (there's that pesky index again).

³ Technically, what Git is doing here is running the lowest common ancestor (LCA) algorithm on the commit graph. This can produce more than one commit node, and what Git does in that case gets complicated, but for the graph we've drawn here, and for most cases in practice, we mostly just get one commit.

⁴ Git does not store rename operations, so it has to find them from the commit contents. This amounts to a form of guesswork. Git's process of doing this is adjustable, so you can run diff with arguments that tell it to do its guessing differently. You can pass this to git merge as well, so merges can find, or fail to find, renames based on parameters you specify when you run git merge. Git's rename detection is not perfect and it sometimes needs these kinds of hints (and sometimes it just fails altogether); this is one area where Git's merge code can always use a bit of improvement.


Back to your desires

I would need to merge file1 and file2 from branch C into branch A, ignoring file3 and file4. Any way to see that file1 and file2 are the ones that are present in both branches would be a start too.

As you can see, this doesn't really translate properly into Git-ese. Files are not in branches. They are in commits.

We could, however, use Git's facilities to inspect the tip commit of some branch names (such as branch-C and branch-A) and see what files exist in those commits. We could then make one or more new commits, on some branch names—perhaps not branch-A or branch-C, since these new commits are going to be the new tip(s) of whatever branch(es) we are on as we make them—in which we have removed files we don't want Git to see.

The problem now is that when we run git merge, we'll get some commit as the merge base, from which Git will do two git diff commands. We can control the set of files in the branch tip(s) by making new commits, but the merge base that Git will choose depends on the graph—the history, as recorded by the existing commits.

Fortunately, we know that the way git merge works is to compare just the merge base files to just the tip commit files. If we make sure that we delete file3 and file4 from any tip commit(s) that have them but shouldn't, Git will either not see them at all (they're not in the base, and now not in the tip(s)) or see them as deleted (they are in the base but no longer in the tips). So Git will either have nothing to do, or will combine "delete file3" with "delete file3" or "delete file3" with "do nothing".

(Similarly, if file3 exists in the base and is in both tips, but differs in both tips, and we don't want Git to affect the file, we can make new tip commit(s) in which the tip copy or copies match the base copy. The file will now be present, not absent, in the final merge snapshot, but it will match the base copy. Or we can pick either tip copy and use it as the new copy in the other tip. These aren't cases you described, but knowing how git merge works, we know what we need to do if the case does come up.)

Because of the properties of merge-base-ness (LCA algorithms as linked in footnote 3), making a new tip commit after one or both existing tip commits won't change the chosen merge base. So we can run:

git merge-base --all

on the two existing tip commits to find the merge base commit's hash ID.⁵ You can then use the Git tools on the merge base commit to see what files exist there.


⁵ If this spits out two or more hash IDs, the problem gets more complicated. Recursive merge will merge these commits with git merge first, to get a snapshot to use as a merge base. The "resolve" strategy will pick one of these at apparent random and use that. Neither is particularly nice, but you can collect these up and run git merge-base on these to figure out what the merge will do. Or you can just run git merge --no-commit and let Git do its thing, and then work out the details manually. We won't cover this here though, again for space reasons.


The tools you need

Besides the already-mentioned git merge-base command (use it with --all), you will need git ls-tree -r. Use this on any commit to get a list of the names of files stored in that commit.

To find the files that are common to any given pair of commits (needed to identify which files you want to have Git omit), use a program like comm, or write your own. Note that the output from git ls-tree -r is already sorted, so this program is easy enough to write.
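As a rough sketch of that comparison, the script below builds the example from the question (tip of branch-C holds file1, file2, file3; tip of branch-A holds file1, file2, file4) in a scratch repository and feeds two git ls-tree -r --name-only listings to comm. The repository layout and identity settings are assumptions for illustration; LC_ALL=C is set so that comm uses the same byte-wise ordering Git does.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
lists=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/branch-C
git config user.name "Example User"
git config user.email "user@example.com"

# Tip of branch-C: file1, file2, file3.
for f in file1 file2 file3; do echo "$f from vendor" > "$f"; done
git add .; git commit -q -m "vendor drop"

# Tip of branch-A: file1, file2, file4.
git checkout -q -b branch-A
git rm -q file3
echo "local only" > file4
git add file4; git commit -q -m "local sources"

# One path per line, already sorted by Git.
git ls-tree -r --name-only branch-A > "$lists/a.txt"
git ls-tree -r --name-only branch-C > "$lists/c.txt"

common=$(LC_ALL=C comm -12 "$lists/a.txt" "$lists/c.txt")   # in both commits
only_c=$(LC_ALL=C comm -13 "$lists/a.txt" "$lists/c.txt")   # only in branch-C
printf 'in both:\n%s\n' "$common"
printf 'only in branch-C:\n%s\n' "$only_c"
```

Path names containing unusual characters are quoted by git ls-tree unless -z is used, so a real script may need to handle that case.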

Remember that git merge works with three commits, not two. And, as matt noted in a comment, there's always git merge-file, which performs a merge of a single file. It needs the same three inputs:

  • the two tip versions, "ours" and "theirs", and
  • a base version

To extract these files from a specific commit, consider using git restore (to get the file into your working tree) and then renaming each as you go, or use git show:

git show a123456:path/to/file.ext

extracts the specified frozen file (by path name, to the right of the colon) from the specific commit (to the left of the colon).
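Putting git show and git merge-file together, a single-file three-way merge might look like the sketch below. The branch names vendor and Alib, the file contents, and the temporary file names are all illustrative assumptions; git merge-file takes the current version, the base version, and the other version, in that order, and with -p writes the result to standard output.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
work=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/vendor
git config user.name "Example User"
git config user.email "user@example.com"

# Base version, as delivered by the vendor.
printf 'line1\nline2\nline3\nline4\nline5\n' > file1
git add file1; git commit -q -m C0

# Our modification touches the first line...
git checkout -q -b Alib
printf 'line1 local\nline2\nline3\nline4\nline5\n' > file1
git commit -q -am A0

# ...and the vendor's update touches the last line.
git checkout -q vendor
printf 'line1\nline2\nline3\nline4\nline5 vendor\n' > file1
git commit -q -am C1

# Extract the three inputs from their commits.
base=$(git merge-base Alib vendor)
git show "$base:file1"  > "$work/base"
git show "Alib:file1"   > "$work/ours"
git show "vendor:file1" > "$work/theirs"

# Exit status is the number of conflicts; zero here because the edits don't overlap.
git merge-file -p "$work/ours" "$work/base" "$work/theirs" > "$work/merged"
cat "$work/merged"
```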

CodePudding user response:

I would need to merge branch C into branch A, but only the sources from branch C that are also present in branch A, as the rest will be merged into branch B.

Okay, you've got a single vendor base series. I don't know AS400 so I'm going to answer using posix conventions and leave translation to AS400-ese for someone else.

Job one: make a Git history of the vendor drops. Job two: add the history of your modifications of / selections from those drops. Job three: automate incorporating a new vendor drop with the changes you're carrying.

So let's start from scratch. You get vendor drops in what you're calling C. Each drop is a full snapshot, which is convenient since that's exactly what Git likes.

If you still have the old snapshots, you can build a Git history from them very, very easily:

git init history; cd $_
git checkout -b vendor
git --work-tree=/path/to/oldest/snapshot add .; git commit -m C0
git --work-tree=/path/to/next-oldest/snapshot add .; git commit -m C1
git --work-tree=/path/to/next-newer/snapshot add .; git commit -m C2
git --work-tree=/path/to/even-newer/snapshot add .; git commit -m C3

and so on. Now you've got a vendor-branch series:

C0---C1---C2---C3      vendor

with the commit messages chosen to make referring to the individual commits using Git's message-search syntax very convenient: :/C0 is Git's name for "the newest (reachable) commit with a C0 in its message".

Next up: your existing A library series based off those drops. You need a step to record the additional ancestry when extending.
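A quick way to see what a :/ name resolves to is to hand it to git rev-parse or git log. This sketch uses empty commits in a scratch repository with placeholder identity settings. One thing to keep in mind: the text after :/ is treated as a regular expression, so once the series reaches C10, the name :/C1 could also match that newer commit.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/vendor
git config user.name "Example User"
git config user.email "user@example.com"

git commit -q --allow-empty -m C0
git commit -q --allow-empty -m C1

# Resolve the message-search names to commits.
c0=$(git rev-parse ':/C0')
c0_subject=$(git log -1 --format=%s ':/C0')
c1_subject=$(git log -1 --format=%s ':/C1')
echo ":/C0 -> $c0 ($c0_subject)"
echo ":/C1 -> $c1_subject"
```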

git checkout -b Alib :/C0
git --work-tree=/path/to/C0-based-A/snapshot add .; git commit -m A0    
git merge -s ours --no-commit :/C1    # record a merge from C1
git --work-tree=/path/to/C1-based-A/snapshot add .; git commit -m A1    

and so on with -s ours --no-commit merges to set up the added parent link, and the add-and-commit snapshot dance for the content and the first parent.

The example sequence so far will get you

   A0---A1   Alib
  /    /
C0---C1      vendor

with the exact contents you already have. Do it again for Blib to get

   A0---A1    Alib
  /    /
C0---C1       vendor
  \    \
   B0---B1    Blib

and extend for however far you actually have snapshots.


And there's your baseline. You now have a Git history for everything you still have records for.


For incorporating a new vendor drop and propagating your changes, the import really is this easy:

git checkout vendor
git --work-tree=/path/to/new/vendor/snapshot add .; git commit -m CN

but Git's not really set up for the on-the-fly subsetting you're doing. It's doable, and doable efficiently, but it's going to need a little deeper dive into what's going on.

You want to propagate only the changes to files on the "vendor" C branch you're already tracking in your A (or whatever) branch. Git works off commits, and the nice thing is absolutely everything else is window dressing, and you can cook up any commit you like.

Git keeps an index¹: pointers to the repo content for interesting paths in your work tree. git checkout loads the index and the work tree; git add adds to the repo's object db and updates the index entries; git commit writes any new trees for the currently-indexed content. (Trees are directory contents; "tree node" would be a more accurate name for them, but we're busy, and "tree" works in context.) This is the update dance.

git checkout Alib
( git ls-tree -r vendor; git ls-tree -r @ ) \
| sort -sk4 | uniq -df3 \
| git update-index --index-info

will load your index with (pointers to) the vendor-tip versions of only the files already in the Alib tip. You want to merge the vendor's interim changes to those files with your changes, so you want ancestry showing the previously-merged vendor versions of those files -- which you already have.
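To see what the sort and uniq stages select before anything is written to the index, you can run the same pipeline without the final git update-index step. The sketch below does that on a small made-up history (the file names, contents, and identity settings are assumptions): the stable sort keeps the vendor entry ahead of the Alib entry for each path, and uniq -d keeps only the first entry of each path that appears twice.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/vendor
git config user.name "Example User"
git config user.email "user@example.com"

# Vendor drop C0: file1, file2, file3.
for f in file1 file2 file3; do echo "$f v0" > "$f"; done
git add .; git commit -q -m C0

# Alib: carries file1 (modified) and file2, drops file3, adds file4.
git checkout -q -b Alib
echo "file1 v0 plus local change" > file1
git rm -q file3
echo "local only" > file4
git add file4; git commit -q -am A0

# Vendor drop C1: file2 updated.
git checkout -q vendor
echo "file2 v1" > file2
git commit -q -am C1
git checkout -q Alib

# Same pipeline as above, minus the update-index stage.
selected=$( { git ls-tree -r vendor; git ls-tree -r Alib; } \
            | sort -sk4 | uniq -df3 )
printf '%s\n' "$selected"

paths=$(printf '%s\n' "$selected" | cut -f2)
file2_blob=$(printf '%s\n' "$selected" | sed -n 2p | cut -f1 | cut -d' ' -f3)
```

The output should list only file1 and file2, each with the blob hash from the vendor tip rather than from Alib.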

selected=$(git commit-tree -p @^2 -m - `git write-tree`)
git reset -q
git merge -s ours --no-commit vendor
git cherry-pick -n $selected
git commit

I smoketested this by making a history of a bunch of Git releases and then treating only the files whose names contain the letter 'w' as my subset with some rather arbitrary changes:

snaptemp=`mktemp -d`
newhistory=`mktemp -d`
git init $newhistory; cd $_

git -C ~/src/git archive v0.99  | tar Cx $snaptemp
git --work-tree=$snaptemp add .; git commit -m C0
git -C ~/src/git archive v1.0.0 | tar Cx $snaptemp
git --work-tree=$snaptemp add .; git commit -m C1
rm -rf $snaptemp; mkdir $snaptemp
git -C ~/src/git archive v2.0.0 | tar Cx $snaptemp
git --work-tree=$snaptemp add .; git commit -m C2
rm -rf $snaptemp; mkdir $snaptemp
git -C ~/src/git archive origin/master | tar Cx $snaptemp
git --work-tree=$snaptemp add .; git commit -m C3


rm -rf $snaptemp; mkdir $snaptemp
git checkout -b Alib :/C0
git archive :/C0 :*w* | tar Cx $snaptemp
cd $snaptemp; find -type f -exec sed -si 5a'Hi, I changed this file' {} +
cd -; git --work-tree=$snaptemp add .; git commit -m A0

git show --diff-filter=d :/A0    # just to check

and then to carry the patches forward one step (basically to get us a merged-from-vendor Alib commit that fits the recipe above)

git checkout Alib # already there but hey
# the references here are specific to this step in the smoketest 
( git ls-tree -r :/C1; git ls-tree -r @ ) \
| sort -sk4 | uniq -df3 \
| git update-index --index-info
selected=$(git commit-tree -p @^ -m - `git write-tree`)
git reset -q
git merge -s ours --no-commit :/C1
git cherry-pick -n $selected
git commit    

Now that that's set up, further steps can follow the recipe given in the body above:

git checkout Alib
( git ls-tree -r vendor; git ls-tree -r @ ) \
| sort -sk4 | uniq -df3 \
| git update-index --index-info
selected=$(git commit-tree -p @^2 -m - `git write-tree`)
git reset -q
git merge -s ours --no-commit vendor
git cherry-pick -n $selected
git commit

¹ you can keep as many index files as you want, I could have done this with fewer commands and less work tree churn using the GIT_INDEX_FILE environment variable but that's a step deeper into the land of faerie

CodePudding user response:

Is your AS400 source stored in members of one or more source physical files?

If so, I would copy all the source code involved to a folder on a PC. Then set up a second folder on the PC that contains the vendor's source code. Set up a GitHub repo and use Git to keep those two folders in sync.

Then have a standalone process which copies source code between the source members on the AS400 and the folder that contains the source code repo on the PC.
