Repetitive merges in GIT. How does it calculate differences?

Question

I've been researching how Git merge works. I know there are several merge strategies, such as recursive, octopus, etc. I found that resolve / recursive are the most commonly used, and that the recursive merge is only useful when there are several common ancestors / bases.

However, I couldn't find which algorithm is used (or how the ancestor is calculated) for repeated merges from the branch into master.

A simple example. Let's create an empty project with 1 file "A":

Then create another file "B" and commit to master

A
B

Then I create a branch from the very first version which only had 1 file "A" and create another file "C". So my branch looks like this:

A
C

Then I decide to merge my branch changes to master and I get:

A
B
C

Then I decide to go back to my branch and continue my work from there. I create another file "D"

A
C
D

Now I want to merge my changes from branch back to the trunk. How is the ancestor calculated?

A visual example:

If I take the ancestor "AC", it should say that "B" is also a new addition because it did not exist in two versions: branch and ancestor.

If I take the ancestor "ABC", it should say that "B" is deleted since B existed in two versions: master and ancestor.

Both of these options look incorrect. I tried to figure it out using "Plastic SCM", which has a merge explanation feature. It shows that version "AC" is used as the ancestor/base, yet it still correctly calculated how many files were added (only 1 and not 2).

Git looks for the merge base commit, which is the closest commit reachable through both parents. In the ABC merge this is A, in the ABCD commit this is AC, then it uses changes introduced after that to compare. — Lasse V. Karlsen, Feb 06 '20 at 13:29
@LasseV.Karlsen wouldn't that mean that in the ABCD commit there were two files added: "B" and "D". However it only shows addition to "D". How does it know to ignore file "B"? — NeuTronas, Feb 06 '20 at 13:36
It does not ignore "B". It considers both "B" and "D" as new files. By looking at the "ours" side, the algorithm knows that "B" was added (and at which position it belongs), and by looking at the "theirs" side, it knows that "D" is new (and at which position it belongs). Note that you can swap "ours" and "theirs", and you still come to the same result; i.e., the direction of the merge is completely irrelevant. — j6t, Feb 06 '20 at 14:03
@j6t I simulated a similar situation in GitHub repository. (starting from completely empty repository to ABC). During the second merge only one file is shown as changed (which : https://github.com/Neutronas/3-WayMerge/commit/ce1d76b735d482ca053fe774f2f549021c87af26 — NeuTronas, Feb 06 '20 at 14:19
Of course. After the merge you have one new commit either on top of "ABC" or on top of "ACD". When you compare that commit to its first-parent, you will see only one new file. But the merge algorithm did something different: it compared to "AC" twice; and there it saw the new entry "B" in the first comparison, and the new entry "D" in the second comparison. — j6t, Feb 06 '20 at 14:36
So the correct information shown comes from the commit comparison and not from the merge itself. Very interesting. Thank you. I think that solves my issue. — NeuTronas, Feb 06 '20 at 14:38

score 5 · Accepted Answer · answered Feb 06 '20 at 23:00

To both summarize the comments, and address the question as asked...

Finding a merge base

Git computes the merge base of a pair of commits using an algorithm for finding the Lowest Common Ancestor of a Directed Acyclic Graph. The precise algorithm is not described anywhere and may change, as long as the new one produces correct results. See also Algorithm to find lowest common ancestor in directed acyclic graph?
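To make this concrete for the history in the question, here is a minimal Python sketch. It is not Git's actual algorithm: the commit names and the `PARENTS` table are made up for illustration, and the function simply keeps every common ancestor that is not itself an ancestor of another common ancestor.

```python
# Hypothetical commit graph for the question's example: commit -> its parents.
PARENTS = {
    "A": [],               # initial commit, file A only
    "AB": ["A"],           # master adds B
    "AC": ["A"],           # branch adds C
    "ABC": ["AB", "AC"],   # first merge of the branch into master
    "ACD": ["AC"],         # branch adds D
}

def ancestors(commit, parents=PARENTS):
    """Return the set holding `commit` and every commit reachable from it."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

def merge_bases(left, right, parents=PARENTS):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(left, parents) & ancestors(right, parents)
    return sorted(
        c for c in common
        if not any(c != other and c in ancestors(other, parents) for other in common)
    )

print(merge_bases("AB", "AC"))    # first merge  -> ['A']
print(merge_bases("ABC", "ACD"))  # second merge -> ['AC']
```

In a real repository, `git merge-base --all <left> <right>` answers the same question.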

There may be multiple LCAs. In this case, the -s resolve merge strategy picks one of them. You have no control over which one it picks. The -s recursive merge strategy runs git merge on them, two at a time, as if by the following:
```
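# Sketch in bash syntax. git-merge-recursively-inner is a hypothetical
# helper, not a real Git command; $left and $right are the two commits.
commits=( $(git merge-base --all "$left" "$right") )
if [ "${#commits[@]}" -gt 1 ]; then
    a=${commits[0]}
    # Fold the remaining merge bases into $a, one at a time.
    for b in "${commits[@]:1}"; do
        a=$(git-merge-recursively-inner "$a" "$b")
    done
    commits=( "$a" )
fi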
```
(in pseudo-code). Note that the inner recursive merge may itself find more than one merge base; if so, it uses this algorithm to merge them.

The final result is a single commit, $commits[0]. This is the merge base.
In any case, now that we have a single merge base commit—from the LCA-finding algorithm that only found one LCA, or by merge-recursive merging the multiple merge bases that came out of the LCA-finding algorithm, or by merge-resolve just picking one commit from the list—we can look at how git merge-(recursive|resolve) actually merges files. It must run two internal git diff operations, each with the rename detector turned on.

Diffs, and file identity / rename detection

A file difference engine compares two files. We put one file on the left and another file on the right. Where the two files match up, the diff says nothing. Where the two files differ, the difference engine—depending on how good it is—comes up with some set of changes we can apply to make the left-side's content match the right-side file's content.

To diff a pair of commits, Git puts one on the left and one on the right. Then it must pair up files in these two commits. Git can do this with a rename detector enabled, or not.

The picture is pretty clear when there is no rename detector. Files on the left and right are "the same file" if and only if they have the same name. Adding the rename-detector identifies (marks as "the same") some file(s) on the left and right sides of a diff, even if the names have changed.

Git's existing rename detector is undergoing some changes to make it better. The exact details are not required here: all we need to know is that it will say that some files are renamed, so are "the same" file, even if they have different names. Other files are automatically "the same" file because they have the same names.

For each paired-up file, the difference engine produces a set of changes that will make the left-side file become the right-side file. The rename detector produces rename operations that are required to be executed first. Files that are new in the right are called added, and files that existed in the left side commit, but do not exist in the right side commit, are deleted.

Hence, the diff-of-pair-of-commits results in:

files to rename (from old-name to new-name)
files to add
files to delete

plus some sets of changes for files that exist in both commits, as required.
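As a rough illustration of that result, here is a Python sketch that diffs two snapshots stored as `{path: content}` dictionaries. The function name and data layout are my own, and the rename step is deliberately naive: it only pairs a deleted path with an added path when the content is identical, whereas Git's detector also pairs files that are merely similar.

```python
def diff_snapshots(left, right):
    """Compare two snapshots given as {path: content} dicts."""
    deleted = {p for p in left if p not in right}
    added = {p for p in right if p not in left}

    # Naive rename detection: identical content under a different name.
    renamed = {}
    for old in sorted(deleted):
        for new in sorted(added):
            if left[old] == right[new] and new not in renamed.values():
                renamed[old] = new
                break
    deleted -= set(renamed)
    added -= set(renamed.values())

    # Same name on both sides but different content.
    modified = {p for p in left if p in right and left[p] != right[p]}
    return {"renamed": renamed, "added": added,
            "deleted": deleted, "modified": modified}

base = {"A": "a\n", "C": "c\n"}
ours = {"A": "a\n", "B": "b\n", "C": "c\n"}
print(diff_snapshots(base, ours)["added"])   # -> {'B'}
```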

Merging, given a merge base

Given a single merge base commit, both the resolve and recursive strategies proceed in the same way:

Diff the merge base against HEAD, with rename detection enabled. These are our changes.
Diff the merge base against the other commit, with rename detection enabled. These are their changes.
Combine the changes.

"Combining" requires addressing both high-level changes, such as rename, add, and delete, and low-level changes within a single file. The file to which combined changes will be applied is the file from the merge base. That guarantees that the result works in all cases.

For instance, suppose we renamed a file and they modified the file we renamed. The combined changes say, in effect, at the end, rename file base.ext to head.ext; meanwhile, change line 17 of base.ext. So we'll change line 17, and rename the file, capturing both actions.

High level operations can conflict! For instance, if we rename a file and they delete it, that is a high level conflict. If both we and they rename a file, that is a conflict unless we both chose the same final name. If both we and they delete a file, that combines well with the obvious result.
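Applied to the question's example, the high-level part of this can be sketched in Python as below. This is a simplification under my own assumptions: snapshots are `{path: content}` dictionaries, renames are ignored, and a file that both sides changed differently is reported as a conflict right away, where Git would go on to attempt a line-level merge.

```python
def merge_trees(base, ours, theirs):
    """Three-way merge of {path: content} snapshots, whole files only.

    Returns (merged, conflicts). A side that still matches the base made no
    change to that path, so the other side's version is taken.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both agree: same add, same edit, or both deleted
            result = o
        elif o == b:      # only they changed this path
            result = t
        elif t == b:      # only we changed this path
            result = o
        else:             # both changed it differently, e.g. modify/delete
            conflicts.append(path)
            continue
        if result is not None:   # None means the file is absent
            merged[path] = result
    return merged, conflicts

base = {"A": "a", "C": "c"}                 # merge base "AC"
ours = {"A": "a", "B": "b", "C": "c"}       # master, "ABC"
theirs = {"A": "a", "C": "c", "D": "d"}     # branch, "ACD"

merged, conflicts = merge_trees(base, ours, theirs)
print(sorted(merged), conflicts)            # -> ['A', 'B', 'C', 'D'] []
print(sorted(set(merged) - set(ours)))      # vs. first parent -> ['D']
```

The merge sees both "B" (our side) and "D" (their side) as additions relative to "AC". Only the later comparison of the merge result against its first parent, "ABC", reports "D" as the single new file.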

Low level changes can also conflict. A conflict occurs if we and they both modify the same lines in different ways, or if our changes and their changes "touch" at either edge. For instance, if we replace lines 9 and 10 (delete 2 lines after line 8 and insert 2 lines after line 8) and they replace lines 11 and 12, our changes abut. Out of general caution, Git calls this a conflict.

Of course, if we and they make the same changes to the same original lines, that is not a conflict. Git simply takes one copy of those changes.
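The overlap-or-touch rule can be written down as a small Python check. This only models the rule as described here; it is not Git's actual diff code, and the hunk representation (first base line, last base line, replacement lines) is an assumption of mine that covers replacements but not pure insertions.

```python
def hunks_conflict(ours, theirs):
    """Each hunk is (first_line, last_line, new_lines) against the merge base.

    Hunks conflict when their base line ranges overlap or are directly
    adjacent, unless both sides made the identical change.
    """
    if ours == theirs:
        return False
    o_first, o_last, _ = ours
    t_first, t_last, _ = theirs
    return o_first <= t_last + 1 and t_first <= o_last + 1

# We replace lines 9-10, they replace lines 11-12: the ranges abut.
print(hunks_conflict((9, 10, ["x", "y"]), (11, 12, ["p", "q"])))   # -> True
# One untouched line between the two changes: no conflict.
print(hunks_conflict((9, 10, ["x", "y"]), (12, 13, ["p", "q"])))   # -> False
```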

The extended option -Xours or -Xtheirs resolves low level conflicts by choosing one side (ours or theirs) to take, ignoring the other side. This works only for low level conflicts. Logically, it could apply to high level conflicts too, but it just doesn't.

Having combined all of our and their changes, Git will apply the combined changes to the snapshot found in the merge base commit. The resulting files can be committed automatically if there are no conflicts. This is the default action for these merges; use --no-commit to suppress this default commit.

When merge-recursive uses an inner merge to make a merge base commit, it forcibly commits the result even if there are merge conflicts. You do not get to see what it did with these conflicts, except in whatever shows up in the merge base when your (outer) merge has a conflict as well. (In this case, the merge-base copy of the file is available in index slot 1. Also, if you set merge.conflictStyle to diff3, each work-tree copy of a conflicted file will show the text from the merge base, complete with conflict markers.)
