No one else has addressed this for a while, so I'll take a stab at it. This is all purely high-level and theoretical, though, since I have not read the paper on the original patience algorithm.
The LCS (longest common subsequence) algorithms are all about reducing the time spent finding a minimal edit-distance solution. The standard (dynamic programming) solution is O(MN), where M is the number of symbols in the original string and N is the number of symbols in the target string. In our case, the "symbols" are lines and the "string" is the sequence of lines, rather than a character string (where the symbols would be, e.g., ASCII codes). We simply fill in an M x N matrix of "edit costs"; when we're done, we produce the actual edit by tracing a minimal path backwards through the resulting matrix. See https://jlordiales.me/2014/03/01/dynamic-programming-edit-distance/ for an example. (Web page found via Google search: it's not something I had anything to do with, other than scanning it quickly for correctness now. It seems correct. :-) )
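As a concrete (if naive) illustration, here is a minimal Python sketch of that O(MN) matrix fill over lines; the function name and the unit edit costs are my own choices, not anything from a real diff implementation:

```python
def edit_distance_matrix(a, b):
    """Fill the (M+1) x (N+1) cost matrix for editing lines `a` into lines `b`."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # cost of deleting all of a[:i]
    for j in range(n + 1):
        d[0][j] = j          # cost of inserting all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]            # keep the line unchanged
            else:
                d[i][j] = 1 + min(d[i - 1][j],       # delete a line
                                  d[i][j - 1],       # insert a line
                                  d[i - 1][j - 1])   # replace a line
    return d

old = ["/**", " * foo", " */", "function foo() {}"]
new = ["/**", " * bar", " */", "function foo() {}"]
print(edit_distance_matrix(old, new)[len(old)][len(new)])  # 1: one line replaced
```

The traceback then walks from the bottom-right corner back to the top-left, picking a minimal-cost neighbor at each step; that path spells out the delete/insert/keep script.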
Actually computing this matrix is fairly expensive for big files, since M and N are the number of source lines (usually approximately equal): a ~4k line file results in ~16M entries in the matrix, which must be filled in completely before we can trace the minimal path back. Moreover, comparing "symbols" is no longer as trivial as comparing characters, since each "symbol" is a complete line. (The usual trick is to hash each line and compare hashes instead during the matrix generation, and then re-check during traceback, replacing "keep unchanged symbol" with "delete original and insert new" if the hash misled us. This works fine even in the presence of hash collisions: we may get a very slightly sub-optimal edit sequence, but it will practically never be awful.)
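A toy version of that hashing trick might look like this (zlib.crc32 stands in for whatever hash a real implementation uses; the variable names are mine):

```python
import zlib

def line_hash(line):
    """Cheap stand-in hash for a whole line."""
    return zlib.crc32(line.encode("utf-8"))

old = ["/**", " * Function foo description.", " */"]
new = ["/**", " * Function bar description.", " */"]
old_h = [line_hash(l) for l in old]
new_h = [line_hash(l) for l in new]

# During the matrix fill: hash equality stands in for line equality.
matches = [oh == nh for oh, nh in zip(old_h, new_h)]

# During traceback: any "keep unchanged" step is re-checked against the real
# text; a hash collision just degrades that step to a delete+insert pair.
verified = [m and o == n for m, o, n in zip(matches, old, new)]
print(matches, verified)
```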
LCS modifies the matrix-calculation by observing that keeping long common subsequences ("preserve all these lines") almost always results in a big win. Having found some good LCS-es, we break the problem into "edit the non-common prefix, keep the common sequence, and edit the non-common suffix": now we calculate two dynamic programming matrices, but for smaller problems, so it goes faster. (And, of course, we can recurse on the prefix and suffix. If we had a ~4k-line file and we found ~2k totally-unchanged, in-common lines near the middle, leaving ~0.5k lines at the top and ~1.5k at the bottom, we can check for long common sub-sequences in the ~0.5k "top has a difference" lines, and then again in the ~1.5k "bottom has a difference" lines.)
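The divide-and-conquer step can be sketched like so, using Python's difflib longest-match search as a stand-in for finding a good common block (real LCS-based diff implementations differ in detail):

```python
from difflib import SequenceMatcher

def split_on_anchor(a, b):
    """Split both line lists around their longest common contiguous block."""
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    if m.size == 0:
        return None
    prefix = (a[:m.a], b[:m.b])                     # recurse on this smaller problem
    common = a[m.a:m.a + m.size]                    # keep these lines as-is
    suffix = (a[m.a + m.size:], b[m.b + m.size:])   # and recurse on this one
    return prefix, common, suffix

a = ["top A", "shared 1", "shared 2", "shared 3", "bottom A"]
b = ["top B", "shared 1", "shared 2", "shared 3", "bottom B"]
prefix, common, suffix = split_on_anchor(a, b)
print(common)  # ['shared 1', 'shared 2', 'shared 3']
```

Each recursive call works on a strictly smaller pair of sequences, so the quadratic matrix work only has to be paid on the leftover "actually different" regions.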
LCS does poorly, and thus produces terrible diffs, when the "common subsequences" are trivial lines like "}" that have lots of matches but are not really relevant. The patience diff variant simply discards such lines from the initial LCS calculation (more precisely, it anchors only on lines that occur exactly once in each file), so that they're not part of a "common subsequence". This makes the remaining matrices larger, which is why you must be patient. :-)
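In code, that filtering step is tiny. This sketch (my own, not taken from any patience implementation) keeps only the lines that occur exactly once on each side, so noise like "}" can never anchor a common subsequence:

```python
from collections import Counter

def patience_anchors(a, b):
    """Lines eligible to anchor a common subsequence: unique in both files."""
    ca, cb = Counter(a), Counter(b)
    return [line for line in a if ca[line] == 1 and cb[line] == 1]

a = ["int f() {", "  return 1;", "}", "int g() {", "  return 2;", "}"]
b = ["int g() {", "  return 2;", "}", "int f() {", "  return 1;", "}"]
anchors = patience_anchors(a, b)
print(anchors)  # the repeated "}" lines are filtered out
```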
The result is that patience diff is no help here because our problem has nothing to do with common subsequences. In fact, even if we discarded LCSes entirely and just did one big matrix, we'd still get an undesirable result. Our problem is that the cost of deleting:
- * Function foo description.
- */
-function foo() {}
-
-/**
(and inserting nothing) is the same as the cost of deleting:
-/**
- * Function foo description.
- */
-function foo() {}
-
The cost of either one is just "delete 5 symbols". Even if we weight each symbol—make non-empty lines "more expensive" to delete than empty lines—the cost remains the same: we're deleting the same five lines, in the end.
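The tie is easy to verify mechanically. In this sketch the weight function is arbitrary (my own invention); the point is that both candidate deletions cover the same multiset of lines, so any per-line weighting yields the same total:

```python
lines = ["/**", " * Function foo description.", " */", "function foo() {}", ""]

def weight(line):
    return 1 if line.strip() else 0.1   # arbitrary: blank lines cost less

# Candidate 1 deletes the tail of the first block plus the next "/**";
# candidate 2 deletes the first block from its own "/**" onward.
window1 = lines[1:] + ["/**"]
window2 = lines
print(sum(weight(l) for l in window1) == sum(weight(l) for l in window2))  # True
```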
What we need, instead, is some way to weight lines based on "visual clustering": short lines at an edge are cheaper to delete than short lines in the middle. The compaction heuristic added to Git 2.9 attempts to do this after the fact. It's obviously at least slightly flawed (only blank lines count, and they have to actually exist, not just be implied by reaching an edge). It might be better to do the weighting during matrix-filling (assuming what's left, after doing LCS elimination, really is going through the full dynamic programming matrix). This is nontrivial, though.
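A hedged sketch in the spirit of that after-the-fact heuristic (this is my own toy version, not Git's actual code): when a deletion hunk is ambiguous, i.e. it can slide without changing which text is removed, prefer a position where the hunk ends on a blank line.

```python
def slide_hunk(lines, start, length):
    """Return (start, length) for an equivalent, better-clustered deletion hunk."""
    best = start
    s = start
    # The hunk can slide up one line whenever the line just above it equals
    # its current last line; the resulting deletion removes the same text.
    while s > 0 and lines[s - 1] == lines[s + length - 1]:
        s -= 1
        if lines[s + length - 1].strip() == "":
            best = s                 # this position ends on a blank line
    return best, length

lines = ["/**", " * Function foo description.", " */", "function foo() {}", "",
         "/**", " * Function bar description.", " */", "function bar() {}", ""]
# A naive diff deletes lines 1..5 (" * Function foo ..." through the second
# "/**"); sliding the hunk up gives the deletion of lines 0..4 instead.
print(slide_hunk(lines, 1, 5))  # (0, 5)
```

Note how this toy version shares the real heuristic's limitation described above: if no blank line exists inside the slidable range, `best` never moves, even when stopping at the file's edge would look just as good.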