To expound a bit on VonC's answer, note that the `git diff` documentation says this about `-C` and `--find-copies-harder`:
> For performance reasons, by default, `-C` option finds copies only if the original file of the copy was modified in the same changeset. This flag makes the command inspect unmodified files as candidates for the source of copy. This is a very expensive operation for large projects, so use it with caution. Giving more than one `-C` option has the same effect.
The reason this is "very expensive" is the approximate-match code. Before reading any further, think about what you see in a typical diff:
```
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 96d208d..80b2387 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -347,7 +347,7 @@ test_lazy_prereq TAR_HUGE '
 	test_cmp expect actual
 '
 
-test_expect_success 'set up repository with huge blob' '
+test_expect_success LONG_IS_64BIT 'set up repository with huge blob' '
 	obj_d=19 &&
 	obj_f=f9c8273ec45a8938e6999cb59b3ff66739902a &&
 	obj=${obj_d}${obj_f} &&
```
Here we have (well, Junio Hamano has) deleted the original `test_expect_success` line and inserted a new, slightly different `test_expect_success` line. Really, he has just inserted `LONG_IS_64BIT` within the line.

We've glossed over a really big subject here, which is: how do we choose "chunks" for diffing and for displaying? We pretend the right answer is "one line at a time". Internally, Git needs to make various different choices, and it does not always make the same choice, or make it consistently; it's done more ad hoc, using whatever works well for each purpose.
Given a diff (finding one is itself somewhat tough and compute-intensive: the naive algorithm is O(MN), and the Myers algorithm is roughly O(ND)), we can then define a similarity index: how much of the original `t/t5000-tar-tree.sh` is in the new `t/t5000-tar-tree.sh`, and how much is different? Git defines this as:

> amount of addition/deletions compared to the file's size

(again, from the `git diff` documentation). See also Edward Thomson's answer here, which details the computation of the similarity index. For performance reasons it's done as a kind of stripped-down diff, rather than a full diff.
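As a rough sketch (a throwaway repository with invented file names; it only assumes `git` is on your PATH), you can see the similarity index Git computes when a renamed file is also edited slightly:

```shell
#!/bin/sh
# Demo: the similarity index for a rename plus a small edit.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo

seq 1 100 > t5000.sh                 # 100 lines of original content
git add . && git commit -qm 'add t5000.sh'

git mv t5000.sh t5001.sh             # rename it...
echo 'one extra line' >> t5001.sh    # ...and change it slightly
git add . && git commit -qm 'rename and tweak'

# The patch header reports how similar Git thinks the two versions are.
git diff -M HEAD~1 HEAD | grep 'similarity index'
```

Because only one line out of a hundred changed, the reported similarity is well above the default 50% threshold, so `-M` pairs the two files as a rename.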
There's one other important performance trick: exact matches (100%-similar files) are handled up front simply by comparing the blob hashes. This obviates the need to run any diff, even a stripped-down one. Since a pure rename leaves the file's content, and therefore its blob hash, unchanged, plain renames are detected very fast.
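A pure rename can be demonstrated in a throwaway repository (a sketch; assumes only that `git` is installed): the old and new blob hashes are identical, so Git pairs the files at 100% similarity without diffing anything.

```shell
#!/bin/sh
# Demo: a pure rename is detected by blob-hash equality alone.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo

seq 1 100 > old.txt
git add . && git commit -qm 'add old.txt'
git mv old.txt new.txt               # rename only, no content change
git commit -qm 'rename old.txt to new.txt'

# Same content, same blob hash on both sides:
git rev-parse HEAD~1:old.txt
git rev-parse HEAD:new.txt

# So rename detection reports a 100% match with no diff needed.
git diff -M --name-status HEAD~1 HEAD
# R100	old.txt	new.txt
```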
## Diffing trees
Consider what `git diff` has to do here. We're comparing two directory trees, which we can call "left" (usually the old version) and "right" (the new version). The left tree has some files: `a.txt`, `b.txt`, and so on (possibly in sub-trees, like `dir/e.txt`). The right tree has some files as well.
Suppose that the left and right trees have exactly the same files, by name. In this case, we probably just want to stop here and declare that no files were renamed and no files were copied; we can then go on to diff the contents of each file, one pair at a time. This is Git's default action: pairing purely by name is what happens by default whenever every right-side file has a left-side counterpart.
If we add `-B` ("break" big changes), however, we need to add another pass. We compare each already-paired file pair (such as "left `a.txt`, right `a.txt`") pairwise. If the files are similar enough, we declare that they're the same file. If they're too different, we break the pairing, and both halves go back to being candidates for renames and/or copies. The amount of difference that counts as "too different" is, of course, the argument to `-B`.

(This is a little bit simplified: `-B` actually takes two arguments, one for "rewrite" detection and one for "rename" detection. The main function of this first pass is to record the similarity index. As noted earlier, for performance reasons, this similarity-index computation does a different diff than the one we'll see as a patch.)
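Here is a sketch of `-B` in action (a throwaway repository; assumes `git` is on your PATH). A file whose content is completely replaced stays paired by name in a plain diff, but with `-B` the patch header records a dissimilarity index for the broken pair:

```shell
#!/bin/sh
# Demo: -B ("break") treats a completely rewritten file as delete + create.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo

seq 1 100 > a.txt
git add . && git commit -qm 'original a.txt'

seq 201 300 > a.txt                  # replace every single line
git add . && git commit -qm 'rewrite a.txt'

# Without -B this is shown as an ordinary modification; with -B the
# patch header reports how *dissimilar* the two versions are.
git diff -B HEAD~1 HEAD | grep 'dissimilarity index'
```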
Now we're back to the case where we may have right-side files that have no left-side "original" version. Whether or not we have any such right-side files, we may, at this point, have left-side files that have no right-side version. This is where `-C` and `-M` come in.

Note that `-M` is also spelled `--find-renames`, and `-C` implies `-M`. That is, we may have just rename detection enabled, or we may have both rename and copy detection enabled.
## Rename detection
Let's look at rename detection first.
If it's disabled, we declare all unpaired right-side files new. We're done! That was easy!

Otherwise, for each unpaired right-side file, we try to find a similar-enough left-side file. This means we must diff every unpaired left-side file against this one right-side file.
Suppose we have R unpaired right-side files and L unpaired left-side files. How many comparisons will we do, assuming we don't come up with something clever? For each R we must look at every file in L, so the answer is R * L comparisons. But each comparison is a similarity-index diff, which is itself expensive.
If there are few unpaired files on either side, this is not that big a deal. But suppose there are 10,000 files on the left side. Well, it's still not a big deal as long as 9,995 of those files are already paired: that leaves only 5 left-side candidates, no matter how many unpaired files we have on the right. For concreteness, let's say there are 10 unpaired R files and 5 (out of 10,000) unpaired L files.
This is how `-M` works: it only looks at unpaired left-side files. So it's not too absurdly expensive: 10 R files times 5 L candidates = 50 similarity diffs. Of course, if your computer can do about 100 similarity diffs per second, that's still about half a second of compute time.

Computers are pretty fast now, so they can probably do even more than that. And if the change is just a rename, the similarity index is 100%, which Git can detect very fast, without doing any diff at all. Git does these exact matches first, and once they're paired up, they drop out of the "hard" R set, making `-M` even less expensive.
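The similarity threshold that `-M` uses is tunable. As a sketch (throwaway repository; the exact score Git reports depends on its internal chunk hashing, so the thresholds here are chosen with wide margins): a rename whose content was heavily edited shows up as a rename only when the threshold is low enough.

```shell
#!/bin/sh
# Demo: -M takes a similarity threshold.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo

seq 1 100 > old.txt
git add . && git commit -qm 'add old.txt'

git mv old.txt new.txt
seq 1 40 > new.txt                   # keep only 40 of the original 100 lines
git add . && git commit -qm 'rename with heavy edits'

# With a high threshold, the pair is too dissimilar: delete + add.
git diff -M75% --name-status HEAD~1 HEAD
# With a low threshold, the same pair is reported as a rename.
git diff -M25% --name-status HEAD~1 HEAD
```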
## Copy detection
This is also how `-C` works by default, more or less. I say "more or less" because instead of looking only at unpaired left-side files, it looks at modified or unpaired left-side files. That is, if file `a.txt` exists in both the left and right trees, but the hash for `a.txt` on the left does not match the hash for `a.txt` on the right, Git adds `a.txt` to the left-side candidate set. Let's say that besides the 5 unpaired L files, there are 7 modified files. So instead of 50 (5 × 10) similarity tests, Git now has to do 120 ((5 + 7) × 10) similarity tests.
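A sketch of this default behavior (throwaway repository, `git` on PATH): copy `a.txt` to `b.txt` and also modify `a.txt` in the same commit. Because the source was modified, plain `-C` considers it as a copy source and finds the copy.

```shell
#!/bin/sh
# Demo: -C finds a copy when the source was modified in the same change.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo

seq 1 100 > a.txt
git add . && git commit -qm 'add a.txt'

cp a.txt b.txt                       # copy a.txt...
echo 'tweak' >> a.txt                # ...and modify the source too
git add . && git commit -qm 'copy a.txt to b.txt, tweak a.txt'

# a.txt changed, so it is a copy-source candidate even without
# --find-copies-harder; b.txt is reported as a copy.
git diff -C --name-status HEAD~1 HEAD
```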
Adding `--find-copies-harder`, however, tells Git: don't look just at unpaired or modified files, look at every left-side file. Now we have 10 new files times 10,000 old files: one hundred thousand similarity-index values to compute. Even if we can compute 1,000 per second, that's still 100 seconds to find 100,000 similarity-index values.
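The difference is easy to see in a throwaway repository (a sketch; assumes `git` on PATH): copy a file without touching the source. Plain `-C` misses the copy, because the source is unmodified; `--find-copies-harder` inspects every left-side file and finds it.

```shell
#!/bin/sh
# Demo: --find-copies-harder considers unmodified files as copy sources.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo

seq 1 100 > a.txt
git add . && git commit -qm 'add a.txt'

cp a.txt b.txt                       # copy a.txt, leave the source untouched
git add b.txt && git commit -qm 'copy a.txt to b.txt'

# Plain -C misses the copy, because a.txt was not modified:
git diff -C --name-status HEAD~1 HEAD
# --find-copies-harder also checks unmodified files, and finds it:
git diff -C --find-copies-harder --name-status HEAD~1 HEAD
```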
## Conclusion
This means one `-C` option, or setting `diff.renames` to `copies` (rather than just `true`) in your configuration, is not that expensive. Using `--find-copies-harder` is still pretty expensive, though.
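As a final sketch (throwaway repository, placeholder identity), the configuration route: with `diff.renames` set to `copies`, copy detection is on without passing `-C` on the command line.

```shell
#!/bin/sh
# Demo: diff.renames = copies enables copy detection without -C.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo
git config diff.renames copies       # like -C, but not --find-copies-harder

seq 1 100 > a.txt
git add . && git commit -qm 'add a.txt'

cp a.txt b.txt
echo 'tweak' >> a.txt                # modify the source so it is a candidate
git add . && git commit -qm 'copy and tweak'

# No -C on the command line, yet the copy is reported:
git diff --name-status HEAD~1 HEAD
```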