To expound a bit on VonC's answer, note that the `git diff` documentation says this about `-C` and `--find-copies-harder`:
> For performance reasons, by default, `-C` option finds copies only if the original file of the copy was modified in the same changeset. This flag makes the command inspect unmodified files as candidates for the source of copy. This is a very expensive operation for large projects, so use it with caution. Giving more than one `-C` option has the same effect.
The reason this is "very expensive" is the approximate-match code. Before reading any further, think about what you see in a typical diff:
```
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 96d208d..80b2387 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -347,7 +347,7 @@ test_lazy_prereq TAR_HUGE '
 	test_cmp expect actual
 '
 
-test_expect_success 'set up repository with huge blob' '
+test_expect_success LONG_IS_64BIT 'set up repository with huge blob' '
 	obj_d=19 &&
 	obj_f=f9c8273ec45a8938e6999cb59b3ff66739902a &&
 	obj=${obj_d}${obj_f} &&
```
Here we have (well, Junio Hamano has) deleted the original `test_expect_success` line and inserted a new, slightly different `test_expect_success` line. Really, he has just inserted `LONG_IS_64BIT` within the line.

We've glossed over a really big subject here, which is: how do we choose "chunks" for diffing and for displaying? We pretend the right answer is "one line at a time". Internally, Git needs to make various different choices, and it does not always make the same choice, or make it consistently; it's done more ad hoc, using whatever works well for each purpose.
Given a diff (finding one is itself somewhat tough and compute-intensive: the naive algorithm is O(MN), and the Myers algorithm is roughly O(ND)), we can then define a similarity index: how much of the original `t/t5000-tar-tree.sh` is in the new `t/t5000-tar-tree.sh`, and how much is different? Git defines this as:

> amount of addition/deletions compared to the file's size

(again, from the `git diff` documentation). See also Edward Thomson's answer here, which details the computation of the similarity index. For performance reasons it's done as a kind of stripped-down diff, rather than a full diff.
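As a rough sketch (a throwaway repository with invented file names; it only assumes `git` is on your PATH), you can see the similarity index Git computes when a renamed file is also edited slightly:

```shell
#!/bin/sh
# Demo: the similarity index for a rename plus a small edit.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo

seq 1 100 > t5000.sh                 # 100 lines of original content
git add . && git commit -qm 'add t5000.sh'

git mv t5000.sh t5001.sh             # rename it...
echo 'one extra line' >> t5001.sh    # ...and change it slightly
git add . && git commit -qm 'rename and tweak'

# The patch header reports how similar Git thinks the two versions are.
git diff -M HEAD~1 HEAD | grep 'similarity index'
```

Because only one line out of a hundred changed, the reported similarity is well above the default 50% threshold, so `-M` pairs the two files as a rename.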
There's one other important performance trick: exact matches (100%-similar files) are handled up front simply by comparing the blob hashes. This obviates the need to run any diff, even a stripped-down one. Since a pure rename leaves the file's content, and therefore its blob hash, unchanged, plain renames are detected very fast.
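A pure rename can be demonstrated in a throwaway repository (a sketch; assumes only that `git` is installed): the old and new blob hashes are identical, so Git pairs the files at 100% similarity without diffing anything.

```shell
#!/bin/sh
# Demo: a pure rename is detected by blob-hash equality alone.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo

seq 1 100 > old.txt
git add . && git commit -qm 'add old.txt'
git mv old.txt new.txt               # rename only, no content change
git commit -qm 'rename old.txt to new.txt'

# Same content, same blob hash on both sides:
git rev-parse HEAD~1:old.txt
git rev-parse HEAD:new.txt

# So rename detection reports a 100% match with no diff needed.
git diff -M --name-status HEAD~1 HEAD
# R100	old.txt	new.txt
```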
## Diffing trees
Consider what `git diff` has to do here. We're comparing two directory trees, which we can call "left" (usually the old version) and "right" (the new version). The left tree has some files: `a.txt`, `b.txt`, and so on (possibly in sub-trees, like `dir/e.txt`). The right tree has some files as well.
Suppose that the left and right trees have exactly the same files, by name. In this case, we probably just want to stop here and declare that no files were renamed and no files were copied; we can then go on to diff the contents of each file, one pair at a time. This is Git's default action: pairing purely by name is what happens by default whenever every right-side file has a left-side counterpart.
If we add `-B` ("break" big changes), however, we need to add another pass. We compare each already-paired file pair (such as "left `a.txt`, right `a.txt`") pairwise. If the files are similar enough, we declare that they're the same file. If they're too different, we break the pairing, and both halves go back to being candidates for renames and/or copies. The amount of difference that counts as "too different" is, of course, the argument to `-B`.

(This is a little bit simplified: `-B` actually takes two arguments, one for "rewrite" detection and one for "rename" detection. The main function of this first pass is to record the similarity index. As noted earlier, for performance reasons, this similarity-index computation does a different diff than the one we'll see as a patch.)
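Here is a sketch of `-B` in action (a throwaway repository; assumes `git` is on your PATH). A file whose content is completely replaced stays paired by name in a plain diff, but with `-B` the patch header records a dissimilarity index for the broken pair:

```shell
#!/bin/sh
# Demo: -B ("break") treats a completely rewritten file as delete + create.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo

seq 1 100 > a.txt
git add . && git commit -qm 'original a.txt'

seq 201 300 > a.txt                  # replace every single line
git add . && git commit -qm 'rewrite a.txt'

# Without -B this is shown as an ordinary modification; with -B the
# patch header reports how *dissimilar* the two versions are.
git diff -B HEAD~1 HEAD | grep 'dissimilarity index'
```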
Now we're back to the case where we may have right-side files that have no left-side "original" version. Whether or not we have any such right-side files, we may, at this point, have left-side files that have no right-side version. This is where `-C` and `-M` come in.

Note that `-M` is also spelled `--find-renames`, and `-C` implies `-M`. That is, we may have just rename detection enabled, or we may have both rename and copy detection enabled.
## Rename detection
Let's look at rename detection first.
If it's disabled, we declare all unpaired right-side files new. We're done! That was easy!

Otherwise, for each unpaired right-side file, we try to find a similar-enough left-side file. This means we must diff every unpaired left-side file against this one right-side file.
Suppose we have R unpaired right-side files and L unpaired left-side files. How many comparisons will we do, assuming we don't come up with something clever? For each R we must look at every file in L, so the answer is R * L comparisons. But each comparison is a similarity-index diff, which is itself expensive.
If there are few unpaired files on either side, this is not that big a deal. But suppose there are 10,000 files on the left side. Well, it's still not a big deal as long as 9,995 of those files are already paired: that leaves only 5 left-side candidates, no matter how many unpaired files we have on the right. For concreteness, let's say there are 10 unpaired R files and 5 (out of 10,000) unpaired L files.
This is how `-M` works: it only looks at unpaired left-side files. So it's not too absurdly expensive: 10 R files times 5 L candidates = 50 similarity diffs. Of course, if your computer can do about 100 similarity diffs per second, that's still about half a second of compute time.

Computers are pretty fast now, so they can probably do even more than that. And if the change is just a rename, the similarity index is 100%, which Git can detect very fast, without doing any diff at all. Git does these exact matches first, and once they're paired up, they drop out of the "hard" R set, making `-M` even less expensive.
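The similarity threshold that `-M` uses is tunable. As a sketch (throwaway repository; the exact score Git reports depends on its internal chunk hashing, so the thresholds here are chosen with wide margins): a rename whose content was heavily edited shows up as a rename only when the threshold is low enough.

```shell
#!/bin/sh
# Demo: -M takes a similarity threshold.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo

seq 1 100 > old.txt
git add . && git commit -qm 'add old.txt'

git mv old.txt new.txt
seq 1 40 > new.txt                   # keep only 40 of the original 100 lines
git add . && git commit -qm 'rename with heavy edits'

# With a high threshold, the pair is too dissimilar: delete + add.
git diff -M75% --name-status HEAD~1 HEAD
# With a low threshold, the same pair is reported as a rename.
git diff -M25% --name-status HEAD~1 HEAD
```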
## Copy detection
This is also how `-C` works by default, more or less. I say "more or less" because instead of looking only at unpaired left-side files, it looks at modified or unpaired left-side files. That is, if file `a.txt` exists in both the left and right trees, but the hash for `a.txt` on the left does not match the hash for `a.txt` on the right, Git adds `a.txt` to the left-side candidate set. Let's say that besides the 5 unpaired L files, there are 7 modified files. So instead of 50 (5 × 10) similarity tests, Git now has to do 120 ((5 + 7) × 10) similarity tests.
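A sketch of this default behavior (throwaway repository, `git` on PATH): copy `a.txt` to `b.txt` and also modify `a.txt` in the same commit. Because the source was modified, plain `-C` considers it as a copy source and finds the copy.

```shell
#!/bin/sh
# Demo: -C finds a copy when the source was modified in the same change.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo

seq 1 100 > a.txt
git add . && git commit -qm 'add a.txt'

cp a.txt b.txt                       # copy a.txt...
echo 'tweak' >> a.txt                # ...and modify the source too
git add . && git commit -qm 'copy a.txt to b.txt, tweak a.txt'

# a.txt changed, so it is a copy-source candidate even without
# --find-copies-harder; b.txt is reported as a copy.
git diff -C --name-status HEAD~1 HEAD
```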
Adding `--find-copies-harder`, however, tells Git: don't look just at unpaired or modified files, look at every left-side file. Now we have 10 new files times 10,000 old files: one hundred thousand similarity-index values to compute. Even if we can compute 1,000 per second, that's still 100 seconds to find 100,000 similarity-index values.
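The difference is easy to see in a throwaway repository (a sketch; assumes `git` on PATH): copy a file without touching the source. Plain `-C` misses the copy, because the source is unmodified; `--find-copies-harder` inspects every left-side file and finds it.

```shell
#!/bin/sh
# Demo: --find-copies-harder considers unmodified files as copy sources.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo

seq 1 100 > a.txt
git add . && git commit -qm 'add a.txt'

cp a.txt b.txt                       # copy a.txt, leave the source untouched
git add b.txt && git commit -qm 'copy a.txt to b.txt'

# Plain -C misses the copy, because a.txt was not modified:
git diff -C --name-status HEAD~1 HEAD
# --find-copies-harder also checks unmodified files, and finds it:
git diff -C --find-copies-harder --name-status HEAD~1 HEAD
```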
## Conclusion
This means one `-C` option, or setting `diff.renames` to `copies` (rather than just `true`) in your configuration, is not that expensive. Using `--find-copies-harder` is still pretty expensive, though.
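As a final sketch (throwaway repository, placeholder identity), the configuration route: with `diff.renames` set to `copies`, copy detection is on without passing `-C` on the command line.

```shell
#!/bin/sh
# Demo: diff.renames = copies enables copy detection without -C.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com   # placeholder identity
git config user.name demo
git config diff.renames copies       # like -C, but not --find-copies-harder

seq 1 100 > a.txt
git add . && git commit -qm 'add a.txt'

cp a.txt b.txt
echo 'tweak' >> a.txt                # modify the source so it is a candidate
git add . && git commit -qm 'copy and tweak'

# No -C on the command line, yet the copy is reported:
git diff --name-status HEAD~1 HEAD
```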