How do I identify & list unique hunks in a git commit?

Question

I have a commit with a large number (hundreds) of similar hunks, and I'd like to list each unique hunk in the commit in order to compare them.

I wrote the following GNU awk script, which writes each hunk to a unique file (hunk-[md5-of-hunk].txt):

BEGIN {
  hunk = ""
  buildhunk = 0
}

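# Write the accumulated hunk to hunk-<md5>.txt unless an identical hunk was already written.
function writeHunk() {
  if (length(hunk) > 0) {
    # Hash via a temp file so the hunk text never has to be shell-quoted.
    print hunk > "hunk.tmp"
    close("hunk.tmp")
    cmd = "cat hunk.tmp | md5"   # BSD/macOS md5; md5sum on Linux also prints a trailing "  -"
    cmd | getline md5
    close(cmd)
    if (!(md5 in hunkfiles)) {
      hunkfilename = "hunk-" md5 ".txt"
      print hunk > hunkfilename
      close(hunkfilename)        # release the file handle; relevant with hundreds of hunks
      hunkfiles[md5] = hunkfilename
    }
  }
}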

/^@@|^diff/ {
  writeHunk()
  hunk = ""
  buildhunk = ($1 == "@@") ? 1 : 0
}

/^[ +-]/ {
  if (buildhunk) {
    hunk = hunk $0 "\n"
  }
}

END {
  writeHunk()
  system("rm hunk.tmp")
  for (md5 in hunkfiles) {
    print hunkfiles[md5]
  }
}

I then run this with git show [commit-SHA] | awk -f my_script.awk, which creates and lists the resulting files. It works for my purposes, but is there a more efficient way to do this using git's plumbing commands?

Example

Suppose the commit's patch looks like this (reduced to 1 line of context below for clarity's sake):

diff --git a/file1.txt b/file1.txt
index a3fb2ed..4d6f587 100644
--- a/file1.txt
+++ b/file1.txt
@@ -3,2 +3,3 @@ context
 context
+added line
 context
@@ -7,2 +8,3 @@ context
 context
+added line
 context
@@ -11,2 +13,3 @@ context
 context
+added line
 context
@@ -15,2 +18,3 @@ context
 context
+different added line
 context
@@ -19,2 +23,3 @@ context
 context
+different added line
 context
@@ -23,2 +28,3 @@ context
 context
+different added line
 context
@@ -27,2 +33,3 @@ context
 context
+even more different added line
 context
@@ -31,2 +38,3 @@ context
 context
+even more different added line
 context

I want to be able to identify that there are only 3 unique hunks, and see what they are. Namely:

Unique hunk 1:

 context
+added line
 context

Unique hunk 2:

 context
+different added line
 context

Unique hunk 3:

 context
+even more different added line
 context

There is a [similar question](http://stackoverflow.com/questions/31993074/stage-hunk-non-interactively-in-git), but it never got answered. — Benjamin W., May 17 '17 at 04:46
@BenjaminW. thanks, I looked at that question, and it's similar in that both concern hunks, but my question is more about identifying & listing unique hunks in a commit, while that question is about non-interactively adding hunks in some scripted way. I'll update my question to clarify. — tavnab, May 17 '17 at 04:49
I agree, that's why I didn't mark as a duplicate, but I think a solution that programmatically gives you your hunks separately could be used to solve that other problem as well. — Benjamin W., May 17 '17 at 04:50
@BenjaminW. Agreed & appreciated :) I have a feeling git already does this somewhere in its plumbing. I'm hoping someone who knows git's internals well enough has some simple one-liner to list a commit's unique hunks. — tavnab, May 17 '17 at 05:01

score 2 · Accepted Answer · answered May 17 '17 at 05:38

Commits are snapshots, and as such, they don't have diff hunks.

Diffs, of course, do have diff hunks. So if you have just one commit, you cannot do this at all. You need two commits. You then simply diff them and do what you are doing.

Note that git show <commit-hash> really means git diff <parent or parents of commit> <commit-hash>. If the specified commit is a merge commit, this produces a combined diff, which is probably not useful for your purposes since combined diffs intentionally omit many changes entirely. You might want to run an explicit diff against the commit's first parent only (to view only changes brought in as part of the merge).
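As a minimal sketch of that explicit first-parent diff, assuming a POSIX shell (first_parent_diff is a hypothetical helper name, not a git command):

```shell
# first_parent_diff: print the patch between a commit and its first parent.
# The function name is hypothetical; "<commit>^1" is standard revision
# syntax for "first parent of <commit>".
first_parent_diff() {
  commit=$1
  git diff "${commit}^1" "${commit}"
}
```

The output can be piped into the same awk script, e.g. first_parent_diff [commit-SHA] | awk -f my_script.awk. A root commit has no parent, so this sketch would fail there.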

There are some parts of Git that internally do something like what you're doing, for git rerere and git patch-id. However, they don't do exactly what you're doing: for rerere they record only diff hunks where there was a merge conflict, and match up those diff hunks (saved by hash ID and file name) with resolutions recorded later. For patch-id they strip off line numbers and white-space but accumulate the entire set of changes from a commit into one big piece. It might be nice if Git had a bit of plumbing that did the git patch-id part hunk by hunk, independent of computing the overall patch ID for the commit, but it doesn't.
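Lacking such plumbing, a per-hunk comparison in the spirit of patch-id can be approximated outside Git. The following is only a sketch, assuming a POSIX shell and awk. The name unique_hunks is hypothetical, and the normalization (drop the @@ header, ignore spaces and tabs within each line) approximates the behavior described above rather than reproducing Git's actual algorithm. It keys an awk array on the hunk text directly, so no temp file or external hash command is needed.

```shell
# unique_hunks: read a unified diff on stdin and print each distinct hunk body once.
# Assumption: two hunks count as the same if they match after ignoring the
# @@ line numbers and any spaces/tabs inside each line; the leading
# context/add/remove marker of every line is kept.
unique_hunks() {
  awk '
    # Print the pending hunk if its normalized form has not been seen yet.
    function emit() {
      if (hunk != "" && !(key in seen)) {
        seen[key] = 1
        printf "Unique hunk %d:\n\n%s\n", ++n, hunk
      }
      hunk = ""; key = ""
    }
    /^@@/   { emit(); inhunk = 1; next }   # new hunk: drop the line-number header
    /^diff/ { emit(); inhunk = 0; next }   # new file: skip headers until the next @@
    inhunk && /^[ +-]/ {
      hunk = hunk $0 "\n"
      rest = substr($0, 2)
      gsub(/[ \t]/, "", rest)              # whitespace-insensitive comparison key
      key = key substr($0, 1, 1) rest "\n"
    }
    END { emit() }
  '
}
```

Usage would look like git diff [parent-SHA] [commit-SHA] | unique_hunks. Removing the gsub line makes the comparison exact, which matches the md5-based behavior of the script in the question.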

thanks! You're of course right that commits don't have hunks in their own right, but rather the hunks are a product of comparing 2 commits (and that the hunks themselves are context-sensitive). Supposing I isolate the hunks myself, do you see any advantage of using `patch-id` in place of md5 to ID the hunks? — tavnab, May 17 '17 at 06:42
@tavnab: Might be a bit faster, might not. See whether you like what it does to white space. Also you'll get sha-1s rather than md5s, so you get a few more bits of hash (160 bits, vs 128). None of these seem big arguments one way or another, except for treatment of white space. — torek, May 17 '17 at 06:46
I think I'll stick with md5 for now, but good to know I have the whitespace-stripping option of `patch-id` in case I need it. — tavnab, May 17 '17 at 06:50
