108

I'm trying to find some good examples of semantic diff/merge utilities. The traditional paradigm of comparing source code files works by comparing lines and characters, but are there any utilities out there (for any language) that actually consider the structure of code when comparing files?

For example, existing diff programs will report "difference found at character 2 of line 125. File x contains v-o-i-d, where file y contains b-o-o-l". A specialized tool should be able to report "Return type of method doSomething() changed from void to bool".

I would argue that this type of semantic information is actually what the user is looking for when comparing code, and should be the goal of next-generation programming tools. Are there any examples of this in available tools?

Paŭlo Ebermann
jasonmray
  • 3
    Looks like there has been some research done on tree edit distance. Applying that to the ASTs seems like it would be the first thing to try. (If someone wanted to try writing this kind of thing.) – Jay Kominek Feb 07 '09 at 08:40
  • 2
    I'm not sure if it would really be useful. A difference like the one you mentioned is more easily seen than read, especially if you have a tool highlighting differences *within* a line. The ability to recognize when some code has just been moved around unchanged would be easier and more useful, IMHO! – UncleZeiv Feb 07 '09 at 08:48
  • 2
    @UncleZeiv I would hope that feature would naturally follow from the nature of the tool. In addition, it would be able to detect that there are no changes if someone went through and changed the curly brace or indent styles, for example, or rearranged the file so static methods are grouped, etc. – jasonmray Feb 07 '09 at 09:38
  • @Jay: Someone did. See my answer. – Ira Baxter Nov 24 '10 at 21:29
  • 8
    I need this in Visual Studio now. Forcing developers within a team to use the same formatting structure to facilitate diffs is backwards thinking. The code should be formatted to some standard on check-in, and any time a developer opens a file, it should be formatted to their liking. I'm shocked this sort of thinking isn't more widespread at this point. – Langdon Feb 09 '12 at 03:01
  • Coming here from [git diff algorithm that does not rip functions apart? (language-aware diff)](http://stackoverflow.com/questions/24162687/git-diff-algorithm-that-does-not-rip-functions-apart-language-aware-diff). – donquixote Jul 02 '14 at 16:53
  • 3
    IMHO this is a fine topic for SO. If you agree, vote to "reopen". – Ira Baxter Sep 26 '16 at 14:00
  • There is https://github.com/GumTreeDiff/gumtree whose algorithm has also been implemented in clang-diff: https://github.com/krobelus/clang-diff-playground – Trass3r Nov 13 '18 at 01:02

8 Answers

37

We've developed a tool that deals with precisely this scenario. Check http://www.semanticmerge.com

It merges (and diffs) based on code structure rather than text-based algorithms, which allows you to deal with cases like the following, involving heavy refactoring. It is also able to render both the differences and the merge conflicts as you can see below:

enter image description here

And instead of getting confused by text blocks being moved, since it parses first, it is able to display the conflicts on a per-method basis (per element, in fact). A case like the previous one won't even have manual conflicts to resolve.

enter image description here

It is a language-aware merge tool and it has been great to be finally able to answer this SO question :-)

pablo
30

Eclipse has had this feature for a long time. It's called "Structure Compare", and it's very nice. Here is a sample screenshot for Java, followed by another for an XML file:

(Note the minus and plus icons on methods in the upper pane.)

[Screenshots: Eclipse's Java Structure Comparer; Eclipse's XML Structure Comparer]

Hosam Aly
  • 3
    Does Structure Compare allow you to merge changes like other source control merge editors? For example, copying a method from one version to the other. – Jonathan Parker Mar 08 '09 at 03:13
  • 1
    Yes, when you select a change or a difference (either in the upper or lower panes), the toolbar buttons (shown in the screenshots) give you the option to copy the change from left to right or vice versa. – Hosam Aly Mar 08 '09 at 08:03
  • 1
    Unfortunately, the screenshots are no longer visible in your (highest-upvoted and accepted!) answer. Could you submit them again? – blubb Apr 23 '13 at 07:27
  • @blubb Thanks for notifying me. I've corrected the error with the Java Comparer image. I'll try to add a screenshot for the XML Structure Comparer soon. – Hosam Aly Apr 23 '13 at 11:41
  • Does not work for nested classes... – leppie Sep 30 '13 at 11:10
  • 1
    And does that work for languages other than Java? – einpoklum Mar 20 '17 at 09:36
14

To do "semantic comparisons" well, you need to compare the syntax trees of the languages, and take into account the meaning of symbols. A really good semantic diff would understand the language semantics, and realize when one block of code was equivalent in function to another. Going this far requires a theorem prover, and while it would be extremely cute, isn't presently practical for a real tool.

A workable approximation of this is simply comparing syntax trees, and reporting changes in terms of structures inserted, deleted, moved, or changed. Getting somewhat closer to a "semantic comparison", one could report when an identifier is changed consistently across a block of code.
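A minimal sketch of that approximation, written in Python against Python source with the standard `ast` module. It is not the algorithm of any tool mentioned in this thread. The helper names `function_signatures` and `structural_diff` are hypothetical, it assumes Python 3.9+ for `ast.unparse`, and it keys functions by bare name, so same-named methods in different classes would collide:

```python
import ast


def function_signatures(source: str) -> dict[str, tuple[str, str, str]]:
    """Map each function name to (parameters, return annotation, body)."""
    signatures = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.dump omits line/column info by default, so layout and
            # comments do not influence the comparison.
            params = ast.dump(node.args)
            returns = ast.unparse(node.returns) if node.returns else "<unannotated>"
            body = "".join(ast.dump(stmt) for stmt in node.body)
            signatures[node.name] = (params, returns, body)
    return signatures


def structural_diff(old: str, new: str) -> list[str]:
    """Report changes in terms of functions rather than lines and characters."""
    before, after = function_signatures(old), function_signatures(new)
    report = []
    for name in sorted(before.keys() - after.keys()):
        report.append(f"function {name}() removed")
    for name in sorted(after.keys() - before.keys()):
        report.append(f"function {name}() added")
    for name in sorted(before.keys() & after.keys()):
        old_params, old_ret, old_body = before[name]
        new_params, new_ret, new_body = after[name]
        if old_ret != new_ret:
            report.append(
                f"return type of {name}() changed from {old_ret} to {new_ret}"
            )
        if old_params != new_params:
            report.append(f"parameters of {name}() changed")
        if old_body != new_body:
            report.append(f"body of {name}() changed")
    return report
```

With this sketch, reformatting alone yields an empty report, while changing `-> None` to `-> bool` is reported as a return-type change rather than as a character offset. It does not detect moves or consistent renames; those would need real tree matching.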

See our http://www.semanticdesigns.com/Products/SmartDifferencer/index.html for a syntax tree-based comparison engine that works with many languages and performs the above approximation.

EDIT Jan 2010: Versions available for C++, C#, Java, PHP, and COBOL. The website shows specific examples for most of these.

EDIT May 2010: Python and JavaScript added.

EDIT Oct 2010: EGL added.

EDIT Nov 2010: VB6, VBScript, VB.net added

Ira Baxter
  • 2
    Hi Ira, have you published a paper on your diff algorithm? I'm having trouble finding tree-edit distance diff literature. Thanks, Terence. – Terence Parr Nov 20 '10 at 18:18
  • To be more specific, looking for diff3 not plain diff2 – Terence Parr Nov 20 '10 at 18:26
  • 2
    @Terence: No publication exists of our diff algorithm. It is a Levenshtein min distance computation using suffix trees to identify equal subtrees, with some heuristics to handle renaming. IIRC, Yang had a paper on this in Software Practice and Experience. Ours and Yang's are diff2, not diff3. – Ira Baxter Nov 20 '10 at 20:47
  • @IraBaxter The link is currently broken and site seems to be down when opening from google link. – Răzvan Flavius Panda Aug 28 '17 at 12:43
  • Site is back up, link should be OK. – Ira Baxter Aug 31 '17 at 11:23
12

What you're groping for is a "tree diff". It turns out that this is much harder to do well than a simple line-oriented textual diff, which is really just the comparison of two flat sequences.

"A Fine-Grained XML Structural Comparison Approach" concludes, in part, with:

Our theoretical study as well as our experimental evaluation showed that the proposed method yields improved structural similarity results with respect to existing alternatives, while having the same time complexity (O(N^2))

(emphasis mine)

Indeed, if you're looking for more examples of tree differencing I suggest focusing on XML since that's been driving practical developments in that area.
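To make the contrast with a flat, line-oriented diff concrete, here is a deliberately naive sketch of an XML tree comparison in Python using the standard `xml.etree.ElementTree` module. It is not the method from the paper quoted above and it does not compute a minimum tree edit distance: it simply pairs children by position. The function name `diff_elements` and the report wording are hypothetical:

```python
import xml.etree.ElementTree as ET


def diff_elements(old: ET.Element, new: ET.Element, path: str = "") -> list[str]:
    """Recursively compare two elements, pairing children by position."""
    here = f"{path}/{old.tag}"
    if old.tag != new.tag:
        return [f"{here}: replaced by <{new.tag}>"]
    changes = []
    if old.attrib != new.attrib:
        changes.append(
            f"{here}: attributes changed from {old.attrib} to {new.attrib}"
        )
    # Ignore pure-whitespace differences such as indentation between tags.
    if (old.text or "").strip() != (new.text or "").strip():
        changes.append(f"{here}: text changed from {old.text!r} to {new.text!r}")
    old_children, new_children = list(old), list(new)
    for old_child, new_child in zip(old_children, new_children):
        changes.extend(diff_elements(old_child, new_child, here))
    # Whatever is left over on either side was removed or added.
    for extra in old_children[len(new_children):]:
        changes.append(f"{here}: child <{extra.tag}> removed")
    for extra in new_children[len(old_children):]:
        changes.append(f"{here}: child <{extra.tag}> added")
    return changes
```

Because children are paired by position, inserting one element at the front makes every later sibling look changed. Avoiding that is exactly where the harder tree-matching algorithms come in.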

Răzvan Flavius Panda
bendin
  • Thanks for the link. I can think of a few different approaches for implementing semantic diff tools, and you are correct -- most can be abstracted into a "tree diff". More complex situations may even need to be abstracted into a "graph diff". – jasonmray Mar 07 '09 at 20:56
  • Yea. IBM's Rational Modeler (built on eclipse) tries to do this with UML models (showing the differences between two models graphically). I can't comment on the usefulness of the results as I don't use it much. – bendin Mar 07 '09 at 20:59
  • I agree that XML is a good place to start, as you can simply come up with schemas to represent other structures (such as java code, for example), and use an XML based tree-diff to implement a code diff. – jasonmray Mar 07 '09 at 20:59
  • "do this" => do something akin to a "graph diff". – bendin Mar 07 '09 at 21:00
  • 1
    See http://www.semdesigns.com/Products/SmartDifferencer/index.html for a syntax tree-based comparison engine that works with many languages. – Ira Baxter Jun 17 '09 at 09:48
5

Shameless plug for my own project:

HTML Tree Diff, written in Python, does structure-aware comparison of XML and HTML documents.

http://pypi.python.org/pypi/html-tree-diff/0.1.0

Christian Oudard
2

A company called Zynamics offers a binary-level semantic diff tool. It uses a meta-assembly language called REIL to perform graph-theoretic analysis of 2 versions of a binary, and produces a color-coded graph to illustrate differences between them. I am not sure of the price, but I doubt it is free.

David V McKay
  • Link to binary-level semantic diff: https://www.zynamics.com/bindiff.html – emallove Jan 19 '17 at 16:17
  • bindiff is now free, and binnavi (their other product) is open source. It appears that REIL is included in the binnavi release - https://github.com/google/binnavi/tree/master/src/main/java/com/google/security/zynamics/binnavi/REIL – Mark Aug 28 '20 at 01:39
2

The solution to this would be on a per-language basis. That is, unless the tool is designed with a plugin architecture that defers much of the parsing of the code into a tree, and the semantic comparison, to a language-specific plugin, it will be very difficult to support multiple languages. What language(s) are you interested in having such a tool for? Personally I'd love one for C#.

For C# there is an assembly diff add-in for Reflector, but it only diffs the IL, not the C#.

You can download the diff add-in here [zip] or go to the project on the codeplex site here.
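The plugin architecture described above could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the `PARSERS` registry, the `register` decorator, and `structurally_equal` are hypothetical names, and only a Python plugin (built on the standard `ast` module) is provided, since a C# plugin would need a C# parser:

```python
import ast
from typing import Callable

# Hypothetical registry: file extension -> function that turns source text
# into a formatting-independent representation of its structure.
PARSERS: dict[str, Callable[[str], str]] = {}


def register(extension: str):
    """Decorator that installs a language plugin for one file extension."""
    def decorator(parser: Callable[[str], str]) -> Callable[[str], str]:
        PARSERS[extension] = parser
        return parser
    return decorator


@register(".py")
def parse_python(source: str) -> str:
    # ast.dump omits positions by default, so layout and comments vanish.
    return ast.dump(ast.parse(source))


def structurally_equal(extension: str, old: str, new: str) -> bool:
    """Language-neutral core: all parsing is deferred to the plugin."""
    try:
        parser = PARSERS[extension]
    except KeyError:
        raise ValueError(f"no language plugin registered for {extension!r}") from None
    return parser(old) == parser(new)
```

The core never inspects source text itself, so supporting another language means registering one more parser rather than changing the comparison logic.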

Jonathan Parker
  • 1
    See http://www.semdesigns.com/Products/SmartDifferencer/index.html for a syntax tree-based comparison engine that works with many languages, using exactly the language plugin style. Not released yet, but a C# version is very close. – Ira Baxter Jun 17 '09 at 09:49
  • Jan 2010: C# Smart Differencer is released. – Ira Baxter Feb 25 '10 at 09:49
2

http://prettydiff.com/

Pretty Diff minifies each input to remove comments and unnecessary white space and then beautifies the code prior to running the diff algorithm. I cannot think of any way to become more code semantic than this. And it's written in JavaScript, so it runs directly in the browser.
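The same normalize-then-diff idea can be sketched in Python for Python source. This is not Pretty Diff's code: it swaps in the standard `ast` module as the normalizer and `difflib` for the text diff, the function names are hypothetical, and it assumes Python 3.9+ for `ast.unparse`:

```python
import ast
import difflib


def normalized_lines(source: str) -> list[str]:
    """Drop comments and layout by parsing, then emit one canonical layout."""
    return ast.unparse(ast.parse(source)).splitlines()


def normalized_diff(old: str, new: str) -> list[str]:
    """Ordinary line-based unified diff, run on the normalized text."""
    return list(
        difflib.unified_diff(
            normalized_lines(old),
            normalized_lines(new),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )
```

Whitespace, comment, and redundant-parenthesis changes disappear from the output. The comparison is still line-oriented afterwards, though, so swapping two unchanged functions is reported as a deletion plus an insertion.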

austincheney
  • 6
    Then you have a limited imagination! What about swapping the positions of two methods in a file while leaving them unchanged? What about refactorings? – Robin Green Jul 02 '11 at 18:13
  • (You can't swap around data declarations in Java this way, and still have equivalence, due to initializers; I assume C# has similar troubles). If you go for pure semantic diff, then you are trying to solve Turing machine equivalence. There is a lot of room for doing better than pure text matching, and worse than Turing-impossible. – Ira Baxter Dec 19 '12 at 18:51
  • @IraBaxter The tool conceptually will obviously only show as equivalent things which actually are equivalent. If properly coded it won't have the issue type you are mentioning. – Răzvan Flavius Panda Aug 28 '17 at 12:49
  • "Properly coded" means proving algorithm equivalence if you want the ultimate tool. Algorithm equivalence proofs are Turing-hard in general, so you aren't going to get such a tool in practice. What you might get is a tool that handles *some* equivalences other than just syntax changes. To date, I've not seen anybody attempt to build such a tool. – Ira Baxter Aug 31 '17 at 11:35