6

I am sorry - C++ source code can be seen as the implementation of a design, and by reverse-engineering I mean getting the design back. It seems most of you have read it as getting C++ source from binaries. I have posted a more precise question at "Understanding a C++ codebase by generating UML - tools&methology".


I think there are many tools that can reverse-engineer C++ (source-code), but usually it is not so easy to make sense of what you get out.

Has anybody found a good methodology?

One of the things I might want to see, for example, is the GUI layer and how it is separated (or not) from the rest. I think the tools should somehow detect packages, and then let me organize them manually.

Olav

4 Answers

9

To my knowledge, there are no reliable tools that can reverse-engineer compiled C++.

Moreover, I think it should be near impossible to construct such a tool. A compiled C++ program becomes nothing more than machine language instructions. In order to know how that maps back to C++ constructs, you need to know the compiler, compiler settings, libraries included, etc., ad infinitum.

Why do you want such a thing? Depending on what you want it for, there may be other ways to accomplish what you're really after.

John Dibling
  • It's also not a bijection even given that compiler info: by the time a few function template calls have been inlined, instantiated with different types, it's impossible to know, and may be extremely difficult to guess, that they really all came from the same template in the first place. Unless the binary has debug info, of course. – Steve Jessop Nov 24 '10 at 00:09
  • Pretty much agreed- you can't realistically reverse engineer C++. – Puppy Nov 24 '10 at 00:10
  • You guys are assuming that the goal is to restore the original source code. It is hypothetically possible to get back something functionally equivalent (even if this means that template instantiations will look like independent types and functions). However, there exists no tool that does this well at the moment. – Evan Teran Nov 24 '10 at 03:42
  • @Evan Teran: check out the Hex-Rays decompiler. – Igor Skochinsky Nov 29 '10 at 14:10
  • @Igor: Read my answer, I recommended Hex-Rays to the OP... – Evan Teran Nov 29 '10 at 17:55
  • Well, I would say Hex-Rays does the job of "getting back something functionally equivalent" pretty well (with some help from the user), and it's not "hypothetical". – Igor Skochinsky Dec 01 '10 at 11:52
3

You can pull control flow out with disassembly, but you will never get data types back...

There are only integers (and maybe some shorts) in assembly. Think about objects, arrays, structs, strings, and pointer arithmetic all being the same type!

nate c
  • "you will never get data types back" - I wonder. vtables might be quite recognisable, and probably anything with external linkage will have names that can be demangled. You might be able to figure out a reasonable amount about many classes, but what you can't do is find all the calls to that class, since in general some will be inlined. – Steve Jessop Nov 24 '10 at 01:59
  • @Steve: The names aren't there. All that is left is memory addresses. – John Dibling Nov 24 '10 at 02:23
  • @Steve Jessop: inline functions aren't actually an issue, that will just appear to the "decompiler" as multiple functions with repeated code. Sure it won't look like the original source, but it may be functionally equivalent, which is what really matters for reverse engineering. – Evan Teran Nov 24 '10 at 03:35
  • @Evan: OK, but on that basis you can "reverse-engineer" C++ code by decompiling it to assembler or C. – Steve Jessop Nov 24 '10 at 11:14
  • @John: well then how come `dlopen` works on programs? I don't mean that the names are in the vtable, just that they're probably in the symbol table for the executable, so given the addresses you can look them up in reverse. The executable *may* have had external symbols stripped, of course. – Steve Jessop Nov 24 '10 at 11:15
  • @Steve: sure, that could be a first step, though the reader of the assembly has to be skilled enough to know what they are looking at. The point of reverse engineering is to develop an understanding of what the code does. Certainly this can be extracted from the ASM itself. – Evan Teran Nov 25 '10 at 18:58
3

While it isn't a complete solution, you should look into IDA Pro and Hex-Rays.

It is more for "reverse engineering" in the traditional sense of the phrase. As in, it will give you a good enough idea of what the code would look like in a C like language, but will not (cannot) provide fully functioning source code.

What it is good for is getting a good understanding of how a particular segment (usually a function) works. It is "user assisted", meaning that it will often show a lot of dereferences of raw offsets where there is really a struct or class. At that point, you can supply the decompiler with a struct definition (classes are really just structs with extra things like v-tables and such) and it will reanalyze the code with the new type information.

Like I said, it isn't perfect, but if you want to do "reverse engineering" it is the best solution I am aware of. If you want full "decompilation" then you are pretty much out of luck.

Evan Teran
1

The OovAide project at http://sourceforge.net/projects/oovaide/ or on GitHub has a few features that may help. It uses the Clang compiler to retrieve accurate information from the source code. It scans the directories looking for source code, and collects the results into a smaller dataset that contains the information needed for analysis.

One concept is called Zone Diagrams. These show relationships between classes at a very high level: each class is shown as a dot on the diagram, with relationship lines connecting the dots. This allows the diagrams to show hundreds or thousands of classes. The OovAide zone diagram display has an option called "Show Child Zones", which places classes that are in the same directory closer to each other. There are also directory filters, which reduce the number of classes shown on a diagram for very large projects. An example of zone diagrams and how they work is shown here: http://oovaide.sourceforge.net/articles/ZoneDiagrams.html

If the directories are assigned component types in the build settings, then the component diagram will show the dependencies between components. This even shows which components are dependent on external components such as GTK, or other external libraries.

The next level down shows something like UML class diagrams, but shows all relations instead of just aggregation and inheritance. It can show classes that are used within methods, or classes that are passed as parameters to methods. Any class can be chosen as a starting point; then, before a class is added to the diagram, a list is displayed that shows which classes will be added for each relationship type.

The lowest level shows sequence diagrams. This allows navigating up or down the call tree while showing the classes that contain the methods.