Decompilers analyze binary code and output source code in a higher-level language such as C. The output is generally not any easier to analyze than the original assembler, because information is lost during compilation.
The concept of a decompiler seems simple to most people. A compiled binary was created from source code, so the operation seems like it should be reversible. However, there are some challenges that a decompiler faces:
- Decomposing assembler into basic blocks.
- Loss of information during compilation.
Decomposing basic blocks
Hand-crafted assembler may confound the analysis that splits code into basic blocks, which in turn prevents the creation of a control flow graph. For example, hand-crafted assembler is not bound to follow a function prologue and epilogue. Assembler may make use of instructions that do not map to a higher-level language. It may use self-modifying code and multiple entry points (even mid-instruction), either for legitimate purposes or to foil reverse engineering. Aggressive compiler optimization may produce the same effects in some cases.
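To make the idea of a basic block concrete, here is a minimal sketch of the classic "leader" rule: the first instruction, every jump target, and every instruction following a jump or return starts a new block. The `insn` structure, the `op_kind` values, and the function name `mark_leaders` are hypothetical and do not correspond to any real instruction set or decompiler. The sketch assumes every jump target is a known instruction index, which is exactly the assumption that self-modifying code, computed jumps, and mid-instruction entry points break.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy instruction model (hypothetical, not a real ISA). */
enum op_kind { OP_OTHER, OP_JUMP, OP_COND_JUMP, OP_RETURN };

struct insn {
    enum op_kind kind;
    size_t target; /* index of the jump target; unused for OP_OTHER and OP_RETURN */
};

/*
 * Marks the first instruction of each basic block in is_leader[]
 * and returns the number of basic blocks found.
 */
size_t mark_leaders(const struct insn *code, size_t n, bool *is_leader)
{
    size_t i, blocks = 0;

    for (i = 0; i < n; i++)
        is_leader[i] = false;

    /* Rule 1: the first instruction starts a block. */
    if (n > 0)
        is_leader[0] = true;

    for (i = 0; i < n; i++) {
        if (code[i].kind == OP_OTHER)
            continue;

        /* Rule 2: the target of a jump starts a block. */
        if (code[i].kind != OP_RETURN && code[i].target < n)
            is_leader[code[i].target] = true;

        /* Rule 3: the instruction after a jump or return starts a block. */
        if (i + 1 < n)
            is_leader[i + 1] = true;
    }

    for (i = 0; i < n; i++)
        if (is_leader[i])
            blocks++;

    return blocks;
}
```

Once the leaders are known, each block runs from a leader up to the instruction before the next leader, and the jumps between blocks become the edges of the control flow graph. If a jump target cannot be determined statically, the partition is incomplete and the graph cannot be trusted.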
Loss of information
Comments and variable names are obviously lost during compilation. In addition, compilers aggressively optimize code, and a key part of that is keeping high-level variables in registers. Because of this, a register may be reused for many different high-level variables. As a result, the decompiled code may have a different number of variables and a different control structure from the original code. Also, different compilers (or even different optimization levels) generate different code for the same source code; i.e., the source-to-machine mapping is compiler dependent. Without hints, a decompiler cannot generically regenerate the same source. Often the decompiled code will resemble obfuscated code.
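As an illustration of this loss, the sketch below pairs a small function with what decompiled output for it might look like. The second function is hand-written for illustration and is not the output of any actual decompiler; the names `sub_401000`, `a1`, `a2`, and `v1` to `v3` are placeholders in the style such tools commonly generate. It assumes the optimizer turned the indexed loop into a pointer walk and reused the register holding the array parameter as the loop cursor.

```c
#include <stddef.h>

/* Original source: descriptive names, an indexed for loop. */
int sum_of_positives(const int *values, size_t count)
{
    int total = 0;
    for (size_t index = 0; index < count; index++) {
        int value = values[index];
        if (value > 0) {
            total += value;
        }
    }
    return total;
}

/*
 * Hypothetical decompiler-style rendering of the same logic.
 * - Names and comments are gone; placeholders take their place.
 * - The index variable has vanished: the parameter a1 is reused as the cursor.
 * - The for loop has become a guarded do/while with an end pointer.
 */
int sub_401000(int *a1, unsigned long a2)
{
    int v1 = 0;
    int *v2 = a1 + a2;

    if (a1 == v2)
        return 0;
    do {
        int v3 = *a1;
        ++a1;
        if (v3 > 0)
            v1 += v3;
    } while (a1 != v2);
    return v1;
}
```

Both functions compute the same result, but the second has a different set of variables and a different loop shape, and nothing in it hints at the intent that the original names conveyed.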
Cristina Cifuentes's research paper from Queensland University of Technology gives more technical detail on how a decompiler works. The Boomerang project is an example of an open source decompiler.
Some general uses of a decompiler:
- Retargeting code to a different instruction set.
- Analyzing a binary for security issues.
- Patching code for an operating system update.
Due to the loss of information, decompiled code may not assist in understanding assembler code. It certainly cannot reproduce the original source code. Examining decompiled code can, however, give an appreciation of good variable naming.
See also: disassembling, reverse-engineering