7

I'm not too deeply rooted in the very formal side of static code analysis, hence this question.

A couple of years ago I read that distinguishing code from data using static code analysis is equivalent to the Halting Problem. (Citation needed, but I don't have it anymore. Stackoverflow has threads on this here or here.) At least for common computer architectures based on the Von Neumann architecture where code and data share the same memory this seemed to make sense.

Now I'm looking at the static analysis of C/C++ code and pointer analysis; the program does not execute. Somehow I have a feeling that tracking all creations and uses of pointer values statically is similar to the Halting Problem because I can not determine if a given value in memory is a pointer value, i.e. I can not track the value-flow of pointer values through memory. Alias analysis may narrow down the problem, but it seems to become less useful in the face of multi-threaded code.

(One might even consider tracking arbitrary values, not just pointers: constructing a complete value-flow for any given "interesting" value seems equivalent to the Halting Problem.)

As this is just a hunch, my question is: are the more formal findings on this that I can refer to? Am I mistaken?

Community
  • 1
  • 1
Jens
  • 6,642
  • 6
  • 45
  • 64
  • @Dukeling: Well, _static_ analysis seems to imply that the program is _not_ running. I think that the OP is trying to analyze the program looking at the C/C++ source code. Or maybe the generated assembler? – rodrigo Dec 30 '13 at 22:59
  • [How to ask about the 'halting problem' on StackOverflow, regarding the title filter](http://meta.stackexchange.com/questions/192515/how-to-ask-about-the-halting-problem-on-stackoverflow-regarding-the-title-fil?rq=1) - no useful answers there though :( – Digital Trauma Dec 30 '13 at 23:08
  • The premise of this question is trumped by the intent. You use static analysis to discover problems in code, undefined behavior in particular. So your starting point is *assuming* that you have a mal-formed program and you cannot make any assumptions that language rules apply. You cannot formulate a theory on a foundation without rules. – Hans Passant Dec 30 '13 at 23:41
  • @HansPassant: not sure I grok what you mean? – Cheers and hth. - Alf Dec 31 '13 at 00:00
  • @HansPassant: I think that's irrelevant. Even if I assume a correct program, I still seem to be unable to track all pointer values conclusively. – Jens Dec 31 '13 at 02:26
  • 3
    @HansPassant Your remark containing the word “assuming” is easily circumvented by limiting oneself to static analyzers that discover the first Undefined Behavior in an execution of the C program on the “abstract machine” that is supposed to be defined by the C standard. You are right that predicting anything after the first UB is unreasonable. There are two articles that cover this in some detail in the proceedings of http://compare2012.verifythis.org/ . You should recognize them easily if interested (or just look for the article of which I am a co-author). – Pascal Cuoq Dec 31 '13 at 02:58

4 Answers4

3

You can always code up this:

extern bool some_program_halts();
extern int* invalid_pointer();

#include <iostream>
int main()
{
    using namespace std;
    if( some_program_halts() ) { cout << *invalid_pointer() << endl; }
}

Checking whether this program dereferences the invalid pointer is equivalent to finding out whether the call to some_program_halts(), uh, halts.

Cheers and hth. - Alf
  • 135,616
  • 15
  • 192
  • 304
  • Oh, by the way, just because!, a link to my ["Ironclad proof that you are smarter than Roger Penrose"](http://alfps.wordpress.com/2010/06/03/an-ironclad-proof-that-you-are-smarter-than-roger-penrose/). He he. :) – Cheers and hth. - Alf Dec 30 '13 at 23:27
  • Agreed. Now let's assume a closed world (where we do have access to all code, and no code is added at runtime. :) – Jens Dec 31 '13 at 02:31
  • @Jens: I'm sorry, I don't understand. – Cheers and hth. - Alf Dec 31 '13 at 02:41
  • 1
    The above example calls out to two `extern` functions, i.e. functions whose behavior can not be examined because their definition is missing. That's an open world scenario. In a closed world scenario, all possible (i.e. called/referenced) functions are defined and available to the analysis. – Jens Dec 31 '13 at 21:39
  • 1
    Oh I didn't mean to imply any missing definitions. From a C++ point of view the keyword `extern` is technically redundant in the above code. You can just remove it if it seems to indicate "missing". ;-) – Cheers and hth. - Alf Dec 31 '13 at 21:47
1

It's almost certainly equivalent, modulo the fact that C is not a turing-equivalent language (a given C implementation is a gigantic finite state machine rather than a turing machine, due to the Representation of Types). Pointers need not be kept in their original representations in objects whose effective type is pointer type; you can examine the representation and perform arbitrary operations on it, for example, encrypting pointers and decrypting them later. Determining whether an arbitrary computation is reversible, or whether two computations are inverses of one another, is (offhand) probably equivalent to determining halting.

R.. GitHub STOP HELPING ICE
  • 195,354
  • 31
  • 331
  • 669
  • 1
    re " C is not a turing-equivalent language", if you mean "turing complete" then that's wrong. the core language does not define any i/o facilities or means to interface with hardware or anything, so as a statement about the core language it's literally true, but totally irrelevant. once you add i/o, however, you get *infinite memory*. and that observation was made long before Turing's time, even by George Boole. – Cheers and hth. - Alf Dec 30 '13 at 23:06
  • 1
    C does not define IO facilities that allow addressing unlimited storage; they may exist as an implementation extension, but not something usable by a strictly conforming program. – R.. GitHub STOP HELPING ICE Dec 30 '13 at 23:17
  • @R: Oh, the old overwhelm-with-assertions thing. but ok. the first part of the sentence, "C does not define IO facilities that allow addressing unlimited storage" is trivially wrong. Anyone can write a program that accesses unlimited storage. Given such an infinity of counter-examples it's hard to imagine how you thought anyone would believe the assertion. Re "they may exist as an implementation extension", certainly, that also, but the standard library already covers it. "but not something usable by a strictly conforming program" is incorrect. in short, wrong, irrelevant, wrong. – Cheers and hth. - Alf Dec 30 '13 at 23:25
  • 1
    POSIX claims not to conflict with ISO C, and requires `fseek` to fail with `EOVERFLOW` if the resulting offset is not representable in type `long`. Assuming this behavior conforms with ISO C (you may dispute that), that destroys one of your best chances of achieving unlimited storage, though perhaps you could argue that it's recoverable by using `rewind` and repeated `getc` to position yourself. However even then, the number of times you call `getc` is limited by the largest number you can otherwise represent; you could apply the principle of using files to store "unlimited size" data... – R.. GitHub STOP HELPING ICE Dec 31 '13 at 00:04
  • ...recursively, but eventually you hit the fact that the number of recursion levels you can have is limited by `CHAR_BIT*sizeof(void *)` since each object has a unique address and addresses have representations per *Representation of Types*. The size of the state machine is probably a lot larger than the naive 2^BITS_OF_RAM guess, but I claim it's still ultimately finite, and the finitude derives fully from *Representation of Types*. – R.. GitHub STOP HELPING ICE Dec 31 '13 at 00:06
  • @R: the old "proof that doors can not open by example of one's inability to imagine how to open a door". well. i don't know what to say, it's so ludicrous. maybe ... you're not *aware* that the standard library provides *inifnite streams*? in C they're called `stdin`, `stdout` and `stderr`. – Cheers and hth. - Alf Dec 31 '13 at 00:20
  • 1
    @Cheersandhth.-Alf: They are not infinite streams. They're arbitrary stdio FILE streams that, depending on the implementation, may be attached to the same sort of seekable files you obtain with `fopen`, interactive devices, or something else. In the cases where they're associated with a terminal or pipe, they are entirely one-way and provide *no storage at all* since the program can never read back what it wrote. – R.. GitHub STOP HELPING ICE Dec 31 '13 at 01:46
  • Your assertions do not stop any novice from writing a standard C program that accesses infinite storage. In your previous response you committed the fallacy of "example of doing something else than X, hence X is impossible" (not sure what it's called or if it was used enough in the old days to have a name). Now you have *repeated* that fallacy. Can't you turn to new fallacies. Please. – Cheers and hth. - Alf Dec 31 '13 at 02:20
  • 2
    @Cheersandhth.-Alf Infinite streams `stdin` and `stdout` make a terrible offhand example of an infinite ribbon, since a Turing Machine infinite ribbon is something that one can write to and read back the same thing from. One can, however, interface a C program with an infinite ribbon. Such a composite C program + ribbon would be an appropriate Turing Machine. This is not a very insightful statement to make, however, as a finite automaton interfaced with an infinite ribbon is a Turing Machine too. This does not tell us that a C program is more than a finite automaton. – Pascal Cuoq Dec 31 '13 at 02:53
  • @PascalCuoq: assessment of turing completeness always assumes "can have" about the infinite part. and so your insight that a turing machine is a finite automaton with infinite memory, while a C program is a finite automation which can have infinite memory, is, perhaps unintended, very much void of meaning, nonsense. did you mean to write something that you forgot to write? – Cheers and hth. - Alf Dec 31 '13 at 03:06
  • @PascalCuoq: sorry, your statement "a finite automaton interfaced with an infinite ribbon is a Turing Machine too" is incorrect *if it is assumed to be meaningful* (talking about turing completeness). only a small subset of possible finite automatons are turing complete. the smallest known such is very small indeed, with just a few states, but as I recall out of the 256 machines that could be generated with that particular pattern, only 2 were turing complete – Cheers and hth. - Alf Dec 31 '13 at 03:17
  • @R. And when you revenge-downvote, as it appears that you've done, please do state your downvote reason so that others can learn from it. – Cheers and hth. - Alf Dec 31 '13 at 03:27
  • @Cheersandhth.-Alf My statement that "a finite automaton interfaced with an infinite ribbon is a Turing Machine too" was an informal allusion to the definition of a Turing Machine as a finite action table together with an infinite ribbon. Any action table can be part of a Turing Machine. I am sorry if it hurts your intuition that not all thus defined Turing Machines are Turing-Complete, but I did not choose the definitions. – Pascal Cuoq Dec 31 '13 at 03:51
  • @PascalCuoq: you are misleading readers with your statement "I am sorry if it hurts your intuition that not all thus defined Turing Machines are Turing-Complete". Contrary to what you indicate, I have stated that "only a small subset of possible finite automatons are turing complete", in response to your indication that it was not. In addition to **misleading** readers you are effectively drowning things in **noise**. Following on the heels of a long sequence of fallacies and evasions by the "R" nick, what am I to conclude from your misleading no sense statements? – Cheers and hth. - Alf Dec 31 '13 at 03:58
  • -1 For trivially incorrect statement that "C is not a turing-equivalent language", and severe unwillingness to correct, including a stream of evasions and fallacies – Cheers and hth. - Alf Dec 31 '13 at 04:02
  • @Cheersandhth.-Alf: I did not downvote your answer; I didn't even vote on it at all, as it's rather orthogonal to the discussion we've had here. – R.. GitHub STOP HELPING ICE Dec 31 '13 at 04:10
0

If I understood you correctly: yes, checking whether a C or C++ program accesses an invalid pointer is equivalent to the halting problem (of a C or C++ program, in any case).

Suppose you had a tool that told you whether a program accessed an invalid pointer, and a program you wanted to check for halting. By adding extra information to each pointer you can make it checkable (at runtime) whether the pointer is valid or not; add such checks, with an infinite loop on failure. You now have a program with no invalid pointer accesses. By replacing all places the program can terminate with an invalid pointer access you get a program which has an invalid pointer access if and only if the original program terminates.

Anton Golov
  • 3,224
  • 1
  • 18
  • 39
  • I don't want to check if a program accesses an invalid pointer, I want to track the *flow* of a pointer value. It seems to me that tracking that flow from where a pointer value is created to all of its possible uses is equivalent to the Halting Problem because I can not track values statically through memory. – Jens Jan 01 '14 at 19:40
  • @Jens: Would this include finding all values? If so, then a similar argument works: assuming you can find all uses of each pointer in your program, add a new pointer, and make it used at every exit. If your detector finds the pointer has been used, your program terminates, otherwise it doesn't. – Anton Golov Jan 01 '14 at 22:14
  • Not *all* values, just select ones. But for those select ones I would want to find all uses. And that is hard considering that the value-flow of values in C/C++ is statically not completely trackable; I can only ever approximate. – Jens Jan 03 '14 at 02:03
  • Okay, well, the algorithm I described above would work for checking for halting, so yes, your problem is equivalent to the halting problem. – Anton Golov Jan 03 '14 at 02:06
0

Static analysis is almost always an approximation, often provable by reduction to the halting problem with programs like the one in Alf's answer. However, the approximation can err on the side of either false positives or false negatives.

  • A "conservative" static check will only have false negatives. It will never accept a "bad" program, but it will inevitably reject some "sufficiently complicated" good programs.
  • A "liberal" static check will have false positives. Sometimes it accepts a bad program by mistake but (generally) it will also accept all good programs.

Some examples:

  • Java's type system is conservative: a variable with a type T will always contain an instance of type T (or a subtype of T or null) at runtime no matter what.
  • GCC's option to warn about uninitialized variables is liberal: it doesn't find all potential uses of an uninitialized variable. Here's an example of a false positive program.
  • In contrast, Java does a conservative uninitialized variable check for local variables. It refuses to compile the program if it sees any potential execution path using a potentially uninitialized variable.

Liberal checks are often used by compilers to emit warnings and by external static analysis tools. Things like type systems and compiler optimizations tend to rely on conservative checks to be correct.

Many tasks have several reasonable conservative and liberal algorithms of varying accuracy. Alias analysis is certainly one of these.

For more information, see any good compiler textbook, such as the dragon book.

Community
  • 1
  • 1
mpartel
  • 4,320
  • 1
  • 22
  • 30