4

Sorry for the vagueness of this question's title, but I'm not sure how to ask this exactly.

The following code, when executed on an Arduino microprocessor (c++ compiled for an ATMega328 microprocessor) works fine. Return values shows in comments in the code:

// Return the index of the first semicolon in a string
int detectSemicolon(const char* str) {

    int i = 0;

    Serial.print("i = ");
    Serial.println(i); // prints "i = 0"

    while (i <= strlen(str)) {
        if (str[i] == ';') {
            Serial.print("Found at i = ");
            Serial.println(i); // prints "Found at i = 2"
            return i;
        }
        i++;
    }

    Serial.println("Error"); // Does not execute
    return -999;
}

void main() {
    Serial.begin(250000);
    Serial.println(detectSemicolon("TE;ST")); // Prints "2"
}

This outputs "2" as the position of the first semicolon, as expected.

However, if I change the first line of the detectSemicolon function to int i; i.e. without the explicit initialisation, I get problems. Specifically, the output is "i = 0" (good), "Found at i = 2" (good), "-999" (bad!).

So the function is returning -999 despite having executed the print statement immediately before a return 2; line and despite never executing the print statement immediately before the return -999; line.

Can someone help me to understand what's happening here? I understand that variables inside functions in c can theoretically contain any old junk unless they're initialised, but here I'm specifically checking in a print statement that this hasn't happened, and yet...


EDIT: Thanks to everyone who's chipped in, and particularly to underscore_d for their great answer. It seems like undefined behaviour is indeed causing the compiler to just skip anything involving i. Here's some of the assembly with the serial.prints within detectSemicolon commented out:

void setup() {
    Serial.begin(250000);
    Serial.println(detectSemicolon("TE;ST")); // Prints "2"
  d0:   4a e0           ldi r20, 0x0A   ; 10
  d2:   50 e0           ldi r21, 0x00   ; 0
  d4:   69 e1           ldi r22, 0x19   ; 25
  d6:   7c ef           ldi r23, 0xFC   ; 252
  d8:   82 e2           ldi r24, 0x22   ; 34
  da:   91 e0           ldi r25, 0x01   ; 1
  dc:   0c 94 3d 03     jmp 0x67a   ; 0x67a <_ZN5Print7printlnEii>

It looks like the compiler is actually completely disregarding the while loop and concluding that the output will always be "-999", and so it doesn't even bother with a call to the function, instead hard coding 0xFC19. I'll have another look with the serial.prints enabled so that the function still gets called, but this is a strong pointer I think.


EDIT 2:

For those who really care, here's a link to the disassembled code exactly as shown above (in the UB case):

https://justpaste.it/vwu8

If you look carefully, the compiler seems to be designating register 28 as the location of i and "initialising" it to zero in line d8. This register gets treated as if it contains i throughout in the while loops, if statements etc, which is why the code appears to work and the print statements output as expected (e.g. line 122 where "i" gets incremented).

However, when it comes to returning this pseudo-variable, this is a step too far for our tried and tried-upon compiler; it draws the line, and dumps us to the other return statement (line 120 jumps to line 132, loading "-999" into registers 24 and 25 before returning to main()).

Or at least, that's as far as I can get with my limited grasp of assembly. Moral of the story is weird stuff happens when your code's behaviour is undefined.

CharlieB
  • 626
  • 6
  • 14
  • It is good practice to only have one return point from a function. I would suggest refactoring the code. Also as others have pointed out never use an uninitialised variable. You don't necessarily have to initialise it where is it declared if you are sure it will be initialised elsewhere before it is used. – Realtime Rik Jul 01 '16 at 16:54
  • @RealtimeRik I've not heard many people complaining about multiple `return` points. Can you point me to an article by a trusted expert who advises this? I mean it seems like by far the cleanest way to exit here. What would you recommend instead? `goto`? – underscore_d Jul 01 '16 at 17:01
  • Err, not a goto. They have there uses but are few and far between. I single return point makes the code much more testable. I will try and find some reference for you. – Realtime Rik Jul 01 '16 at 17:04
  • 3
    @RealtimeRik I don't believe that's a universally accepted viewpoint. I find early-return code *much* more readable than code artifically stretched to a single return point. – Angew is no longer proud of SO Jul 01 '16 at 17:07
  • 1
    @RealtimeRik Please do link us to a reference, or at least summarise common situations in which you see multiple `return` points being used and what you would do instead. "good practice" has a connotation that it's a generally accepted rule-of-thumb, but I can't recall reading any specific recommendation in this vein. – underscore_d Jul 01 '16 at 17:16
  • It may not be universally accepted, but every company I worked for over the last 20 years enforce this, and so does the Misra standard. Personally I think it is good practice in most circumstances. – Realtime Rik Jul 01 '16 at 17:17
  • @RealtimeRik OK, I'll do it myself: http://stackoverflow.com/questions/36707/should-a-function-have-only-one-return-statement http://programmers.stackexchange.com/questions/118703/where-did-the-notion-of-one-return-only-come-from Suffice it to say this "good practice" attracts a lot of debate and that the sensible answer is probably somewhere in between, depending on context. For my relevant contexts, I'll keep my multiple return points so that I don't need to declare temporary variables for returning (relevant to RVO), awful arrow indentation for error checks, hacks with `while`/`break`, etc – underscore_d Jul 01 '16 at 17:31
  • Sorry, been busy. It does appear this is somewhat of a open discussion point with people on both sides of the fence. I tend to code to the MISRA standard due to work requirements and always prefer a single exit point where possible. I'm not going to argue about it though. Tabs vs Spaces anyone? – Realtime Rik Jul 01 '16 at 17:37
  • Actually, the final read will get the NULL char, assuming the string is properly terminated. This is a hangover from the non-simplified version of this code which also detects nulls, but I removed that to get a MWE of this bug going – CharlieB Jul 01 '16 at 18:08
  • @CharlieB Ah, of course. I use `std::string` so much that I never have to think about the null terminator! I'll remove my comment since you clearly know how indexing works in actual arrays :-) – underscore_d Jul 01 '16 at 18:09
  • #KeepingItRealWithMicroprocessors – CharlieB Jul 01 '16 at 18:10
  • 1
    @CharlieB Definitely! I'll get that Mega2560 out of the cupboard eventually, and when I do... it'll be glorious. Plus I've had some really fun forays into Z80 and 68000 land in recent years, need to think of another excuse to go back. :D – underscore_d Jul 01 '16 at 18:10
  • @CharlieB forgot to say, thanks for showing the disassembly. It's always good to have concise evidence of the totally counterintuitive things that can result from UB. Did you find out whether including the `print` makes things even weirder? – underscore_d Jul 02 '16 at 10:51
  • @underscore_d it did make it weirder, I had some trouble deciphering exactly what was going on in the full version. If you're interested I'll post the full disassembled version on Monday – CharlieB Jul 02 '16 at 11:53
  • @CharlieB Cool, let me know once you've edited! – underscore_d Jul 02 '16 at 12:05
  • 1
    @underscore_d Full version uploaded for your delectation. – CharlieB Jul 04 '16 at 10:48
  • @CharlieB Nice! "**Moral of the story is weird stuff happens when your code's behaviour is undefined.**" - an essential point, well demonstrated :-) – underscore_d Jul 04 '16 at 11:06

2 Answers2

10

Like all basic types of non-static storage duration, declaring but not defining an int does not cause default initialisation. It leaves the variable uninitialised. That does not mean i just holds a random value. It holds no (known, valid) value, and therefore you're not allowed to read it yet.

Here's the relevant quote from the C++11 Standard, via Angew in the comments. This wasn't a new restriction, nor has it changed since then:

C++11 4.1/1, talking about an lvalue-to-rvalue conversion (basically reading a variable's value): "If the object to which the glvalue refers is ... uninitialized, a program that necessitates this conversion has undefined behavior."

Any read of an unitialised variable causes undefined behaviour, and so anything can happen. Rather than your program continuing to function as expected using some unknown default value, compilers can make it do absolutely anything, because the behaviour is undefined, and the Standard imposes no requirements on what should happen in such a scenario.

In practical terms, that usually means an optimising compiler might simply remove any code that relies in any way on UB. There's no way to make a correct decision about what to do, so it's perfectly valid to decide to do nothing (which just happens also to be an optimisation for size and often speed). Or as commenters have mentioned, it might keep the code but replace attempts to read i with the nearest unrelated value to hand, or with different constants in different statements, or etc.

Printing a variable doesn't count as 'checking it' as you think, so that makes no difference. There is no way to 'check' an uninitialised variable and thereby to inoculate yourself against UB. The behaviour of reading the variable is only defined if the program has already written a specific value to it.

There is no point in us speculating on why particular arbitrary types of UB occur: you just need to fix your code so that it operates deterministically.

Why do you want to use it uninitialised anyway? Is this just 'academic'?

gsamaras
  • 66,800
  • 33
  • 152
  • 256
underscore_d
  • 5,331
  • 3
  • 29
  • 56
  • 1
    Having spotted the typo I've fixed my code now, this question is more of a post mortem to help me get a better idea of what goes on under the hood in a compiler. My understanding is that an int takes up a certain amount of space. Without initialising it, it'll contain random junk, but this should still represent some integer. That doesn't explain the odd behaviour I've seen though – CharlieB Jul 01 '16 at 16:37
  • 2
    @CharlieB See my edit. UB _precisely_ explains what's going on here. – underscore_d Jul 01 '16 at 16:39
  • I suppose my core question is "How can I call a print statement that returns a value and then immediately return the same value, but get a different answer?" It seems like this code is skipping around from line to line even after it's resolved the UB caused by `i`. – CharlieB Jul 01 '16 at 16:44
  • Or "How can I return -999 without executing the preceding statement?" – CharlieB Jul 01 '16 at 16:45
  • 2
    @CharlieB it *hasn't* 'resolved' the UB--the UB means the execution of the code has no behavioral guarantees. if you want a more precise answer of what exactly happened, post the disassembly of the compiled code :) – obataku Jul 01 '16 at 16:45
  • 1
    @CharlieB Just because you see `0` printed doesn't preclude the possibility that some junk data passed to `println` just happens to interpreted as `0` for display. – James Adkison Jul 01 '16 at 16:48
  • 1
    @CharlieB Again, I maintain that "There is no point in us speculating on why particular arbitrary types of UB occur". However, should you wish to understand this exact instance on your exact compiler with this exact wind direction... **oldrinb** is on the mark. I also recommend thinking about the idea mentioned just now by **James** and the interesting comment made by **Angew** on another answer, since although again _anything_ can happen, reality is often _slightly_ more predictable. – underscore_d Jul 01 '16 at 16:48
  • 1
    I know it's not a very satisfying answer but it's all we can say with certainty. giving disassembled code would let us try to figure out exactly what the compiler did. since the Arduino IDE uses a port of g++ internally (avr-g++), you should find out how to pass the -S flag to generate the resultant ARM assembly for your SSCCE – obataku Jul 01 '16 at 16:49
  • Only the *value* of an unitialised variable is undefined or more accurately perhaps *non-deterministic*. Simply reading such a variable does not cause *undefined behaviour* in the sense that "anything can happen", although acting on that value however may cause incorrect and *non-deterministic* behaviour - for example de-referencing an invalid pointer will cause undefined behaviour. – Clifford Jul 01 '16 at 20:02
  • @Clifford I'm all about nuance! Could you elaborate on these distinctions? OP seems now to have shown that any decision made based on the value of `i` is optimised away when `i` has an 'undefined' value. And this is a frequently reported pattern. So there's definitely UB here by my understanding. – underscore_d Jul 01 '16 at 20:32
  • @underscore_d : I am saying just that the *value* is undefined, not the *behaviour*. If the code reads and processes such a value it may not behave as intended, but that is not the same as *undefined behaviour* which is an attribute of the compiler not the code being compiled. – Clifford Jul 01 '16 at 22:22
  • @Clifford Well, yeah, UB is an attribute of written code, not a compiled program. Once the program is compiled, it behaves in a specific way, rather than capriciously altering its logic (but of course the observed result may change between runs). But when you say "value is undefined, not the behaviour", that sounds like many people's wrong idea, as the OP originally and other answers, of what happens with uninitialised variables: 'it has a random value and your program acts [as normal] on that random value'. We've seen that isn't true. Entire sections coded to use the variable have been elided – underscore_d Jul 01 '16 at 22:36
  • I appreciate that it is a somewhat pedantic point, but you asked me to elaborate, and appear to have made more of it than intended. It was intended as clarification rather than criticism. It is merely a linguistic semantic difference between "behavior" and "value", and "undefined" and "non-deterministic". – Clifford Jul 02 '16 at 08:01
  • 1
    @Clifford But this is not about linguistic semantics. "Undefined behaviour" is a defined term in the C++ standard, explicitly called out by numerous rules. One of them is C++11 4.1/1, talking about an lvalue-to-rvalue conversion (basically reading a variable's value): "If the object to which the glvalue refers is ... uninitialized, a program that necessitates this conversion has undefined behavior." So reading an uninitialised variable gives the program Undefined Behaviour, period. – Angew is no longer proud of SO Jul 02 '16 at 08:55
  • @Angew: I appreciate what you are saying, but "anything can happen" as a consequence is somewhat overstating the effect - rather "any value can result" perhaps? – Clifford Jul 02 '16 at 10:02
  • 1
    @Clifford **No**, that's the point. *Any behaviour* is allowed. The optimiser is allowed to assume that code exhibiting UB is never reached, for example. Please read this [LLVM blog](http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html), especially the [second part](http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_14.html). – Angew is no longer proud of SO Jul 02 '16 at 10:08
  • 1
    @Clifford Really, if `i` is uninitialised, the compiler could say: "Passing `i` to a function? I will just leave that register intact instead. Testing `i` in a conditional? I will just leave the flag register intact instead." – Angew is no longer proud of SO Jul 02 '16 at 10:10
0

When you don't initialise the variable it has a random value, whatever was in the memory address, so while (i <=strlen(str)) is going to behave unpredictably. You should always initialise.

(Visual Studio Debug configurations automatically initialise variables.)

Code Gorilla
  • 667
  • 8
  • 20
  • Cheers Matt, I realise that I need to initialise variables and this was indeed a typo. However I'm trying to understand what's going on here, out of interest. Surely the comparison should always return either true or false and the while loop behave accordingly, no matter the garbage in `i`? – CharlieB Jul 01 '16 at 16:35
  • 2
    @CharlieB No. The compiler is fully within its rights to say "I can see that `i` is never initialised, so I can assume it to have whatever value I consider most convenient." Such as not altering a register when passing `i` to a function, and assuming all conditionals involving `i` are false. (Note: I don't know if a compiler actually does that, but it would explain the behaviour you're seeing). – Angew is no longer proud of SO Jul 01 '16 at 16:36
  • 1
    It doesn't have a random value. It invokes undefined behaviour. The two are fundamentally different and under no obligation to produce similar results. People wanting random numbers use the stdlib's random number generator classes – underscore_d Jul 01 '16 at 16:55
  • Also, "(Visual Studio Debug configurations automatically initialise variables.)" - what, by acting as if the user had default-initialised them? That would alter code behaviour in a way that will lull debug users into a false sense of security, before throwing them to the wolves when they switch to release mode and turn on the optimiser, thereby giving UB a double mandate. Surely I'm misunderstanding? – underscore_d Jul 01 '16 at 16:59
  • 1
    @underscore_d No, you're not misunderstanding. That's what the default behaviour of VS debug builds is. – Angew is no longer proud of SO Jul 01 '16 at 17:09
  • @Angew Posts I found suggest it initialises the underlying memory, thus potentially affecting what happens to any UB read operations that happen to succeed... but surely it doesn't 're-define' the behaviour when in debug mode only? Sorry, I just find this incredible! Do you know where I can read about this? – underscore_d Jul 01 '16 at 17:11
  • @Angew This old article indicates that variables aren't default-initiailised in debug mode but the fact that they might be located in different regions of memory can lead to different behaviour of code that reads variables with UB, if it actually reads their physical (non-)value: http://www.codeproject.com/Articles/548/Surviving-the-Release-Version I'm aware of how comedically old it is and that since the 2005 version, memory regions for such variables are filled with deterministic patterns. I remain highly sceptical that debug mode 'officially' inits such vars and thus defines their behaviour – underscore_d Jul 01 '16 at 17:44