1645

Is if (a < 901) faster than if (a <= 900)?

Not exactly as in this simple example, but there are slight performance differences in more complex loop code. I suppose this has something to do with the generated machine code, if it's even true.

MC Emperor
snoopy
  • 162
    I see no reason why this question should be closed (and especially not deleted, as the votes are currently showing) given its historical significance, the quality of the answer, and the fact that the other top questions in [tag:performance] remain open. At most it should be locked. Also, even if the question itself is misinformed/naive, the fact that it appeared in a book means that the original misinformation exists out there in "credible" sources somewhere, and this question is therefore constructive in that it helps to clear that up. – Jason C Mar 22 '14 at 23:49
  • 33
    You never did tell us *which book* you're referring to. – Jonathon Reinhart Jul 24 '14 at 19:47
  • 176
    Typing ` – Deqing Apr 21 '16 at 00:19
  • 1
    This is an excellent question and would be interested in how it works involving an interpreted language such as Python. Would consider posting a new question such as "Is > faster than >= in Python?" but that could be considered a duplicate question. Guidance welcome. – Rick Henderson Jul 26 '16 at 15:12
  • 9
    It was true on the 8086. – Joshua Nov 15 '16 at 16:28
  • 9
    The number of upvotes clearly shows that there are hundreds of people who heavily overoptimize. – m93a Feb 17 '18 at 13:50
  • after reading the answers, I would say ' – jcazor Jun 13 '19 at 14:59
  • 3
    @JonathonReinhart, there has never been a book. That is a lie I've told when I was younger – snoopy Jan 20 '20 at 18:07
  • All of the answers seem to be about _C_, whereas the question is tagged _C++_ -- where there is simply no telling, what ` – Mikhail T. Jan 26 '21 at 23:01

14 Answers

1760

No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will typically be implemented in two machine instructions:

  • A test or cmp instruction, which sets EFLAGS
  • And a Jcc (jump) instruction, depending on the comparison type (and code layout):
    • jne - Jump if not equal --> ZF = 0
    • jz - Jump if zero (equal) --> ZF = 1
    • jg - Jump if greater --> ZF = 0 and SF = OF
    • (etc...)

Example (edited for brevity), compiled with $ gcc -m32 -S -masm=intel test.c:

    if (a < b) {
        // Do something 1
    }

Compiles to:

    mov     eax, DWORD PTR [esp+24]      ; a
    cmp     eax, DWORD PTR [esp+28]      ; b
    jge     .L2                          ; jump if a is >= b
    ; Do something 1
.L2:

And

    if (a <= b) {
        // Do something 2
    }

Compiles to:

    mov     eax, DWORD PTR [esp+24]      ; a
    cmp     eax, DWORD PTR [esp+28]      ; b
    jg      .L5                          ; jump if a is > b
    ; Do something 2
.L5:

So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.
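
The two snippets above came from a small test.c that isn't reproduced in full; a minimal sketch of such a file (hypothetical, not the author's exact source) could look like this:

    /* test.c -- minimal sketch (hypothetical). Build with:
     *   gcc -m32 -S -masm=intel test.c
     * and inspect the generated test.s. */
    #include <stdio.h>

    int main(void)
    {
        int a, b;
        if (scanf("%d %d", &a, &b) != 2)   /* read at runtime so the compares aren't folded away */
            return 1;

        if (a < b) {
            puts("Do something 1");
        }
        if (a <= b) {
            puts("Do something 2");
        }
        return 0;
    }

The generated .s file should show the same cmp in both cases, followed by jge for the first if and jg for the second.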


I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). They are grouped together in the same way in the Optimization Reference Manual, in Appendix C, "Latency and Throughput".

Latency — The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction.

Throughput — The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency.

The values for Jcc are:

      Latency   Throughput
Jcc     N/A        0.5

with the following footnote on Jcc:

7) Selection of conditional jump instructions should be based on the recommendation of Section 3.4.1, “Branch Prediction Optimization,” to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.

So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.

If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)
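
As a rough illustration of that point, the decision each jump makes can be written as a tiny boolean function of the flag bits (a toy model in C using the x86 flag names; real hardware wires this into gates):

    /* Toy model of the Jcc decision logic for signed comparisons (x86 flags).
     * Testing two flag bits is no more work per cycle than testing one. */
    int take_jge(int sf, int of)         { return sf == of;        }  /* signed >= */
    int take_jg (int zf, int sf, int of) { return !zf && sf == of; }  /* signed >  */
    int take_jl (int sf, int of)         { return sf != of;        }  /* signed <  */
    int take_jle(int zf, int sf, int of) { return zf || sf != of;  }  /* signed <= */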


Edit: Floating Point

This holds true for x87 floating point as well: (Pretty much the same code as above, but with double instead of int.)

        fld     QWORD PTR [esp+32]
        fld     QWORD PTR [esp+40]
        fucomip st, st(1)              ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
        fstp    st(0)
        seta    al                     ; Set al if above (CF=0 and ZF=0).
        test    al, al
        je      .L2
        ; Do something 1
.L2:

        fld     QWORD PTR [esp+32]
        fld     QWORD PTR [esp+40]
        fucomip st, st(1)              ; (same thing as above)
        fstp    st(0)
        setae   al                     ; Set al if above or equal (CF=0).
        test    al, al
        je      .L5
        ; Do something 2
.L5:
        leave
        ret
Jonathon Reinhart
  • 243
    @Dyppl actually `jg` and `jnle` are the same instruction, `7F` :-) – Jonathon Reinhart Aug 27 '12 at 05:30
  • @JonathonReinhart are you sure your example is not the other way around? I.e. isn't ` – maksimov Aug 27 '12 at 09:43
  • @maksimov it is probably correct, the asm code for `(a < b) ...` says: `jump if a >= b` which is equivalent to `do something if a < b`. – Ben Aug 27 '12 at 10:15
  • 21
    Not to mention that the optimizer can modify the code if indeed one option is faster than the other. – Elazar Leibovich Aug 28 '12 at 06:33
  • 3
    just because something results in the same amount of instructions doesn't necessarily mean that the sum total time of executing all those instructions will be the same. Actually more instructions could be executed faster. Instructions per cycle is not a fixed number, it varies depending on the instructions. – jontejj May 31 '13 at 16:51
  • 23
@jontejj I'm very much aware of that. Did you even *read* my answer? I didn't state anything about the same *number* of instructions, I stated that they are compiled to essentially the exact same *instructions*, except one jump instruction is looking at one flag, and the other jump instruction is looking at two flags. I believe I've given more than adequate evidence to show that they are semantically identical. – Jonathon Reinhart Jun 01 '13 at 05:22
  • Yeah, saw that now. I still think your first sentence leads someone to draw that conclusion for the wrong reasons. "You didn't specify, but on x86, all of the integral comparisons will be typically implemented in two machine instructions" this is actually not the main point that you should be making yet it's the first one you make. One would have to read your edited part furthest down to understand why. Otherwise your answer is top-notch! – jontejj Jun 03 '13 at 07:20
  • "If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)" I think this should be your main point. – jontejj Jun 03 '13 at 07:21
  • 2
    @jontejj You make a very good point. For as much visibility as this answer gets, I should probably give it a little bit of a cleanup. Thanks for the feedback. – Jonathon Reinhart Jun 03 '13 at 14:20
  • I'd just add that `cmp` sets the `FLAGS` register "in the same manner as the `sub` instruction". In fact, "The comparison is performed by subtracting the second operand from the first operand" - so carry / borrow propagation is involved. i.e., it is not a simple bit-wise operation in terms of hardware 'parallelism'. – Brett Hale Feb 03 '15 at 03:45
  • @Brett indeed, but the Jcc instruction tests bits which are already set. Your points are valid, but I don't see how it really applies to the question at hand. – Jonathon Reinhart Feb 03 '15 at 04:33
  • *"This holds true for x87 floating point as well"* Is that some new architecture I've never heard of? ;) – nyuszika7h Jul 08 '15 at 10:19
  • 1
    @JonathonReinhart: In x86, some instructions set some flags, but leave others unchanged (e.g. `inc/dec`). Current out-of-order-execution CPUs rename flag bits separately, so `inc` doesn't have an input dependency on the previous value of the flags. A `jcc` that depends on multiple flags set by more than one instruction requires an extra uop to merge the flags (or in earlier Intel designs, causes a partial-flags stall.) So every `jcc` is the same internally, but their different dependencies can be an issue. Things used to be worse before flag-renaming improved. – Peter Cordes Aug 08 '15 at 03:47
  • @JonathonReinhart: also, see http://agner.org/optimize/ for more detailed info than you get from Intel's own manuals. – Peter Cordes Aug 08 '15 at 03:48
  • Forgot to mention this last time, but not every JCC is the same. Some can macro-fuse with an immediately-preceding CMP or TEST instruction on Core2 and Nehalem. (And on Intel Sandybridge-family, [with many different ALU instructions](http://stackoverflow.com/a/31778403/224132).) AMD CPUs that can macro-fuse at all (Bulldozer-family) can do it for any JCC, even the weird ones like JP that Intel never macro-fuses with anything. – Peter Cordes Nov 25 '16 at 11:59
  • @PeterCordes Since I wrote this answer, I've taken a graduate level computer architecture class and gained a lot more understanding of the intricacies of pipelining and register renaming, etc. I'm still quite convinced that my answer (essentially just "*No.*") is correct, but I'm not quite sure what to add to my answer to make it correct from the standpoint of a modern out-of-order superscalar CPU. Perhaps the simple answer is "regardless of the underlying machinery, the hardware is capable of looking at multiple condition flags simultaneously". Any thoughts? – Jonathon Reinhart Mar 07 '17 at 14:33
  • Yeah, it's not the actual testing of multiple bits in EFLAGS that's ever the problem on x86. It's partial-flags renaming, since not all instructions write every flag, but CPUs try to avoid false dependencies by renaming different parts of EFLAGS separately. (This isn't an issue for < vs. <=). On Intel pre-Haswell, reading a flag that was left unmodified by the previous flag-writing instruction is slow. (Much slower on pre-Sandybridge, like you can see in this question: http://stackoverflow.com/questions/32084204/problems-with-adc-sbb-and-inc-dec-in-tight-loops-on-some-cpus.) – Peter Cordes Mar 08 '17 at 20:53
  • Anyway, my comments were only trying to correct the over-generalization that all JCCs are equal. They aren't, because some can macro-fuse and some can't, even when used after an instruction like CMP that writes all flags (avoiding any partial-flag renaming stalls or slowdowns). – Peter Cordes Mar 08 '17 at 20:56
  • I came here looking for a simple answer and found something that changed my life haha. – Sam Jul 08 '20 at 22:04
605

Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.

Comparison     Subtraction
----------     -----------
A < B      --> A - B < 0
A = B      --> A - B = 0
A > B      --> A - B > 0

Now, when A < B the subtraction has to borrow a high-bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero which implied equality.

There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.

Now, to get at the heart of the matter, let's expand the previous table to include the carry and zero bit results.

Comparison     Subtraction  Carry Bit  Zero Bit
----------     -----------  ---------  --------
A < B      --> A - B < 0    0          0
A = B      --> A - B = 0    1          1
A > B      --> A - B > 0    1          0

So, implementing a branch for A < B can be done in one instruction, because the carry bit is clear only in this case, that is,

;; Implementation of "if (A < B) goto address;"
cmp  A, B          ;; compare A to B
bcz  address       ;; Branch if Carry is Zero to the new address

But, if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.

;; Implementation of "if (A <= B) goto address;"
cmp A, B           ;; compare A to B
bcz address        ;; branch if A < B
bzs address        ;; also, Branch if the Zero bit is Set

So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speed and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.
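
Even on such machines, whenever one side of the comparison is a constant, a compiler (or careful assembly programmer) could usually dodge the extra instruction simply by adjusting the constant. A sketch in C terms (valid for integers as long as the adjusted constant doesn't overflow):

    /* On a machine with only branch-on-carry and branch-on-zero, a literal
     * "a <= 900" needs the extra zero-bit test, while the equivalent
     * "a < 901" needs only the carry-bit test. */
    int in_range_le(int a) { return a <= 900; }  /* literal form            */
    int in_range_lt(int a) { return a <  901; }  /* same result, one branch */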

Peter Mortensen
Lucas
  • 10
    Additionally, architectures like x86 implement instructions like `jge`, which test both the zero and sign/carry flags. – greyfade Aug 27 '12 at 18:23
  • 1
    It should be noted that in the computation loop of a process/function/program, an additional instruction could make a difference. More relevant than the speed is the fact, as @greyfade mentioned, that most modern CISC processors have jump/branch instructions that check both the carry and the zero flags, thus still only using a single instruction. – Ethan Reesor Aug 27 '12 at 19:47
  • 3
    Even if it is true for a given architecture. What are the odds that none of the compiler writers ever noticed, and added an optimisation to replace the slower with the faster? – Jon Hanna Aug 27 '12 at 21:50
  • 8
    This is true on the 8080. It has instructions to jump on zero and jump on minus, but none that can test both simultaneously. –  Aug 27 '12 at 22:43
  • 4
    This is also the case on the 6502 and 65816 processor family, which extends to the Motorola 68HC11/12, too. – Lucas Aug 27 '12 at 22:56
  • 2
    @JonHanna: This *is* the optimized version. For a loop, the branch-if-equal instruction is only encountered on the last iteration of the loop, so its impact is amortized down to some fraction of a cycle. Inverting the test would require placing an additional instruction into the inner loop which would impact *every* loop iteration. Also, it may not be possible to reverse the order of comparison, because these were typically accumulator architectures, and spilling the accumulator to memory would have been vastly more expensive than just adding the extra conditional branch instruction. – Lucas Aug 28 '12 at 01:48
  • 2
    Lucas: @Jon might mean `A < (B + 1)` optimization if B is a constant. – jfs Aug 28 '12 at 06:07
  • 1
    I like this answer because it remind me of the fun I had with a 6502 and how much I missed having to think about the flags once I moved to C. It also demonstrates that the question is deeper and more interesting than most people have given it credit. –  Aug 28 '12 at 17:20
  • 31
    Even on the 8080 a `<=` test can be implemented in _one_ instruction with swapping the operands and testing for `not =`) This is the desired `<=` with swapped operands: `cmp B,A; bcs addr`. That's the reasoning this test was omitted by Intel, they considered it redundant and you couldn't afford redundant instructions at those times :-) – Gunther Piez Aug 29 '12 at 11:10
  • 2
    I'm pretty sure some of these architectures are still in embedded use so even if they were born in the '80s they didn't necessarily die there. – hippietrail Aug 31 '12 at 17:48
  • 1
    @hirschhornsalz You are absolutely correct. I'm not convinced that there is any architecture and scenario where this double test would be required. – Jonathon Reinhart Jun 07 '13 at 05:44
  • @JonathonReinhart you're basically right. Even in the 80's a peephole optimizer would invert the comparison or reorder the `if/else` code branches to eliminate the additional test. But a naive compiler or inexperienced assembly language programmer might still produce such output. – Lucas Jun 07 '13 at 16:40
@Lucas: Actually, that's not true for 6502 (and 65816). 6502 has two branch comparison instructions of interest in this case: BCC and BCS. BCC works like >=, and BCS works like <. For example, `LDA` / `CMP` / `BCC label` implements one of them, and if you need the other comparisons you can simply swap the `CMP` arguments. – Konrad Borowski Oct 17 '13 at 15:32
  • @GlitchMr: Yes, the 6502 has a negated form of the carry flag test, but the point I was trying to make is the need for two separate instruction to test the carry flag (BCC/BCS) and the zero flag (BEQ/BNE) since the 6502 has no instruction for testing multiple values of the P register simultaneously. Having the BCC/BCS pair just makes it trivial to invert the comparison without having to change the value in the accumulator. – Lucas Oct 17 '13 at 17:52
  • 1
    @hirschhornsalz: Inverting the operands is a standard technique on the 8080, but various factors may make it better to evaluate a particular operand first. For example, given `static unsigned char x;`, the expression `x < 20` could be evaluated as `ld a,(x) / cmp 20 / jnc nope` but reversing the operands of `x > 20` would require something like `ld a,20 / ld hl,x / cmp (hl) / jnc nope`. Better would be to keep the order but substitute `x <= 21`: `ld a,(x) / cmp 21 / jc nope`. – supercat Oct 27 '17 at 15:56
  • Like supercat said, smart compilers can and do compile C++ comparisons into efficient asm using various tricks. If either operand is a compile-time constant, it can make asm that checks `x < 21` instead of `x <= 20`. Or on x86, maybe compilers choose to make constants smaller magnitude, so they'll fit in a signed 8-bit immediate instead of a 32-bit immediate. e.g. `x <= 127` instead of `x < 128`. But if both are runtime variables, `for( ... ; i < size ;)` is guaranteed not to be an infinite loop, but `i <= size` might be (for unsigned)! This can defeat optimizations. – Peter Cordes Jan 20 '19 at 10:20
95

Assuming we're talking about built-in integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.

If there was some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).
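
An easy way to convince yourself is to compile both spellings and diff the output; a minimal sketch (the file name and flags are just examples):

    /* cmp.c -- compile with e.g. "cc -O2 -S cmp.c" or paste into a compiler
     * explorer; typical compilers emit identical code for both functions. */
    int f_lt(int a) { return a <  901; }
    int f_le(int a) { return a <= 900; }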

David Schwartz
  • 6
    +1 I agree. Neither ` – autistic Jun 10 '13 at 02:52
  • There are still some edge cases where a comparison having one constant value could be slower under <=, e.g., when the transformation from `(a < C)` to `(a <= C-1)` (for some constant `C`) causes `C` to be more difficult to encode in the instruction set. For example, an instruction set may be able to represent signed constants from -127 to 128 in a compact form in comparisons, but constants outside that range have to loaded using either a longer, slower encoding, or another instruction entirely. So a comparison like `(a < -127)` may not have a straightforward transformation. – BeeOnRope Jun 16 '16 at 02:18
  • @BeeOnRope The issue was not whether performing operations that differed due to having different constants in them could affect performance but whether *expressing* the *same* operation using different constants could affect performance. So we're not comparing `a > 127` to `a > 128` because you have no choice there, you use the one you need. We're comparing `a > 127` to `a >= 128`, which can't require different encoding or different instructions because they have the same truth table. Any encoding of one is equally an encoding of the other. – David Schwartz Jun 16 '16 at 04:36
  • I was responding in a general way to your statement that "If there was some platform where [<= was slower] the compiler should always convert `<=` to ` 127` and `a >= 128` are equivalent and a compiler should encode both forms in the (same) fastest way, but that's not inconsistent with what I said. – BeeOnRope Jun 16 '16 at 20:36
67

I see that neither is faster. The compiler generates the same machine code for each condition, just with a different constant value.

if(a < 901)
cmpl  $900, -4(%rbp)
jg .L2

if(a <=901)
cmpl  $901, -4(%rbp)
jg .L3

My example is from GCC on the x86_64 platform on Linux.

Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.

I noticed that if it is not a constant, then the generated machine code differs only in the jump instruction (jge versus jg):

int b;
if(a < b)
cmpl  -4(%rbp), %eax
jge   .L2

if(a <=b)
cmpl  -4(%rbp), %eax
jg .L3
Peter Mortensen
Adrian Cornish
  • 10
    Note that this is specific to x86. – Michael Petrotta Aug 27 '12 at 02:17
  • Indeed - I should have said that - but any compiler could be smart enough to generate this code – Adrian Cornish Aug 27 '12 at 02:19
  • 10
    I think you should use that `if(a <=900)` to demonstrate that it generates exactly the same asm :) – Lipis Aug 27 '12 at 02:22
  • @Lipis Sorry - I do not understand your comment - could you clarify - I showed the asm generated from both `if` statements – Adrian Cornish Aug 27 '12 at 02:23
  • 2
    @AdrianCornish Sorry.. I edited it.. it's more or less the same.. but if you change the second if to <=900 then the asm code will be exactly the same :) It's pretty much the same now.. but you know.. for the OCD :) – Lipis Aug 27 '12 at 02:25
  • Ah I get it - sorry I missed the different value in the OP's original question - my point was that the compiler edited the value in the generated ASM. – Adrian Cornish Aug 27 '12 at 02:28
  • 1
    @AdrianCornish Your two statements aren't the same two statements as in the question. One of his has 900, not 901. – Qsario Aug 27 '12 at 02:28
  • @Qsario quite right - I missed that - the point still stands though that the compiler is editing the values – Adrian Cornish Aug 27 '12 at 02:29
  • 1
    What about `if (a <= INT_MAX)`? – Boann Aug 27 '12 at 02:30
  • You're correct, but it would be good to edit in the other original statement as well for completeness :) – Qsario Aug 27 '12 at 02:31
  • @AdrianCornish Yes you're totally right.. and we are on the same page :) I edited it.. hope you don't mind.. – Lipis Aug 27 '12 at 02:31
  • 3
    @Boann That might get reduced to if (true) and eliminated completely. – Qsario Aug 27 '12 at 02:32
  • 1
    @Qsario I think that muddies the point because in that case both asm statements become `cmpl $900, -4(%rbp)` so it is a little harder to see the difference. Since I am showing the asm from my code and not the OP's it is not wrong - but highlights the error in the book – Adrian Cornish Aug 27 '12 at 02:33
  • please consider the following: `typedef int a`, `typedef int b`, `a c = 1;` `b d = 2;` `if( c < d )` & `if( c <= d )` as c and d are different types – snoopy Aug 27 '12 at 02:33
  • I wanted to see the ASM generated code for it. To be honest there are a lot more of examples I would like to see ASM generated code, specially about `char`s – snoopy Aug 27 '12 at 02:35
  • @ViniyoShouta Try it for yourself - `g++ --save-temps myfile.cc` will give you the a `.s` file so you can read the asm for yourself :-) – Adrian Cornish Aug 27 '12 at 02:37
  • @Lipis A fair edit but I am glad you changed it back as I think it highlights the difference better. I get the OCD - whats why we are programmers :-) – Adrian Cornish Aug 27 '12 at 02:54
  • 5
    No one has pointed out that this optimization *only applies to constant comparisons*. I can guarantee it will *not* be done like this for comparing two variables. – Jonathon Reinhart Aug 27 '12 at 03:05
  • @JonathonReinhart Totally agree - but the OP's question was with constants. But I see the asm generated is the same - except that the LHS is moved to a register `cmpl -4(%rbp), %eax` – Adrian Cornish Aug 27 '12 at 03:06
  • 1
    @AdrianCornish you're not showing the whole picture. That's just the compare, which sets the flags, which is always the same. You still will have a differnt `Jcc` instruction depending on the conditional. See my example. – Jonathon Reinhart Aug 27 '12 at 06:37
  • @JonathonReinhart Good point. Edited to include the jump statements. – Adrian Cornish Aug 28 '12 at 00:13
  • BTW, gcc reduces the magnitude of immediates when it can, because for example on x86 an immediate from -128 .. 127 only needs 1 byte instead of 4. (There's no harm in just always applying the transformation for compile-time constants, except maybe on ARM where having all the set bits closer together is more likely to make it encodeable as an immediate... Would be interesting to try there with `x < 0x00f000` and see if it turned into `x <= 0x00efff`) – Peter Cordes Jan 20 '19 at 10:26
51

For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:

int compare_strict(double a, double b) { return a < b; }

On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.

Now consider this function instead:

int compare_loose(double a, double b) { return a <= b; }

This requires the same work as compare_strict above, but now there's two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.

You might think that the compiler could optimize the second function like so:

int compare_loose(double a, double b) { return ! (a > b); }

However, this will handle NaNs incorrectly: NaN1 <= NaN2 and NaN1 > NaN2 both need to evaluate to false.
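
A tiny program makes the NaN point concrete (a sketch; any IEEE 754 platform should behave the same way):

    /* NaN breaks the "a <= b  is the same as  !(a > b)" rewrite:
     * both comparisons are false when either operand is NaN. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = NAN, b = 1.0;
        printf("a <= b   : %d\n", a <= b);    /* 0 */
        printf("a >  b   : %d\n", a >  b);    /* 0 */
        printf("!(a > b) : %d\n", !(a > b));  /* 1 -- differs from a <= b */
        return 0;
    }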

ridiculous_fish
  • Luckily it doesn't work like this on x86 (x87). `fucomip` sets ZF and CF. – Jonathon Reinhart Aug 27 '12 at 20:30
  • 4
    @JonathonReinhart: I think you're misunderstanding what the PowerPC is doing -- the condition register `cr` *is* the equivalent to flags like `ZF` and `CF` on the x86. (Although the CR is more flexible.) What the poster is talking about is moving the result to a GPR: which takes two instructions on PowerPC, but x86 has a conditional move instruction. – Dietrich Epp Aug 28 '12 at 06:19
  • @DietrichEpp What I meant to add after my statement was: Which you can then immediately jump based upon the value of EFLAGS. Sorry for not being clear. – Jonathon Reinhart Aug 28 '12 at 07:16
  • 1
    @JonathonReinhart: Yes, and you can also immediately jump based on the value of the CR. The answer is not talking about jumping, which is where the extra instructions come from. – Dietrich Epp Aug 28 '12 at 07:38
34

Maybe the author of that unnamed book has read that a > 0 runs faster than a >= 1 and thinks that is true universally.

But that is because a 0 is involved (because CMP can, depending on the architecture, be replaced with e.g. OR) and not because of the <.
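
If you want to check that claim on your own compiler, something like this (compiled with optimization, e.g. cc -O2 -S) typically produces identical code for both functions, because the optimizer rewrites one form into the other:

    /* The apparent advantage of "a > 0" comes from comparing against zero
     * (which allows TEST/OR-style tricks), not from < vs. <= themselves. */
    int positive_strict(int a) { return a > 0;  }
    int positive_loose (int a) { return a >= 1; }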

glglgl
  • 1
    Sure, in a "debug" build, but it would take a bad compiler for `(a >= 1)` to run slower than `(a > 0)`, since the former can be trivially transformed to the latter by the optimizer.. – BeeOnRope Jun 16 '16 at 02:22
  • 2
    @BeeOnRope Sometimes I am surprised what complicated things an optimizer can optimize and on what easy things it fails to do so. – glglgl Jun 16 '16 at 07:31
  • 1
    Indeed, and it's always worth checking the asm output for the very few functions where it would matter. That said, the above transformation is very basic and has been performed even in simple compilers for decades. – BeeOnRope Jun 16 '16 at 20:27
31

At the very least, if this were true a compiler could trivially optimise a <= b to !(a > b), and so even if the comparison itself were actually slower, with all but the most naive compiler you would not notice a difference.
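
For integer operands the two forms really are interchangeable, so the compiler can pick whichever maps to the cheaper branch. A sketch (note that this particular rewrite is not valid for floating point because of NaN, as another answer explains):

    /* Identical truth tables for integer types, so either form may be emitted. */
    int le_direct   (int a, int b) { return a <= b;   }
    int le_rewritten(int a, int b) { return !(a > b); }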

Eliot Ball
15

They have the same speed. Maybe on some special architecture what he/she said is right, but in the x86 family at least I know they are the same. To do this, the CPU performs a subtraction (a - b) and then checks the flags of the flag register. Two bits of that register are called ZF (zero flag) and SF (sign flag), and the check is done in one cycle, with a single mask operation.

Peter Mortensen
Masoud
14

This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.

That would be pretty unusual though, as the compiler could work around it, making it irrelevant.

Telgin
  • 1
    IF there was a difference in the cyles. 1) it would not be detectable. 2) Any compiler worth its salt would already be making the transformation from the slow form to the faster form without changing the meaning of the code. So the resulting instruction planted would be identical. – Martin York Aug 27 '12 at 07:00
  • Agreed completely, it would be a pretty trivial and silly difference in any case. Certainly nothing to mention in a book that should be platform agnostic. – Telgin Aug 28 '12 at 03:46
  • @lttlrck: I get it. Took me a while (silly me). No they are not detectable because there are so many other things happening that make their measurement imposable. Processor stalls/ cache misses/ signals/ process swapping. Thus in a normal OS situation things on the single cycle level can not be physically measurable. If you can eliminate all that interference from you measurements (run it on a chip with on-board memory and no OS) then you still have granularity of your timers to worry about but theoretically if you run it long enough you could see something. – Martin York Aug 29 '12 at 06:57
13

TL;DR answer

For most combinations of architecture, compiler and language, < will not be faster than <=.

Full answer

Other answers have concentrated on x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture specific, and is as likely to be an anti-optimisation as it is to be an optimisation.

As such, I would suggest that this sort of micro-optimisation is an example of cargo cult programming rather than best software engineering practice.

Counterexample

There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.

Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that in practice, no comparison had any advantage over any other. Worst case though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single byte instruction which took a single cycle to run, so had the smallest overhead possible.

Summary

Whether a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture-specific micro-optimisations; otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.

Mark Booth
6

You should not be able to notice the difference even if there is any. Besides, in practice you would have to do an additional a + 1 or a - 1 to make the condition hold, unless you're going to use some magic constants, which is a very bad practice by all means.
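
As a concrete illustration of why that manual rewrite is a bad idea with runtime values, here is a sketch with unsigned operands (the signed case is even worse, since the adjustment can overflow into undefined behaviour):

    /* "Optimizing" x <= y into x < y + 1 by hand silently breaks at the edge:
     * when y == UINT_MAX, y + 1 wraps to 0 and the rewritten test is always false. */
    int le_direct   (unsigned x, unsigned y) { return x <= y;     }
    int le_rewritten(unsigned x, unsigned y) { return x <  y + 1; }  /* wrong for y == UINT_MAX */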

shinkou
  • 1
    What's the bad practice? Incrementing or decrementing a counter? How do you store index notation then? – jcolebrand Aug 27 '12 at 14:22
  • 5
    He means if you're doing comparison of 2 variable types. Of course it's trivial if you're setting the value for a loop or something. But if you have x <= y, and y is unknown, it would be slower to 'optimize' it to x < y + 1 – JustinDanielson Aug 27 '12 at 21:48
  • @JustinDanielson agreed. Not to mention ugly, confusing, etc. – Jonathon Reinhart Aug 27 '12 at 23:49
4

You could say that line is correct in most scripting languages, since the extra character results in slightly slower code processing. However, as the top answer pointed out, it should have no effect in C++, and anything being done with a scripting language probably isn't that concerned about optimization.

Ecksters
  • I somewhat disagree. In competitive programming, scripting languages often offer the quickest solution to a problem, but correct techniques (read: optimization) must be applied to get a correct solution. – Tyler Crompton Sep 05 '12 at 00:59
3

When I wrote the first version of this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operands have a shorter 1-byte encoding for -128..127.

For ARM, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp r0, #0x00f000 would be encodeable, while cmp r0, #0x00efff would not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for ARM. AArch64 is either shift-by-12 or not, instead of an arbitrary rotation, for instructions like cmp and cmn, unlike 32-bit ARM and Thumb modes.
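
A quick way to see that effect is to compile both spellings for 32-bit ARM and check whether the constant had to be materialised separately (a sketch; the toolchain and flags below are only examples):

    /* Hypothetical experiment, e.g. "arm-none-eabi-gcc -O2 -S immediates.c":
     * 0x00f000 fits ARM's rotated-8-bit immediate encoding, 0x00efff does not,
     * so the compiler may keep the <-form rather than shrink the constant. */
    int below_lt(unsigned x) { return x <  0x00f000; }
    int below_le(unsigned x) { return x <= 0x00efff; }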


< vs. <= in general, including for runtime-variable conditions

In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.

But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.

As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.

Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)

void foo(unsigned size) {
    unsigned upper_bound = size - 1;  // or any calculation that could produce UINT_MAX
    for(unsigned i=0 ; i <= upper_bound ; i++)
        ...
}

Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.

(A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)

In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.

Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).

But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (See Iterating over all unsigned integers in a for loop: C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)

With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.

Example: sum of integers from 1 to n

Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.

unsigned sum_1_to_n_finite(unsigned n) {
    unsigned total = 0;
    for (unsigned i = 0 ; i < n+1 ; ++i)
        total += i;
    return total;
}

x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer

 # clang7.0 -O3 closed-form
    cmp     edi, -1       # n passed in EDI: x86-64 System V calling convention
    je      .LBB1_1       # if (n == UINT_MAX) return 0;  // C++ loop runs 0 times
          # else fall through into the closed-form calc
    mov     ecx, edi         # zero-extend n into RCX
    lea     eax, [rdi - 1]   # n-1
    imul    rax, rcx         # n * (n-1)             # 64-bit
    shr     rax              # n * (n-1) / 2
    add     eax, edi         # n + (stuff / 2) = n * (n+1) / 2   # truncated to 32-bit
    ret          # computed without possible overflow of the product before right shifting
.LBB1_1:
    xor     eax, eax
    ret

But for the naive version, we just get a dumb loop from clang.

unsigned sum_1_to_n_naive(unsigned n) {
    unsigned total = 0;
    for (unsigned i = 0 ; i<=n ; ++i)
        total += i;
    return total;
}
# clang7.0 -O3
sum_1_to_n(unsigned int):
    xor     ecx, ecx           # i = 0
    xor     eax, eax           # retval = 0
.LBB0_1:                       # do {
    add     eax, ecx             # retval += i
    add     ecx, 1               # ++i
    cmp     ecx, edi
    jbe     .LBB0_1            # } while( i<=n );
    ret

GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.

# "naive" inner loop
.L3:
    add     eax, 1       # do {
    paddd   xmm0, xmm1    # vect_total_4.6, vect_vec_iv_.5
    paddd   xmm1, xmm2    # vect_vec_iv_.5, tmp114
    cmp     edx, eax      # bnd.1, ivtmp.14     # bound and induction-variable tmp, I think.
    ja      .L3 #,       # }while( n > i )

 "finite" inner loop
  # before the loop:
  # xmm0 = 0 = totals
  # xmm1 = {0,1,2,3} = i
  # xmm2 = set1_epi32(4)
 .L13:                # do {
    add     eax, 1       # i++
    paddd   xmm0, xmm1    # total[0..3] += i[0..3]
    paddd   xmm1, xmm2    # i[0..3] += 4
    cmp     eax, edx
    jne     .L13      # }while( i != upper_limit );

     then horizontal sum xmm0
     and peeled cleanup for the last n%4 iterations, or something.
     

It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.

BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1/jnz instead of add eax,1/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.

Peter Cordes
  • Nice contrived example. What about your other comment about a potential effect on out of order execution due to EFLAGS use? Is it purely theoretical or can it actually happen that a JB leads to a better pipeline than a JBE? – rustyx Jan 20 '19 at 12:05
  • @rustyx: did I comment that somewhere under another answer? Compilers aren't going to emit code that causes partial-flag stalls, and certainly not for a C ` – Peter Cordes Jan 20 '19 at 12:23
  • As I understand it, the available immediates for `cmp` on AArch64 are simpler than your answer makes it sound: it takes a 12-bit immediate optionally shifted by 12 bits, so you can have `0xYYY` or `0xYYY000`, and you can also effectively negate the immediate by using `cmn` instead. This still supports your point, as `cmp w0, #0xf000` is encodeable and `cmp w0, #0xefff` is not. But the "rotate into any position" phrasing sounds more like a description of the "bitmask" immediates, which AFAIK are only available for bitwise logical instructions: `and, or, eor`, etc. – Nate Eldredge Nov 02 '20 at 06:10
  • @NateEldredge: I think my description fits perfectly for ARM mode, where it's an 8-bit field rotated by a multiple of 2. (so `0x1fe` isn't encodeable but `0xff0` is.) When I wrote this I hadn't understood the differences between AArch64 and ARM immediates, or that only bitwise boolean insns could use the bit-range / repeated bit-pattern encoding. (And `mov`; `or` with the zero reg is one way to take advantage of those encodings.) – Peter Cordes Nov 02 '20 at 07:07
3

Only if the people who created the computers are bad with boolean logic. Which they shouldn't be.

Every comparison (>=, <=, >, <) can be done at the same speed.

A comparison is just a subtraction (taking the difference) and checking the sign of the result.
(If the MSB is set, the number is negative.)

How to check a >= b? Compute a - b and check that it is not negative (a - b >= 0).
How to check a <= b? Compute b - a and check that it is not negative (b - a >= 0).
How to check a < b? Compute a - b and check that it is negative (a - b < 0).
How to check a > b? Compute b - a and check that it is negative (b - a < 0).

Simply put, the computer can just do this underneath the hood for the given op:

a >= b == msb(a-b)==0
a <= b == msb(b-a)==0
a > b == msb(b-a)==1
a < b == msb(a-b)==1

And of course the computer wouldn't actually need to do the ==0 or ==1 either;
for the ==0 case it could just invert the MSB coming out of the circuit.

Anyway, they most certainly wouldn't have made a >= b be calculated as a>b || a==b lol
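
A toy version of that idea in C, deliberately ignoring overflow (which real hardware folds in via the overflow flag), so treat it as an illustration only:

    /* Toy model: the sign bit of the difference decides the comparison.
     * Correct only when a - b does not overflow; real CPUs combine the sign
     * and overflow flags to cover that case. */
    #include <stdint.h>

    static int msb(uint32_t v) { return (int)(v >> 31); }

    int ge(int32_t a, int32_t b) { return msb((uint32_t)a - (uint32_t)b) == 0; }
    int lt(int32_t a, int32_t b) { return msb((uint32_t)a - (uint32_t)b) == 1; }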

Puddle
  • It's not that simple, though. For instance, if `a` is in a register and `b` is a compile-time constant, then x86 can compute `a-b` in one instruction (`sub rax, 12345` or `cmp`) but not `b-a`. There is an instruction for `reg - imm` but not the other way around. Many other machines have a similar situation. – Nate Eldredge Nov 02 '20 at 16:55