Is < faster than <=?












1437















I'm reading a book where the author says that if( a < 901 ) is faster than if( a <= 900 ).



Not exactly as in this simple example, but there are slight performance changes on loop complex code. I suppose this has to do something with generated machine code in case it's even true.










share|improve this question




















  • 126





    I see no reason why this question should be closed (and especially not deleted, as the votes are currently showing) given its historical significance, the quality of the answer, and the fact that the other top questions in performance remain open. At most it should be locked. Also, even if the question itself is misinformed/naive, the fact that it appeared in a book means that the original misinformation exists out there in "credible" sources somewhere, and this question is therefore constructive in that it helps to clear that up.

    – Jason C
    Mar 22 '14 at 23:49








  • 29





    You never did tell us which book you're referring to.

    – Jonathon Reinhart
    Jul 24 '14 at 19:47








  • 117





    Typing < is two times faster than typing <=.

    – Deqing
    Apr 21 '16 at 0:19






  • 3





    It was true on the 8086.

    – Joshua
    Nov 15 '16 at 16:28






  • 5





    The number of upvotes clearly shows that there are hundreds of people who heavily overoptimize.

    – m93a
    Feb 17 '18 at 13:50
















1437















I'm reading a book where the author says that if( a < 901 ) is faster than if( a <= 900 ).



Not exactly as in this simple example, but there are slight performance changes on loop complex code. I suppose this has to do something with generated machine code in case it's even true.










share|improve this question




















  • 126





    I see no reason why this question should be closed (and especially not deleted, as the votes are currently showing) given its historical significance, the quality of the answer, and the fact that the other top questions in performance remain open. At most it should be locked. Also, even if the question itself is misinformed/naive, the fact that it appeared in a book means that the original misinformation exists out there in "credible" sources somewhere, and this question is therefore constructive in that it helps to clear that up.

    – Jason C
    Mar 22 '14 at 23:49








  • 29





    You never did tell us which book you're referring to.

    – Jonathon Reinhart
    Jul 24 '14 at 19:47








  • 117





    Typing < is two times faster than typing <=.

    – Deqing
    Apr 21 '16 at 0:19






  • 3





    It was true on the 8086.

    – Joshua
    Nov 15 '16 at 16:28






  • 5





    The number of upvotes clearly shows that there are hundreds of people who heavily overoptimize.

    – m93a
    Feb 17 '18 at 13:50














1437












1437








1437


314






I'm reading a book where the author says that if( a < 901 ) is faster than if( a <= 900 ).



Not exactly as in this simple example, but there are slight performance changes on loop complex code. I suppose this has to do something with generated machine code in case it's even true.










share|improve this question
















I'm reading a book where the author says that if( a < 901 ) is faster than if( a <= 900 ).



Not exactly as in this simple example, but there are slight performance changes on loop complex code. I suppose this has to do something with generated machine code in case it's even true.







c++ performance assembly relational-operators






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Dec 28 '16 at 15:26









ROMANIA_engineer

33.6k19150141




33.6k19150141










asked Aug 27 '12 at 2:10









Vinícius Magalhães HortaVinícius Magalhães Horta

6,49731841




6,49731841








  • 126





    I see no reason why this question should be closed (and especially not deleted, as the votes are currently showing) given its historical significance, the quality of the answer, and the fact that the other top questions in performance remain open. At most it should be locked. Also, even if the question itself is misinformed/naive, the fact that it appeared in a book means that the original misinformation exists out there in "credible" sources somewhere, and this question is therefore constructive in that it helps to clear that up.

    – Jason C
    Mar 22 '14 at 23:49








  • 29





    You never did tell us which book you're referring to.

    – Jonathon Reinhart
    Jul 24 '14 at 19:47








  • 117





    Typing < is two times faster than typing <=.

    – Deqing
    Apr 21 '16 at 0:19






  • 3





    It was true on the 8086.

    – Joshua
    Nov 15 '16 at 16:28






  • 5





    The number of upvotes clearly shows that there are hundreds of people who heavily overoptimize.

    – m93a
    Feb 17 '18 at 13:50














  • 126





    I see no reason why this question should be closed (and especially not deleted, as the votes are currently showing) given its historical significance, the quality of the answer, and the fact that the other top questions in performance remain open. At most it should be locked. Also, even if the question itself is misinformed/naive, the fact that it appeared in a book means that the original misinformation exists out there in "credible" sources somewhere, and this question is therefore constructive in that it helps to clear that up.

    – Jason C
    Mar 22 '14 at 23:49








  • 29





    You never did tell us which book you're referring to.

    – Jonathon Reinhart
    Jul 24 '14 at 19:47








  • 117





    Typing < is two times faster than typing <=.

    – Deqing
    Apr 21 '16 at 0:19






  • 3





    It was true on the 8086.

    – Joshua
    Nov 15 '16 at 16:28






  • 5





    The number of upvotes clearly shows that there are hundreds of people who heavily overoptimize.

    – m93a
    Feb 17 '18 at 13:50








126




126





I see no reason why this question should be closed (and especially not deleted, as the votes are currently showing) given its historical significance, the quality of the answer, and the fact that the other top questions in performance remain open. At most it should be locked. Also, even if the question itself is misinformed/naive, the fact that it appeared in a book means that the original misinformation exists out there in "credible" sources somewhere, and this question is therefore constructive in that it helps to clear that up.

– Jason C
Mar 22 '14 at 23:49







I see no reason why this question should be closed (and especially not deleted, as the votes are currently showing) given its historical significance, the quality of the answer, and the fact that the other top questions in performance remain open. At most it should be locked. Also, even if the question itself is misinformed/naive, the fact that it appeared in a book means that the original misinformation exists out there in "credible" sources somewhere, and this question is therefore constructive in that it helps to clear that up.

– Jason C
Mar 22 '14 at 23:49






29




29





You never did tell us which book you're referring to.

– Jonathon Reinhart
Jul 24 '14 at 19:47







You never did tell us which book you're referring to.

– Jonathon Reinhart
Jul 24 '14 at 19:47






117




117





Typing < is two times faster than typing <=.

– Deqing
Apr 21 '16 at 0:19





Typing < is two times faster than typing <=.

– Deqing
Apr 21 '16 at 0:19




3




3





It was true on the 8086.

– Joshua
Nov 15 '16 at 16:28





It was true on the 8086.

– Joshua
Nov 15 '16 at 16:28




5




5





The number of upvotes clearly shows that there are hundreds of people who heavily overoptimize.

– m93a
Feb 17 '18 at 13:50





The number of upvotes clearly shows that there are hundreds of people who heavily overoptimize.

– m93a
Feb 17 '18 at 13:50












13 Answers
13






active

oldest

votes


















1611














No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will be typically implemented in two machine instructions:




  • A test or cmp instruction, which sets EFLAGS

  • And a Jcc (jump) instruction, depending on the comparison type (and code layout):


    • jne - Jump if not equal --> ZF = 0


    • jz - Jump if zero (equal) --> ZF = 1


    • jg - Jump if greater --> ZF = 0 and SF = OF

    • (etc...)






Example (Edited for brevity) Compiled with $ gcc -m32 -S -masm=intel test.c



    if (a < b) {
// Do something 1
}


Compiles to:



    mov     eax, DWORD PTR [esp+24]      ; a
cmp eax, DWORD PTR [esp+28] ; b
jge .L2 ; jump if a is >= b
; Do something 1
.L2:


And



    if (a <= b) {
// Do something 2
}


Compiles to:



    mov     eax, DWORD PTR [esp+24]      ; a
cmp eax, DWORD PTR [esp+28] ; b
jg .L5 ; jump if a is > b
; Do something 2
.L5:


So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.





I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping is made together under the Optimization Reference Manual, in Appendix C. Latency and Throughput.




Latency — The number of clock cycles that are required for the
execution core to complete the execution of all of the μops that form
an instruction.



Throughput — The number of clock cycles required to
wait before the issue ports are free to accept the same instruction
again. For many instructions, the throughput of an instruction can be
significantly less than its latency




The values for Jcc are:



      Latency   Throughput
Jcc N/A 0.5


with the following footnote on Jcc:




7) Selection of conditional jump instructions should be based on the recommendation of section Section 3.4.1, “Branch Prediction Optimization,” to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.




So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.



If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)





Edit: Floating Point



This holds true for x87 floating point as well: (Pretty much same code as above, but with double instead of int.)



        fld     QWORD PTR [esp+32]
fld QWORD PTR [esp+40]
fucomip st, st(1) ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
fstp st(0)
seta al ; Set al if above (CF=0 and ZF=0).
test al, al
je .L2
; Do something 1
.L2:

fld QWORD PTR [esp+32]
fld QWORD PTR [esp+40]
fucomip st, st(1) ; (same thing as above)
fstp st(0)
setae al ; Set al if above or equal (CF=0).
test al, al
je .L5
; Do something 2
.L5:
leave
ret





share|improve this answer





















  • 228





    @Dyppl actually jg and jnle are the same instruction, 7F :-)

    – Jonathon Reinhart
    Aug 27 '12 at 5:30






  • 12





    Not to mention that the optimizer can modify the code if indeed one option is faster than the other.

    – Elazar Leibovich
    Aug 28 '12 at 6:33






  • 1





    just because something results in the same amount of instructions doesn't necessarily mean that the sum total time of executing all those instructions will be the same. Actually more instructions could be executed faster. Instructions per cycle is not a fixed number, it varies depending on the instructions.

    – jontejj
    May 31 '13 at 16:51






  • 16





    @jontejj I'm very much aware of that. Did you even read my answer? I didn't state anything about the same number of instructions, I stated that they are compiled to essentially the exact same instrutcions, except one jump instruction is looking at one flag, and the other jump instruction is looking at two flags. I believe I've given more than adequate evidence to show that they are semantically identical.

    – Jonathon Reinhart
    Jun 1 '13 at 5:22






  • 1





    @jontejj You make a very good point. For as much visibility as this answer gets, I should probably give it a little bit of a cleanup. Thanks for the feedback.

    – Jonathon Reinhart
    Jun 3 '13 at 14:20



















575














Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.



Comparison     Subtraction
---------- -----------
A < B --> A - B < 0
A = B --> A - B = 0
A > B --> A - B > 0


Now, when A < B the subtraction has to borrow a high-bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero which implied equality.



There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.



Now, to get at the heart of the matter, let's expand the previous table to include the carry and zero bit results.



Comparison     Subtraction  Carry Bit  Zero Bit
---------- ----------- --------- --------
A < B --> A - B < 0 0 0
A = B --> A - B = 0 1 1
A > B --> A - B > 0 1 0


So, implementing a branch for A < B can be done in one instruction, because the carry bit is clear only in this case, , that is,



;; Implementation of "if (A < B) goto address;"
cmp A, B ;; compare A to B
bcz address ;; Branch if Carry is Zero to the new address


But, if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.



;; Implementation of "if (A <= B) goto address;"
cmp A, B ;; compare A to B
bcz address ;; branch if A < B
bzs address ;; also, Branch if the Zero bit is Set


So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speed and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.






share|improve this answer





















  • 10





    Additionally, architectures like x86 implement instructions like jge, which test both the zero and sign/carry flags.

    – greyfade
    Aug 27 '12 at 18:23








  • 130





    +1 for the historical perspective.

    – sbi
    Aug 27 '12 at 19:09






  • 7





    This is true on the 8080. It has instructions to jump on zero and jump on minus, but none that can test both simultaneously.

    – user597225
    Aug 27 '12 at 22:43








  • 41





    +1 This is the only answer that explains why the author might have written what he did.

    – Leo
    Aug 28 '12 at 7:25






  • 26





    Even on the 8080 a <= test can be implemented in one instruction with swapping the operands and testing for not < (equivalent to >=) This is the desired <= with swapped operands: cmp B,A; bcs addr. That's the reasoning this test was omitted by Intel, they considered it redundant and you couldn't afford redundant instructions at those times :-)

    – Gunther Piez
    Aug 29 '12 at 11:10





















85














Assuming we're talking about internal integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.



If there was some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).






share|improve this answer





















  • 6





    +1 I agree. Neither < nor <= have speed until the compiler decides which speed they'll have. This is a very simple optimisation for compilers when you consider that they generally already perform dead code optimisation, tail call optimisation, loop hoisting (and unrolling, on occasions), automatic parallelisation of various loops, etc... Why waste time pondering premature optimisations? Get a prototype running, profile it to determine where the most significant optimisations lie, perform those optimisations in order of significance and profile again along the way to measure progress...

    – autistic
    Jun 10 '13 at 2:52











  • There are still some edge cases where a comparison having one constant value could be slower under <=, e.g., when the transformation from (a < C) to (a <= C-1) (for some constant C) causes C to be more difficult to encode in the instruction set. For example, an instruction set may be able to represent signed constants from -127 to 128 in a compact form in comparisons, but constants outside that range have to loaded using either a longer, slower encoding, or another instruction entirely. So a comparison like (a < -127) may not have a straightforward transformation.

    – BeeOnRope
    Jun 16 '16 at 2:18













  • @BeeOnRope The issue was not whether performing operations that differed due to having different constants in them could affect performance but whether expressing the same operation using different constants could affect performance. So we're not comparing a > 127 to a > 128 because you have no choice there, you use the one you need. We're comparing a > 127 to a >= 128, which can't require different encoding or different instructions because they have the same truth table. Any encoding of one is equally an encoding of the other.

    – David Schwartz
    Jun 16 '16 at 4:36













  • I was responding in a general way to your statement that "If there was some platform where [<= was slower] the compiler should always convert <= to < for constants". As far as I know, that transformation involves changing the constant. E.g., a <= 42 is compiled as a < 43 because < is faster. In some edge cases, such a transformation wouldn't be fruitful because the new constant may require more or slower instructions. Of course a > 127 and a >= 128 are equivalent and a compiler should encode both forms in the (same) fastest way, but that's not inconsistent with what I said.

    – BeeOnRope
    Jun 16 '16 at 20:36





















66














I see that neither is faster. The compiler generates the same machine code in each condition with a different value.



if(a < 901)
cmpl $900, -4(%rbp)
jg .L2

if(a <=901)
cmpl $901, -4(%rbp)
jg .L3


My example if is from GCC on x86_64 platform on Linux.



Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.



I noticed that if it is not a constant, then the same machine code is generated in either case.



int b;
if(a < b)
cmpl -4(%rbp), %eax
jge .L2

if(a <=b)
cmpl -4(%rbp), %eax
jg .L3





share|improve this answer





















  • 9





    Note that this is specific to x86.

    – Michael Petrotta
    Aug 27 '12 at 2:17






  • 10





    I think you should use that if(a <=900) to demonstrate that it generates exactly the same asm :)

    – Lipis
    Aug 27 '12 at 2:22








  • 2





    @AdrianCornish Sorry.. I edited it.. it's more or less the same.. but if you change the second if to <=900 then the asm code will be exactly the same :) It's pretty much the same now.. but you know.. for the OCD :)

    – Lipis
    Aug 27 '12 at 2:25






  • 3





    @Boann That might get reduced to if (true) and eliminated completely.

    – Qsario
    Aug 27 '12 at 2:32






  • 4





    No one has pointed out that this optimization only applies to constant comparisons. I can guarantee it will not be done like this for comparing two variables.

    – Jonathon Reinhart
    Aug 27 '12 at 3:05



















50














For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:



int compare_strict(double a, double b) { return a < b; }


On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.



Now consider this function instead:



int compare_loose(double a, double b) { return a <= b; }


This requires the same work as compare_strict above, but now there's two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.



You might think that the compiler could optimize the second function like so:



int compare_loose(double a, double b) { return ! (a > b); }


However this will incorrectly handle NaNs. NaN1 <= NaN2 and NaN1 > NaN2 need to both evaluate to false.






share|improve this answer


























  • Luckily it doesn't work like this on x86 (x87). fucomip sets ZF and CF.

    – Jonathon Reinhart
    Aug 27 '12 at 20:30






  • 3





    @JonathonReinhart: I think you're misunderstanding what the PowerPC is doing -- the condition register cr is the equivalent to flags like ZF and CF on the x86. (Although the CR is more flexible.) What the poster is talking about is moving the result to a GPR: which takes two instructions on PowerPC, but x86 has a conditional move instruction.

    – Dietrich Epp
    Aug 28 '12 at 6:19











  • @DietrichEpp What I meant to add after my statement was: Which you can then immediately jump based upon the value of EFLAGS. Sorry for not being clear.

    – Jonathon Reinhart
    Aug 28 '12 at 7:16






  • 1





    @JonathonReinhart: Yes, and you can also immediately jump based on the value of the CR. The answer is not talking about jumping, which is where the extra instructions come from.

    – Dietrich Epp
    Aug 28 '12 at 7:38



















34














Maybe the author of that unnamed book has read that a > 0 runs faster than a >= 1 and thinks that is true universally.



But it is because a 0 is involved (because CMP can, depending on the architecture, replaced e.g. with OR) and not because of the <.






share|improve this answer



















  • 1





    Sure, in a "debug" build, but it would take a bad compiler for (a >= 1) to run slower than (a > 0), since the former can be trivially transformed to the latter by the optimizer..

    – BeeOnRope
    Jun 16 '16 at 2:22








  • 1





    @BeeOnRope Sometimes I am surprised what complicated things an optimizer can optimize and on what easy things it fails to do so.

    – glglgl
    Jun 16 '16 at 7:31






  • 1





    Indeed, and it's always worth checking the asm output for the very few functions where it would matter. That said, the above transformation is very basic and has been performed even in simple compilers for decades.

    – BeeOnRope
    Jun 16 '16 at 20:27



















30














At the very least, if this were true a compiler could trivially optimise a <= b to !(a > b), and so even if the comparison itself were actually slower, with all but the most naive compiler you would not notice a difference.






share|improve this answer


























  • Why !(a>b) is optimised version of a <=b . Isnt !(a>b) 2 operation in one ?

    – Abhishek Singh
    Apr 7 '15 at 11:33






  • 3





    @AbhishekSingh NOT is just made by other instruction (je vs. jne)

    – Pavel Gatnar
    Apr 14 '15 at 16:03



















15














They have the same speed. Maybe in some special architecture what he/she said is right, but in the x86 family at least I know they are the same. Because for doing this the CPU will do a substraction (a - b) and then check the flags of the flag register. Two bits of that register are called ZF (zero Flag) and SF (sign flag), and it is done in one cycle, because it will do it with one mask operation.






share|improve this answer

































    13














    This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.



    That would be pretty unusual though, as the compiler could work around it, making it irrelevant.






    share|improve this answer



















    • 1





      IF there was a difference in the cyles. 1) it would not be detectable. 2) Any compiler worth its salt would already be making the transformation from the slow form to the faster form without changing the meaning of the code. So the resulting instruction planted would be identical.

      – Martin York
      Aug 27 '12 at 7:00











    • Agreed completely, it would be a pretty trivial and silly difference in any case. Certainly nothing to mention in a book that should be platform agnostic.

      – Telgin
      Aug 28 '12 at 3:46











    • @lttlrck: I get it. Took me a while (silly me). No they are not detectable because there are so many other things happening that make their measurement imposable. Processor stalls/ cache misses/ signals/ process swapping. Thus in a normal OS situation things on the single cycle level can not be physically measurable. If you can eliminate all that interference from you measurements (run it on a chip with on-board memory and no OS) then you still have granularity of your timers to worry about but theoretically if you run it long enough you could see something.

      – Martin York
      Aug 29 '12 at 6:57



















    11















    TL;DR answer



    For most combinations of architecture, compiler and language it will not be quicker.



    Full answer



    Other answers have concentrated on x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture specific, and is as likely to be an anti-optimisation as it is to be an optimisation.



    As such, I would suggest that this sort of micro-optimisation is an example of cargo cult programming rather than best software engineering practice.



    There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.



    Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that in practice, no comparison had any advantage over any other. Worst case though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single byte instruction which took a single cycle to run, so had the smallest overhead possible.



    Whether or not a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture specific micro-optimisations, otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.






    share|improve this answer

































      6














      You should not be able to notice the difference even if there is any. Besides, in practice, you'll have to do an additional a + 1 or a - 1 to make the condition stand unless you're going to use some magic constants, which is a very bad practice by all means.






      share|improve this answer



















      • 1





        What's the bad practice? Incrementing or decrementing a counter? How do you store index notation then?

        – jcolebrand
        Aug 27 '12 at 14:22








      • 5





        He means if you're doing comparison of 2 variable types. Of course it's trivial if you're setting the value for a loop or something. But if you have x <= y, and y is unknown, it would be slower to 'optimize' it to x < y + 1

        – JustinDanielson
        Aug 27 '12 at 21:48













      • @JustinDanielson agreed. Not to mention ugly, confusing, etc.

        – Jonathon Reinhart
        Aug 27 '12 at 23:49



















      3














      You could say that line is correct in most scripting languages, since the extra character results in slightly slower code processing.
      However, as the top answer pointed out, it should have no effect in C++, and anything being done with a scripting language probably isn't that concerned about optimization.






      share|improve this answer
























      • I somewhat disagree. In competitive programming, scripting languages often offer the quickest solution to a problem, but correct techniques (read: optimization) must be applied to get a correct solution.

        – Tyler Crompton
        Sep 5 '12 at 0:59



















      1














      When I wrote this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operand have a shorter 1-byte encoding for -128..127.



      For ARM and especially AArch64, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp w0, #0x00f000 would be encodeable, while cmp w0, #0x00effff might not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for AArch64.





      < vs. <= in general, including for runtime-variable conditions



      In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.



      But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.



      As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.



      Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)



      void foo(unsigned size) {
      unsigned upper_bound = size - 1; // or any calculation that could produce UINT_MAX
      for(unsigned i=0 ; i <= upper_bound ; i++)
      ...


      Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.



      (A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)



      In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.



      Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).



      But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (Iterating over all unsigned integers in a for loop C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)



      With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.



      Example: sum of integers from 1 to n



      Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.



      unsigned sum_1_to_n_finite(unsigned n) {
      unsigned total = 0;
      for (unsigned i = 0 ; i < n+1 ; ++i)
      total += i;
      return total;
      }


      x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer



       # clang7.0 -O3 closed-form
      cmp edi, -1 # n passed in EDI: x86-64 System V calling convention
      je .LBB1_1 # if (n == UINT_MAX) return 0; // C++ loop runs 0 times
      # else fall through into the closed-form calc
      mov ecx, edi # zero-extend n into RCX
      lea eax, [rdi - 1] # n-1
      imul rax, rcx # n * (n-1) # 64-bit
      shr rax # n * (n-1) / 2
      add eax, edi # n + (stuff / 2) = n * (n+1) / 2 # truncated to 32-bit
      ret # computed without possible overflow of the product before right shifting
      .LBB1_1:
      xor eax, eax
      ret


      But for the naive version, we just get a dumb loop from clang.



      unsigned sum_1_to_n_naive(unsigned n) {
      unsigned total = 0;
      for (unsigned i = 0 ; i<=n ; ++i)
      total += i;
      return total;
      }


      # clang7.0 -O3
      sum_1_to_n(unsigned int):
      xor ecx, ecx # i = 0
      xor eax, eax # retval = 0
      .LBB0_1: # do {
      add eax, ecx # retval += i
      add ecx, 1 # ++1
      cmp ecx, edi
      jbe .LBB0_1 # } while( i<n );
      ret




      GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.



      # "naive" inner loop
      .L3:
      add eax, 1 # do {
      paddd xmm0, xmm1 # vect_total_4.6, vect_vec_iv_.5
      paddd xmm1, xmm2 # vect_vec_iv_.5, tmp114
      cmp edx, eax # bnd.1, ivtmp.14 # bound and induction-variable tmp, I think.
      ja .L3 #, # }while( n > i )

      "finite" inner loop
      # before the loop:
      # xmm0 = 0 = totals
      # xmm1 = {0,1,2,3} = i
      # xmm2 = set1_epi32(4)
      .L13: # do {
      add eax, 1 # i++
      paddd xmm0, xmm1 # total[0..3] += i[0..3]
      paddd xmm1, xmm2 # i[0..3] += 4
      cmp eax, edx
      jne .L13 # }while( i != upper_limit );

      then horizontal sum xmm0
      and peeled cleanup for the last n%3 iterations, or something.


      It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.



      BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1/jnz instead of add eax,1/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.






      share|improve this answer


























      • Nice contrived example. What about your other comment about a potential effect on out of order execution due to EFLAGS use? Is it purely theoretical or can it actually happen that a JB leads to a better pipeline than a JBE?

        – rustyx
        Jan 20 at 12:05











      • @rustyx: did I comment that somewhere under another answer? Compilers aren't going to emit code that causes partial-flag stalls, and certainly not for a C < or <=. But sure, test ecx,ecx / bt eax, 3 / jbe will jump if ZF is set (ecx==0), or if CF is set (bit 3 of EAX==1), causing a partial flag stall on most CPUs because the flags it reads don't all come from the last instruction to write any flags. On Sandybridge-family, it doesn't actually stall, just needs to insert a merging uop. cmp/test write all flags, but bt leaves ZF unmodified. felixcloutier.com/x86/bt

        – Peter Cordes
        Jan 20 at 12:23











      Your Answer






      StackExchange.ifUsing("editor", function () {
      StackExchange.using("externalEditor", function () {
      StackExchange.using("snippets", function () {
      StackExchange.snippets.init();
      });
      });
      }, "code-snippets");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "1"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f12135518%2fis-faster-than%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      13 Answers
      13






      active

      oldest

      votes








      13 Answers
      13






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      1611














      No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will be typically implemented in two machine instructions:




      • A test or cmp instruction, which sets EFLAGS

      • And a Jcc (jump) instruction, depending on the comparison type (and code layout):


        • jne - Jump if not equal --> ZF = 0


        • jz - Jump if zero (equal) --> ZF = 1


        • jg - Jump if greater --> ZF = 0 and SF = OF

        • (etc...)






      Example (Edited for brevity) Compiled with $ gcc -m32 -S -masm=intel test.c



          if (a < b) {
      // Do something 1
      }


      Compiles to:



          mov     eax, DWORD PTR [esp+24]      ; a
      cmp eax, DWORD PTR [esp+28] ; b
      jge .L2 ; jump if a is >= b
      ; Do something 1
      .L2:


      And



          if (a <= b) {
      // Do something 2
      }


      Compiles to:



          mov     eax, DWORD PTR [esp+24]      ; a
      cmp eax, DWORD PTR [esp+28] ; b
      jg .L5 ; jump if a is > b
      ; Do something 2
      .L5:


      So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.





      I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping is made together under the Optimization Reference Manual, in Appendix C. Latency and Throughput.




      Latency — The number of clock cycles that are required for the
      execution core to complete the execution of all of the μops that form
      an instruction.



      Throughput — The number of clock cycles required to
      wait before the issue ports are free to accept the same instruction
      again. For many instructions, the throughput of an instruction can be
      significantly less than its latency




      The values for Jcc are:



            Latency   Throughput
      Jcc N/A 0.5


      with the following footnote on Jcc:




      7) Selection of conditional jump instructions should be based on the recommendation of section Section 3.4.1, “Branch Prediction Optimization,” to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.




      So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.



      If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)





      Edit: Floating Point



      This holds true for x87 floating point as well: (Pretty much same code as above, but with double instead of int.)



              fld     QWORD PTR [esp+32]
      fld QWORD PTR [esp+40]
      fucomip st, st(1) ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
      fstp st(0)
      seta al ; Set al if above (CF=0 and ZF=0).
      test al, al
      je .L2
      ; Do something 1
      .L2:

      fld QWORD PTR [esp+32]
      fld QWORD PTR [esp+40]
      fucomip st, st(1) ; (same thing as above)
      fstp st(0)
      setae al ; Set al if above or equal (CF=0).
      test al, al
      je .L5
      ; Do something 2
      .L5:
      leave
      ret





      share|improve this answer





















      • 228





        @Dyppl actually jg and jnle are the same instruction, 7F :-)

        – Jonathon Reinhart
        Aug 27 '12 at 5:30






      • 12





        Not to mention that the optimizer can modify the code if indeed one option is faster than the other.

        – Elazar Leibovich
        Aug 28 '12 at 6:33






      • 1





        just because something results in the same amount of instructions doesn't necessarily mean that the sum total time of executing all those instructions will be the same. Actually more instructions could be executed faster. Instructions per cycle is not a fixed number, it varies depending on the instructions.

        – jontejj
        May 31 '13 at 16:51






      • 16





        @jontejj I'm very much aware of that. Did you even read my answer? I didn't state anything about the same number of instructions, I stated that they are compiled to essentially the exact same instrutcions, except one jump instruction is looking at one flag, and the other jump instruction is looking at two flags. I believe I've given more than adequate evidence to show that they are semantically identical.

        – Jonathon Reinhart
        Jun 1 '13 at 5:22






      • 1





        @jontejj You make a very good point. For as much visibility as this answer gets, I should probably give it a little bit of a cleanup. Thanks for the feedback.

        – Jonathon Reinhart
        Jun 3 '13 at 14:20
















      1611














      No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will be typically implemented in two machine instructions:




      • A test or cmp instruction, which sets EFLAGS

      • And a Jcc (jump) instruction, depending on the comparison type (and code layout):


        • jne - Jump if not equal --> ZF = 0


        • jz - Jump if zero (equal) --> ZF = 1


        • jg - Jump if greater --> ZF = 0 and SF = OF

        • (etc...)






      Example (Edited for brevity) Compiled with $ gcc -m32 -S -masm=intel test.c



          if (a < b) {
      // Do something 1
      }


      Compiles to:



          mov     eax, DWORD PTR [esp+24]      ; a
      cmp eax, DWORD PTR [esp+28] ; b
      jge .L2 ; jump if a is >= b
      ; Do something 1
      .L2:


      And



          if (a <= b) {
      // Do something 2
      }


      Compiles to:



          mov     eax, DWORD PTR [esp+24]      ; a
      cmp eax, DWORD PTR [esp+28] ; b
      jg .L5 ; jump if a is > b
      ; Do something 2
      .L5:


      So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.





      I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping is made together under the Optimization Reference Manual, in Appendix C. Latency and Throughput.




      Latency — The number of clock cycles that are required for the
      execution core to complete the execution of all of the μops that form
      an instruction.



      Throughput — The number of clock cycles required to
      wait before the issue ports are free to accept the same instruction
      again. For many instructions, the throughput of an instruction can be
      significantly less than its latency




      The values for Jcc are:



            Latency   Throughput
      Jcc N/A 0.5


      with the following footnote on Jcc:




      7) Selection of conditional jump instructions should be based on the recommendation of section Section 3.4.1, “Branch Prediction Optimization,” to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.




      So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.



      If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)





      Edit: Floating Point



      This holds true for x87 floating point as well: (Pretty much same code as above, but with double instead of int.)



              fld     QWORD PTR [esp+32]
      fld QWORD PTR [esp+40]
      fucomip st, st(1) ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
      fstp st(0)
      seta al ; Set al if above (CF=0 and ZF=0).
      test al, al
      je .L2
      ; Do something 1
      .L2:

      fld QWORD PTR [esp+32]
      fld QWORD PTR [esp+40]
      fucomip st, st(1) ; (same thing as above)
      fstp st(0)
      setae al ; Set al if above or equal (CF=0).
      test al, al
      je .L5
      ; Do something 2
      .L5:
      leave
      ret





      share|improve this answer





















      • 228





        @Dyppl actually jg and jnle are the same instruction, 7F :-)

        – Jonathon Reinhart
        Aug 27 '12 at 5:30






      • 12





        Not to mention that the optimizer can modify the code if indeed one option is faster than the other.

        – Elazar Leibovich
        Aug 28 '12 at 6:33






      • 1





        just because something results in the same amount of instructions doesn't necessarily mean that the sum total time of executing all those instructions will be the same. Actually more instructions could be executed faster. Instructions per cycle is not a fixed number, it varies depending on the instructions.

        – jontejj
        May 31 '13 at 16:51






      • 16





        @jontejj I'm very much aware of that. Did you even read my answer? I didn't state anything about the same number of instructions, I stated that they are compiled to essentially the exact same instrutcions, except one jump instruction is looking at one flag, and the other jump instruction is looking at two flags. I believe I've given more than adequate evidence to show that they are semantically identical.

        – Jonathon Reinhart
        Jun 1 '13 at 5:22






      • 1





        @jontejj You make a very good point. For as much visibility as this answer gets, I should probably give it a little bit of a cleanup. Thanks for the feedback.

        – Jonathon Reinhart
        Jun 3 '13 at 14:20














      1611












      1611








      1611







      No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will be typically implemented in two machine instructions:




      • A test or cmp instruction, which sets EFLAGS

      • And a Jcc (jump) instruction, depending on the comparison type (and code layout):


        • jne - Jump if not equal --> ZF = 0


        • jz - Jump if zero (equal) --> ZF = 1


        • jg - Jump if greater --> ZF = 0 and SF = OF

        • (etc...)






      Example (Edited for brevity) Compiled with $ gcc -m32 -S -masm=intel test.c



          if (a < b) {
      // Do something 1
      }


      Compiles to:



          mov     eax, DWORD PTR [esp+24]      ; a
      cmp eax, DWORD PTR [esp+28] ; b
      jge .L2 ; jump if a is >= b
      ; Do something 1
      .L2:


      And



          if (a <= b) {
      // Do something 2
      }


      Compiles to:



          mov     eax, DWORD PTR [esp+24]      ; a
      cmp eax, DWORD PTR [esp+28] ; b
      jg .L5 ; jump if a is > b
      ; Do something 2
      .L5:


      So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.





      I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping is made together under the Optimization Reference Manual, in Appendix C. Latency and Throughput.




      Latency — The number of clock cycles that are required for the
      execution core to complete the execution of all of the μops that form
      an instruction.



      Throughput — The number of clock cycles required to
      wait before the issue ports are free to accept the same instruction
      again. For many instructions, the throughput of an instruction can be
      significantly less than its latency




      The values for Jcc are:



            Latency   Throughput
      Jcc N/A 0.5


      with the following footnote on Jcc:




      7) Selection of conditional jump instructions should be based on the recommendation of section Section 3.4.1, “Branch Prediction Optimization,” to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.




      So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.



      If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)





      Edit: Floating Point



      This holds true for x87 floating point as well: (Pretty much same code as above, but with double instead of int.)



              fld     QWORD PTR [esp+32]
      fld QWORD PTR [esp+40]
      fucomip st, st(1) ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
      fstp st(0)
      seta al ; Set al if above (CF=0 and ZF=0).
      test al, al
      je .L2
      ; Do something 1
      .L2:

      fld QWORD PTR [esp+32]
      fld QWORD PTR [esp+40]
      fucomip st, st(1) ; (same thing as above)
      fstp st(0)
      setae al ; Set al if above or equal (CF=0).
      test al, al
      je .L5
      ; Do something 2
      .L5:
      leave
      ret





      share|improve this answer















      No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will be typically implemented in two machine instructions:




      • A test or cmp instruction, which sets EFLAGS

      • And a Jcc (jump) instruction, depending on the comparison type (and code layout):


        • jne - Jump if not equal --> ZF = 0


        • jz - Jump if zero (equal) --> ZF = 1


        • jg - Jump if greater --> ZF = 0 and SF = OF

        • (etc...)






      Example (Edited for brevity) Compiled with $ gcc -m32 -S -masm=intel test.c



          if (a < b) {
      // Do something 1
      }


      Compiles to:



          mov     eax, DWORD PTR [esp+24]      ; a
      cmp eax, DWORD PTR [esp+28] ; b
      jge .L2 ; jump if a is >= b
      ; Do something 1
      .L2:


      And



          if (a <= b) {
      // Do something 2
      }


      Compiles to:



          mov     eax, DWORD PTR [esp+24]      ; a
      cmp eax, DWORD PTR [esp+28] ; b
      jg .L5 ; jump if a is > b
      ; Do something 2
      .L5:


      So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.





      I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping is made together under the Optimization Reference Manual, in Appendix C. Latency and Throughput.




      Latency — The number of clock cycles that are required for the
      execution core to complete the execution of all of the μops that form
      an instruction.



      Throughput — The number of clock cycles required to
      wait before the issue ports are free to accept the same instruction
      again. For many instructions, the throughput of an instruction can be
      significantly less than its latency




      The values for Jcc are:



            Latency   Throughput
      Jcc N/A 0.5


      with the following footnote on Jcc:




      7) Selection of conditional jump instructions should be based on the recommendation of section Section 3.4.1, “Branch Prediction Optimization,” to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.




      So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.



      If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)





      Edit: Floating Point



      This holds true for x87 floating point as well: (Pretty much same code as above, but with double instead of int.)



              fld     QWORD PTR [esp+32]
      fld QWORD PTR [esp+40]
      fucomip st, st(1) ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
      fstp st(0)
      seta al ; Set al if above (CF=0 and ZF=0).
      test al, al
      je .L2
      ; Do something 1
      .L2:

      fld QWORD PTR [esp+32]
      fld QWORD PTR [esp+40]
      fucomip st, st(1) ; (same thing as above)
      fstp st(0)
      setae al ; Set al if above or equal (CF=0).
      test al, al
      je .L5
      ; Do something 2
      .L5:
      leave
      ret






      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Aug 27 '12 at 20:28

























      answered Aug 27 '12 at 2:13









      Jonathon ReinhartJonathon Reinhart

      90.7k20164232




      90.7k20164232








      • 228





        @Dyppl actually jg and jnle are the same instruction, 7F :-)

        – Jonathon Reinhart
        Aug 27 '12 at 5:30






      • 12





        Not to mention that the optimizer can modify the code if indeed one option is faster than the other.

        – Elazar Leibovich
        Aug 28 '12 at 6:33






      • 1





        just because something results in the same amount of instructions doesn't necessarily mean that the sum total time of executing all those instructions will be the same. Actually more instructions could be executed faster. Instructions per cycle is not a fixed number, it varies depending on the instructions.

        – jontejj
        May 31 '13 at 16:51






      • 16





        @jontejj I'm very much aware of that. Did you even read my answer? I didn't state anything about the same number of instructions, I stated that they are compiled to essentially the exact same instrutcions, except one jump instruction is looking at one flag, and the other jump instruction is looking at two flags. I believe I've given more than adequate evidence to show that they are semantically identical.

        – Jonathon Reinhart
        Jun 1 '13 at 5:22






      • 1





        @jontejj You make a very good point. For as much visibility as this answer gets, I should probably give it a little bit of a cleanup. Thanks for the feedback.

        – Jonathon Reinhart
        Jun 3 '13 at 14:20














      • 228





        @Dyppl actually jg and jnle are the same instruction, 7F :-)

        – Jonathon Reinhart
        Aug 27 '12 at 5:30






      • 12





        Not to mention that the optimizer can modify the code if indeed one option is faster than the other.

        – Elazar Leibovich
        Aug 28 '12 at 6:33






      • 1





        just because something results in the same amount of instructions doesn't necessarily mean that the sum total time of executing all those instructions will be the same. Actually more instructions could be executed faster. Instructions per cycle is not a fixed number, it varies depending on the instructions.

        – jontejj
        May 31 '13 at 16:51






      • 16





        @jontejj I'm very much aware of that. Did you even read my answer? I didn't state anything about the same number of instructions, I stated that they are compiled to essentially the exact same instrutcions, except one jump instruction is looking at one flag, and the other jump instruction is looking at two flags. I believe I've given more than adequate evidence to show that they are semantically identical.

        – Jonathon Reinhart
        Jun 1 '13 at 5:22






      • 1





        @jontejj You make a very good point. For as much visibility as this answer gets, I should probably give it a little bit of a cleanup. Thanks for the feedback.

        – Jonathon Reinhart
        Jun 3 '13 at 14:20








      228




      228





      @Dyppl actually jg and jnle are the same instruction, 7F :-)

      – Jonathon Reinhart
      Aug 27 '12 at 5:30





      @Dyppl actually jg and jnle are the same instruction, 7F :-)

      – Jonathon Reinhart
      Aug 27 '12 at 5:30




      12




      12





      Not to mention that the optimizer can modify the code if indeed one option is faster than the other.

      – Elazar Leibovich
      Aug 28 '12 at 6:33





      Not to mention that the optimizer can modify the code if indeed one option is faster than the other.

      – Elazar Leibovich
      Aug 28 '12 at 6:33




      1




      1





      just because something results in the same amount of instructions doesn't necessarily mean that the sum total time of executing all those instructions will be the same. Actually more instructions could be executed faster. Instructions per cycle is not a fixed number, it varies depending on the instructions.

      – jontejj
      May 31 '13 at 16:51





      just because something results in the same amount of instructions doesn't necessarily mean that the sum total time of executing all those instructions will be the same. Actually more instructions could be executed faster. Instructions per cycle is not a fixed number, it varies depending on the instructions.

      – jontejj
      May 31 '13 at 16:51




      16




      16





      @jontejj I'm very much aware of that. Did you even read my answer? I didn't state anything about the same number of instructions, I stated that they are compiled to essentially the exact same instrutcions, except one jump instruction is looking at one flag, and the other jump instruction is looking at two flags. I believe I've given more than adequate evidence to show that they are semantically identical.

      – Jonathon Reinhart
      Jun 1 '13 at 5:22





      @jontejj I'm very much aware of that. Did you even read my answer? I didn't state anything about the same number of instructions, I stated that they are compiled to essentially the exact same instrutcions, except one jump instruction is looking at one flag, and the other jump instruction is looking at two flags. I believe I've given more than adequate evidence to show that they are semantically identical.

      – Jonathon Reinhart
      Jun 1 '13 at 5:22




      1




      1





      @jontejj You make a very good point. For as much visibility as this answer gets, I should probably give it a little bit of a cleanup. Thanks for the feedback.

      – Jonathon Reinhart
      Jun 3 '13 at 14:20





      @jontejj You make a very good point. For as much visibility as this answer gets, I should probably give it a little bit of a cleanup. Thanks for the feedback.

      – Jonathon Reinhart
      Jun 3 '13 at 14:20













      575














      Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.



      Comparison     Subtraction
      ---------- -----------
      A < B --> A - B < 0
      A = B --> A - B = 0
      A > B --> A - B > 0


      Now, when A < B the subtraction has to borrow a high-bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero which implied equality.



      There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.



      Now, to get at the heart of the matter, let's expand the previous table to include the carry and zero bit results.



      Comparison     Subtraction  Carry Bit  Zero Bit
      ---------- ----------- --------- --------
      A < B --> A - B < 0 0 0
      A = B --> A - B = 0 1 1
      A > B --> A - B > 0 1 0


      So, implementing a branch for A < B can be done in one instruction, because the carry bit is clear only in this case, , that is,



      ;; Implementation of "if (A < B) goto address;"
      cmp A, B ;; compare A to B
      bcz address ;; Branch if Carry is Zero to the new address


      But, if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.



      ;; Implementation of "if (A <= B) goto address;"
      cmp A, B ;; compare A to B
      bcz address ;; branch if A < B
      bzs address ;; also, Branch if the Zero bit is Set


      So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speed and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.






      share|improve this answer





















      • 10





        Additionally, architectures like x86 implement instructions like jge, which test both the zero and sign/carry flags.

        – greyfade
        Aug 27 '12 at 18:23








      • 130





        +1 for the historical perspective.

        – sbi
        Aug 27 '12 at 19:09






      • 7





        This is true on the 8080. It has instructions to jump on zero and jump on minus, but none that can test both simultaneously.

        – user597225
        Aug 27 '12 at 22:43








      • 41





        +1 This is the only answer that explains why the author might have written what he did.

        – Leo
        Aug 28 '12 at 7:25






      • 26





        Even on the 8080 a <= test can be implemented in one instruction with swapping the operands and testing for not < (equivalent to >=) This is the desired <= with swapped operands: cmp B,A; bcs addr. That's the reasoning this test was omitted by Intel, they considered it redundant and you couldn't afford redundant instructions at those times :-)

        – Gunther Piez
        Aug 29 '12 at 11:10


















      575














      Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.



      Comparison     Subtraction
      ---------- -----------
      A < B --> A - B < 0
      A = B --> A - B = 0
      A > B --> A - B > 0


      Now, when A < B the subtraction has to borrow a high-bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero which implied equality.



      There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.



      Now, to get at the heart of the matter, let's expand the previous table to include the carry and zero bit results.



      Comparison     Subtraction  Carry Bit  Zero Bit
      ---------- ----------- --------- --------
      A < B --> A - B < 0 0 0
      A = B --> A - B = 0 1 1
      A > B --> A - B > 0 1 0


      So, implementing a branch for A < B can be done in one instruction, because the carry bit is clear only in this case, , that is,



      ;; Implementation of "if (A < B) goto address;"
      cmp A, B ;; compare A to B
      bcz address ;; Branch if Carry is Zero to the new address


      But, if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.



      ;; Implementation of "if (A <= B) goto address;"
      cmp A, B ;; compare A to B
      bcz address ;; branch if A < B
      bzs address ;; also, Branch if the Zero bit is Set


      So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speed and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.






      share|improve this answer





















      • 10





        Additionally, architectures like x86 implement instructions like jge, which test both the zero and sign/carry flags.

        – greyfade
        Aug 27 '12 at 18:23








      • 130





        +1 for the historical perspective.

        – sbi
        Aug 27 '12 at 19:09






      • 7





        This is true on the 8080. It has instructions to jump on zero and jump on minus, but none that can test both simultaneously.

        – user597225
        Aug 27 '12 at 22:43








      • 41





        +1 This is the only answer that explains why the author might have written what he did.

        – Leo
        Aug 28 '12 at 7:25






      • 26





        Even on the 8080 a <= test can be implemented in one instruction with swapping the operands and testing for not < (equivalent to >=) This is the desired <= with swapped operands: cmp B,A; bcs addr. That's the reasoning this test was omitted by Intel, they considered it redundant and you couldn't afford redundant instructions at those times :-)

        – Gunther Piez
        Aug 29 '12 at 11:10
















      575












      575








      575







      Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.



      Comparison     Subtraction
      ---------- -----------
      A < B --> A - B < 0
      A = B --> A - B = 0
      A > B --> A - B > 0


      Now, when A < B the subtraction has to borrow a high-bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero which implied equality.



      There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.



      Now, to get at the heart of the matter, let's expand the previous table to include the carry and zero bit results.



      Comparison     Subtraction  Carry Bit  Zero Bit
      ---------- ----------- --------- --------
      A < B --> A - B < 0 0 0
      A = B --> A - B = 0 1 1
      A > B --> A - B > 0 1 0


      So, implementing a branch for A < B can be done in one instruction, because the carry bit is clear only in this case, , that is,



      ;; Implementation of "if (A < B) goto address;"
      cmp A, B ;; compare A to B
      bcz address ;; Branch if Carry is Zero to the new address


      But, if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.



      ;; Implementation of "if (A <= B) goto address;"
      cmp A, B ;; compare A to B
      bcz address ;; branch if A < B
      bzs address ;; also, Branch if the Zero bit is Set


      So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speed and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.






      share|improve this answer















      Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.



      Comparison     Subtraction
      ---------- -----------
      A < B --> A - B < 0
      A = B --> A - B = 0
      A > B --> A - B > 0


      Now, when A < B the subtraction has to borrow a high-bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero which implied equality.



      There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.



      Now, to get at the heart of the matter, let's expand the previous table to include the carry and zero bit results.



      Comparison     Subtraction  Carry Bit  Zero Bit
      ---------- ----------- --------- --------
      A < B --> A - B < 0 0 0
      A = B --> A - B = 0 1 1
      A > B --> A - B > 0 1 0


      So, implementing a branch for A < B can be done in one instruction, because the carry bit is clear only in this case, , that is,



      ;; Implementation of "if (A < B) goto address;"
      cmp A, B ;; compare A to B
      bcz address ;; Branch if Carry is Zero to the new address


      But, if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.



      ;; Implementation of "if (A <= B) goto address;"
      cmp A, B ;; compare A to B
      bcz address ;; branch if A < B
      bzs address ;; also, Branch if the Zero bit is Set


      So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speed and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Nov 22 '12 at 5:54









      Peter Mortensen

      13.6k1985111




      13.6k1985111










      answered Aug 27 '12 at 17:53









      LucasLucas

      7,02212244




      7,02212244








      • 10





        Additionally, architectures like x86 implement instructions like jge, which test both the zero and sign/carry flags.

        – greyfade
        Aug 27 '12 at 18:23








      • 130





        +1 for the historical perspective.

        – sbi
        Aug 27 '12 at 19:09






      • 7





        This is true on the 8080. It has instructions to jump on zero and jump on minus, but none that can test both simultaneously.

        – user597225
        Aug 27 '12 at 22:43








      • 41





        +1 This is the only answer that explains why the author might have written what he did.

        – Leo
        Aug 28 '12 at 7:25






      • 26





        Even on the 8080 a <= test can be implemented in one instruction with swapping the operands and testing for not < (equivalent to >=) This is the desired <= with swapped operands: cmp B,A; bcs addr. That's the reasoning this test was omitted by Intel, they considered it redundant and you couldn't afford redundant instructions at those times :-)

        – Gunther Piez
        Aug 29 '12 at 11:10
















      • 10





        Additionally, architectures like x86 implement instructions like jge, which test both the zero and sign/carry flags.

        – greyfade
        Aug 27 '12 at 18:23








      • 130





        +1 for the historical perspective.

        – sbi
        Aug 27 '12 at 19:09






      • 7





        This is true on the 8080. It has instructions to jump on zero and jump on minus, but none that can test both simultaneously.

        – user597225
        Aug 27 '12 at 22:43








      • 41





        +1 This is the only answer that explains why the author might have written what he did.

        – Leo
        Aug 28 '12 at 7:25






      • 26





        Even on the 8080 a <= test can be implemented in one instruction with swapping the operands and testing for not < (equivalent to >=) This is the desired <= with swapped operands: cmp B,A; bcs addr. That's the reasoning this test was omitted by Intel, they considered it redundant and you couldn't afford redundant instructions at those times :-)

        – Gunther Piez
        Aug 29 '12 at 11:10










      10




      10





      Additionally, architectures like x86 implement instructions like jge, which test both the zero and sign/carry flags.

      – greyfade
      Aug 27 '12 at 18:23







      Additionally, architectures like x86 implement instructions like jge, which test both the zero and sign/carry flags.

      – greyfade
      Aug 27 '12 at 18:23






      130




      130





      +1 for the historical perspective.

      – sbi
      Aug 27 '12 at 19:09





      +1 for the historical perspective.

      – sbi
      Aug 27 '12 at 19:09




      7




      7





      This is true on the 8080. It has instructions to jump on zero and jump on minus, but none that can test both simultaneously.

      – user597225
      Aug 27 '12 at 22:43







      This is true on the 8080. It has instructions to jump on zero and jump on minus, but none that can test both simultaneously.

      – user597225
      Aug 27 '12 at 22:43






      41




      41





      +1 This is the only answer that explains why the author might have written what he did.

      – Leo
      Aug 28 '12 at 7:25





      +1 This is the only answer that explains why the author might have written what he did.

      – Leo
      Aug 28 '12 at 7:25




      26




      26





      Even on the 8080 a <= test can be implemented in one instruction with swapping the operands and testing for not < (equivalent to >=) This is the desired <= with swapped operands: cmp B,A; bcs addr. That's the reasoning this test was omitted by Intel, they considered it redundant and you couldn't afford redundant instructions at those times :-)

      – Gunther Piez
      Aug 29 '12 at 11:10







      Even on the 8080 a <= test can be implemented in one instruction with swapping the operands and testing for not < (equivalent to >=) This is the desired <= with swapped operands: cmp B,A; bcs addr. That's the reasoning this test was omitted by Intel, they considered it redundant and you couldn't afford redundant instructions at those times :-)

      – Gunther Piez
      Aug 29 '12 at 11:10













      85














      Assuming we're talking about internal integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.



      If there was some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).






      share|improve this answer





















      • 6





        +1 I agree. Neither < nor <= have speed until the compiler decides which speed they'll have. This is a very simple optimisation for compilers when you consider that they generally already perform dead code optimisation, tail call optimisation, loop hoisting (and unrolling, on occasions), automatic parallelisation of various loops, etc... Why waste time pondering premature optimisations? Get a prototype running, profile it to determine where the most significant optimisations lie, perform those optimisations in order of significance and profile again along the way to measure progress...

        – autistic
        Jun 10 '13 at 2:52











      • There are still some edge cases where a comparison having one constant value could be slower under <=, e.g., when the transformation from (a < C) to (a <= C-1) (for some constant C) causes C to be more difficult to encode in the instruction set. For example, an instruction set may be able to represent signed constants from -127 to 128 in a compact form in comparisons, but constants outside that range have to loaded using either a longer, slower encoding, or another instruction entirely. So a comparison like (a < -127) may not have a straightforward transformation.

        – BeeOnRope
        Jun 16 '16 at 2:18













      • @BeeOnRope The issue was not whether performing operations that differed due to having different constants in them could affect performance but whether expressing the same operation using different constants could affect performance. So we're not comparing a > 127 to a > 128 because you have no choice there, you use the one you need. We're comparing a > 127 to a >= 128, which can't require different encoding or different instructions because they have the same truth table. Any encoding of one is equally an encoding of the other.

        – David Schwartz
        Jun 16 '16 at 4:36













      • I was responding in a general way to your statement that "If there was some platform where [<= was slower] the compiler should always convert <= to < for constants". As far as I know, that transformation involves changing the constant. E.g., a <= 42 is compiled as a < 43 because < is faster. In some edge cases, such a transformation wouldn't be fruitful because the new constant may require more or slower instructions. Of course a > 127 and a >= 128 are equivalent and a compiler should encode both forms in the (same) fastest way, but that's not inconsistent with what I said.

        – BeeOnRope
        Jun 16 '16 at 20:36


















      85














      Assuming we're talking about internal integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.



      If there was some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).






      share|improve this answer





















      • 6





        +1 I agree. Neither < nor <= have speed until the compiler decides which speed they'll have. This is a very simple optimisation for compilers when you consider that they generally already perform dead code optimisation, tail call optimisation, loop hoisting (and unrolling, on occasions), automatic parallelisation of various loops, etc... Why waste time pondering premature optimisations? Get a prototype running, profile it to determine where the most significant optimisations lie, perform those optimisations in order of significance and profile again along the way to measure progress...

        – autistic
        Jun 10 '13 at 2:52











      • There are still some edge cases where a comparison having one constant value could be slower under <=, e.g., when the transformation from (a < C) to (a <= C-1) (for some constant C) causes C to be more difficult to encode in the instruction set. For example, an instruction set may be able to represent signed constants from -127 to 128 in a compact form in comparisons, but constants outside that range have to loaded using either a longer, slower encoding, or another instruction entirely. So a comparison like (a < -127) may not have a straightforward transformation.

        – BeeOnRope
        Jun 16 '16 at 2:18













      • @BeeOnRope The issue was not whether performing operations that differed due to having different constants in them could affect performance but whether expressing the same operation using different constants could affect performance. So we're not comparing a > 127 to a > 128 because you have no choice there, you use the one you need. We're comparing a > 127 to a >= 128, which can't require different encoding or different instructions because they have the same truth table. Any encoding of one is equally an encoding of the other.

        – David Schwartz
        Jun 16 '16 at 4:36













      • I was responding in a general way to your statement that "If there was some platform where [<= was slower] the compiler should always convert <= to < for constants". As far as I know, that transformation involves changing the constant. E.g., a <= 42 is compiled as a < 43 because < is faster. In some edge cases, such a transformation wouldn't be fruitful because the new constant may require more or slower instructions. Of course a > 127 and a >= 128 are equivalent and a compiler should encode both forms in the (same) fastest way, but that's not inconsistent with what I said.

        – BeeOnRope
        Jun 16 '16 at 20:36
















      85












      85








      85







      Assuming we're talking about internal integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.



      If there was some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).






      share|improve this answer















      Assuming we're talking about internal integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.



      If there was some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Aug 27 '12 at 2:58

























      answered Aug 27 '12 at 2:31









      David SchwartzDavid Schwartz

      137k14144225




      137k14144225








      • 6





        +1 I agree. Neither < nor <= have speed until the compiler decides which speed they'll have. This is a very simple optimisation for compilers when you consider that they generally already perform dead code optimisation, tail call optimisation, loop hoisting (and unrolling, on occasions), automatic parallelisation of various loops, etc... Why waste time pondering premature optimisations? Get a prototype running, profile it to determine where the most significant optimisations lie, perform those optimisations in order of significance and profile again along the way to measure progress...

        – autistic
        Jun 10 '13 at 2:52











      • There are still some edge cases where a comparison having one constant value could be slower under <=, e.g., when the transformation from (a < C) to (a <= C-1) (for some constant C) causes C to be more difficult to encode in the instruction set. For example, an instruction set may be able to represent signed constants from -127 to 128 in a compact form in comparisons, but constants outside that range have to loaded using either a longer, slower encoding, or another instruction entirely. So a comparison like (a < -127) may not have a straightforward transformation.

        – BeeOnRope
        Jun 16 '16 at 2:18













      • @BeeOnRope The issue was not whether performing operations that differed due to having different constants in them could affect performance but whether expressing the same operation using different constants could affect performance. So we're not comparing a > 127 to a > 128 because you have no choice there, you use the one you need. We're comparing a > 127 to a >= 128, which can't require different encoding or different instructions because they have the same truth table. Any encoding of one is equally an encoding of the other.

        – David Schwartz
        Jun 16 '16 at 4:36













      • I was responding in a general way to your statement that "If there was some platform where [<= was slower] the compiler should always convert <= to < for constants". As far as I know, that transformation involves changing the constant. E.g., a <= 42 is compiled as a < 43 because < is faster. In some edge cases, such a transformation wouldn't be fruitful because the new constant may require more or slower instructions. Of course a > 127 and a >= 128 are equivalent and a compiler should encode both forms in the (same) fastest way, but that's not inconsistent with what I said.

        – BeeOnRope
        Jun 16 '16 at 20:36
















      • 6





        +1 I agree. Neither < nor <= have speed until the compiler decides which speed they'll have. This is a very simple optimisation for compilers when you consider that they generally already perform dead code optimisation, tail call optimisation, loop hoisting (and unrolling, on occasions), automatic parallelisation of various loops, etc... Why waste time pondering premature optimisations? Get a prototype running, profile it to determine where the most significant optimisations lie, perform those optimisations in order of significance and profile again along the way to measure progress...

        – autistic
        Jun 10 '13 at 2:52











      • There are still some edge cases where a comparison having one constant value could be slower under <=, e.g., when the transformation from (a < C) to (a <= C-1) (for some constant C) causes C to be more difficult to encode in the instruction set. For example, an instruction set may be able to represent signed constants from -127 to 128 in a compact form in comparisons, but constants outside that range have to loaded using either a longer, slower encoding, or another instruction entirely. So a comparison like (a < -127) may not have a straightforward transformation.

        – BeeOnRope
        Jun 16 '16 at 2:18













      • @BeeOnRope The issue was not whether performing operations that differed due to having different constants in them could affect performance but whether expressing the same operation using different constants could affect performance. So we're not comparing a > 127 to a > 128 because you have no choice there, you use the one you need. We're comparing a > 127 to a >= 128, which can't require different encoding or different instructions because they have the same truth table. Any encoding of one is equally an encoding of the other.

        – David Schwartz
        Jun 16 '16 at 4:36













      • I was responding in a general way to your statement that "If there was some platform where [<= was slower] the compiler should always convert <= to < for constants". As far as I know, that transformation involves changing the constant. E.g., a <= 42 is compiled as a < 43 because < is faster. In some edge cases, such a transformation wouldn't be fruitful because the new constant may require more or slower instructions. Of course a > 127 and a >= 128 are equivalent and a compiler should encode both forms in the (same) fastest way, but that's not inconsistent with what I said.

        – BeeOnRope
        Jun 16 '16 at 20:36










      6




      6





      +1 I agree. Neither < nor <= have speed until the compiler decides which speed they'll have. This is a very simple optimisation for compilers when you consider that they generally already perform dead code optimisation, tail call optimisation, loop hoisting (and unrolling, on occasions), automatic parallelisation of various loops, etc... Why waste time pondering premature optimisations? Get a prototype running, profile it to determine where the most significant optimisations lie, perform those optimisations in order of significance and profile again along the way to measure progress...

      – autistic
      Jun 10 '13 at 2:52





      +1 I agree. Neither < nor <= have speed until the compiler decides which speed they'll have. This is a very simple optimisation for compilers when you consider that they generally already perform dead code optimisation, tail call optimisation, loop hoisting (and unrolling, on occasions), automatic parallelisation of various loops, etc... Why waste time pondering premature optimisations? Get a prototype running, profile it to determine where the most significant optimisations lie, perform those optimisations in order of significance and profile again along the way to measure progress...

      – autistic
      Jun 10 '13 at 2:52













      There are still some edge cases where a comparison having one constant value could be slower under <=, e.g., when the transformation from (a < C) to (a <= C-1) (for some constant C) causes C to be more difficult to encode in the instruction set. For example, an instruction set may be able to represent signed constants from -127 to 128 in a compact form in comparisons, but constants outside that range have to loaded using either a longer, slower encoding, or another instruction entirely. So a comparison like (a < -127) may not have a straightforward transformation.

      – BeeOnRope
      Jun 16 '16 at 2:18







      There are still some edge cases where a comparison having one constant value could be slower under <=, e.g., when the transformation from (a < C) to (a <= C-1) (for some constant C) causes C to be more difficult to encode in the instruction set. For example, an instruction set may be able to represent signed constants from -127 to 128 in a compact form in comparisons, but constants outside that range have to loaded using either a longer, slower encoding, or another instruction entirely. So a comparison like (a < -127) may not have a straightforward transformation.

      – BeeOnRope
      Jun 16 '16 at 2:18















      @BeeOnRope The issue was not whether performing operations that differed due to having different constants in them could affect performance but whether expressing the same operation using different constants could affect performance. So we're not comparing a > 127 to a > 128 because you have no choice there, you use the one you need. We're comparing a > 127 to a >= 128, which can't require different encoding or different instructions because they have the same truth table. Any encoding of one is equally an encoding of the other.

      – David Schwartz
      Jun 16 '16 at 4:36







      @BeeOnRope The issue was not whether performing operations that differed due to having different constants in them could affect performance but whether expressing the same operation using different constants could affect performance. So we're not comparing a > 127 to a > 128 because you have no choice there, you use the one you need. We're comparing a > 127 to a >= 128, which can't require different encoding or different instructions because they have the same truth table. Any encoding of one is equally an encoding of the other.

      – David Schwartz
      Jun 16 '16 at 4:36















      I was responding in a general way to your statement that "If there was some platform where [<= was slower] the compiler should always convert <= to < for constants". As far as I know, that transformation involves changing the constant. E.g., a <= 42 is compiled as a < 43 because < is faster. In some edge cases, such a transformation wouldn't be fruitful because the new constant may require more or slower instructions. Of course a > 127 and a >= 128 are equivalent and a compiler should encode both forms in the (same) fastest way, but that's not inconsistent with what I said.

      – BeeOnRope
      Jun 16 '16 at 20:36







      I was responding in a general way to your statement that "If there was some platform where [<= was slower] the compiler should always convert <= to < for constants". As far as I know, that transformation involves changing the constant. E.g., a <= 42 is compiled as a < 43 because < is faster. In some edge cases, such a transformation wouldn't be fruitful because the new constant may require more or slower instructions. Of course a > 127 and a >= 128 are equivalent and a compiler should encode both forms in the (same) fastest way, but that's not inconsistent with what I said.

      – BeeOnRope
      Jun 16 '16 at 20:36













      66














      I see that neither is faster. The compiler generates the same machine code in each condition with a different value.



      if(a < 901)
      cmpl $900, -4(%rbp)
      jg .L2

      if(a <=901)
      cmpl $901, -4(%rbp)
      jg .L3


      My example if is from GCC on x86_64 platform on Linux.



      Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.



      I noticed that if it is not a constant, then the same machine code is generated in either case.



      int b;
      if(a < b)
      cmpl -4(%rbp), %eax
      jge .L2

      if(a <=b)
      cmpl -4(%rbp), %eax
      jg .L3





      share|improve this answer





















      • 9





        Note that this is specific to x86.

        – Michael Petrotta
        Aug 27 '12 at 2:17






      • 10





        I think you should use that if(a <=900) to demonstrate that it generates exactly the same asm :)

        – Lipis
        Aug 27 '12 at 2:22








      • 2





        @AdrianCornish Sorry.. I edited it.. it's more or less the same.. but if you change the second if to <=900 then the asm code will be exactly the same :) It's pretty much the same now.. but you know.. for the OCD :)

        – Lipis
        Aug 27 '12 at 2:25






      • 3





        @Boann That might get reduced to if (true) and eliminated completely.

        – Qsario
        Aug 27 '12 at 2:32






      • 4





        No one has pointed out that this optimization only applies to constant comparisons. I can guarantee it will not be done like this for comparing two variables.

        – Jonathon Reinhart
        Aug 27 '12 at 3:05
















      66














      I see that neither is faster. The compiler generates the same machine code in each condition with a different value.



      if(a < 901)
      cmpl $900, -4(%rbp)
      jg .L2

      if(a <=901)
      cmpl $901, -4(%rbp)
      jg .L3


      My example if is from GCC on x86_64 platform on Linux.



      Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.



      I noticed that if it is not a constant, then the same machine code is generated in either case.



      int b;
      if(a < b)
      cmpl -4(%rbp), %eax
      jge .L2

      if(a <=b)
      cmpl -4(%rbp), %eax
      jg .L3





      share|improve this answer





















      • 9





        Note that this is specific to x86.

        – Michael Petrotta
        Aug 27 '12 at 2:17






      • 10





        I think you should use that if(a <=900) to demonstrate that it generates exactly the same asm :)

        – Lipis
        Aug 27 '12 at 2:22








      • 2





        @AdrianCornish Sorry.. I edited it.. it's more or less the same.. but if you change the second if to <=900 then the asm code will be exactly the same :) It's pretty much the same now.. but you know.. for the OCD :)

        – Lipis
        Aug 27 '12 at 2:25






      • 3





        @Boann That might get reduced to if (true) and eliminated completely.

        – Qsario
        Aug 27 '12 at 2:32






      • 4





        No one has pointed out that this optimization only applies to constant comparisons. I can guarantee it will not be done like this for comparing two variables.

        – Jonathon Reinhart
        Aug 27 '12 at 3:05














      66












      66








      66







      I see that neither is faster. The compiler generates the same machine code in each condition with a different value.



      if(a < 901)
      cmpl $900, -4(%rbp)
      jg .L2

      if(a <=901)
      cmpl $901, -4(%rbp)
      jg .L3


      My example if is from GCC on x86_64 platform on Linux.



      Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.



      I noticed that if it is not a constant, then the same machine code is generated in either case.



      int b;
      if(a < b)
      cmpl -4(%rbp), %eax
      jge .L2

      if(a <=b)
      cmpl -4(%rbp), %eax
      jg .L3





      share|improve this answer















      I see that neither is faster. The compiler generates the same machine code in each condition with a different value.



      if(a < 901)
      cmpl $900, -4(%rbp)
      jg .L2

      if(a <=901)
      cmpl $901, -4(%rbp)
      jg .L3


      My example if is from GCC on x86_64 platform on Linux.



      Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.



      I noticed that if it is not a constant, then the same machine code is generated in either case.



      int b;
      if(a < b)
      cmpl -4(%rbp), %eax
      jge .L2

      if(a <=b)
      cmpl -4(%rbp), %eax
      jg .L3






      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Nov 22 '12 at 5:48









      Peter Mortensen

      13.6k1985111




      13.6k1985111










      answered Aug 27 '12 at 2:16









      Adrian CornishAdrian Cornish

      14.7k63151




      14.7k63151








      • 9





        Note that this is specific to x86.

        – Michael Petrotta
        Aug 27 '12 at 2:17






      • 10





        I think you should use that if(a <=900) to demonstrate that it generates exactly the same asm :)

        – Lipis
        Aug 27 '12 at 2:22








      • 2





        @AdrianCornish Sorry.. I edited it.. it's more or less the same.. but if you change the second if to <=900 then the asm code will be exactly the same :) It's pretty much the same now.. but you know.. for the OCD :)

        – Lipis
        Aug 27 '12 at 2:25






      • 3





        @Boann That might get reduced to if (true) and eliminated completely.

        – Qsario
        Aug 27 '12 at 2:32






      • 4





        No one has pointed out that this optimization only applies to constant comparisons. I can guarantee it will not be done like this for comparing two variables.

        – Jonathon Reinhart
        Aug 27 '12 at 3:05














      • 9





        Note that this is specific to x86.

        – Michael Petrotta
        Aug 27 '12 at 2:17






      • 10





        I think you should use that if(a <=900) to demonstrate that it generates exactly the same asm :)

        – Lipis
        Aug 27 '12 at 2:22








      • 2





        @AdrianCornish Sorry.. I edited it.. it's more or less the same.. but if you change the second if to <=900 then the asm code will be exactly the same :) It's pretty much the same now.. but you know.. for the OCD :)

        – Lipis
        Aug 27 '12 at 2:25






      • 3





        @Boann That might get reduced to if (true) and eliminated completely.

        – Qsario
        Aug 27 '12 at 2:32






      • 4





        No one has pointed out that this optimization only applies to constant comparisons. I can guarantee it will not be done like this for comparing two variables.

        – Jonathon Reinhart
        Aug 27 '12 at 3:05








      9




      9





      Note that this is specific to x86.

      – Michael Petrotta
      Aug 27 '12 at 2:17





      Note that this is specific to x86.

      – Michael Petrotta
      Aug 27 '12 at 2:17




      10




      10





      I think you should use that if(a <=900) to demonstrate that it generates exactly the same asm :)

      – Lipis
      Aug 27 '12 at 2:22







      I think you should use that if(a <=900) to demonstrate that it generates exactly the same asm :)

      – Lipis
      Aug 27 '12 at 2:22






      2




      2





      @AdrianCornish Sorry.. I edited it.. it's more or less the same.. but if you change the second if to <=900 then the asm code will be exactly the same :) It's pretty much the same now.. but you know.. for the OCD :)

      – Lipis
      Aug 27 '12 at 2:25





      @AdrianCornish Sorry.. I edited it.. it's more or less the same.. but if you change the second if to <=900 then the asm code will be exactly the same :) It's pretty much the same now.. but you know.. for the OCD :)

      – Lipis
      Aug 27 '12 at 2:25




      3




      3





      @Boann That might get reduced to if (true) and eliminated completely.

      – Qsario
      Aug 27 '12 at 2:32





      @Boann That might get reduced to if (true) and eliminated completely.

      – Qsario
      Aug 27 '12 at 2:32




      4




      4





      No one has pointed out that this optimization only applies to constant comparisons. I can guarantee it will not be done like this for comparing two variables.

      – Jonathon Reinhart
      Aug 27 '12 at 3:05





      No one has pointed out that this optimization only applies to constant comparisons. I can guarantee it will not be done like this for comparing two variables.

      – Jonathon Reinhart
      Aug 27 '12 at 3:05











      50














      For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:



      int compare_strict(double a, double b) { return a < b; }


      On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.



      Now consider this function instead:



      int compare_loose(double a, double b) { return a <= b; }


      This requires the same work as compare_strict above, but now there's two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.



      You might think that the compiler could optimize the second function like so:



      int compare_loose(double a, double b) { return ! (a > b); }


      However this will incorrectly handle NaNs. NaN1 <= NaN2 and NaN1 > NaN2 need to both evaluate to false.






      share|improve this answer


























      • Luckily it doesn't work like this on x86 (x87). fucomip sets ZF and CF.

        – Jonathon Reinhart
        Aug 27 '12 at 20:30






      • 3





        @JonathonReinhart: I think you're misunderstanding what the PowerPC is doing -- the condition register cr is the equivalent to flags like ZF and CF on the x86. (Although the CR is more flexible.) What the poster is talking about is moving the result to a GPR: which takes two instructions on PowerPC, but x86 has a conditional move instruction.

        – Dietrich Epp
        Aug 28 '12 at 6:19











      • @DietrichEpp What I meant to add after my statement was: Which you can then immediately jump based upon the value of EFLAGS. Sorry for not being clear.

        – Jonathon Reinhart
        Aug 28 '12 at 7:16






      • 1





        @JonathonReinhart: Yes, and you can also immediately jump based on the value of the CR. The answer is not talking about jumping, which is where the extra instructions come from.

        – Dietrich Epp
        Aug 28 '12 at 7:38
















      50














      For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:



      int compare_strict(double a, double b) { return a < b; }


      On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.



      Now consider this function instead:



      int compare_loose(double a, double b) { return a <= b; }


      This requires the same work as compare_strict above, but now there's two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.



      You might think that the compiler could optimize the second function like so:



      int compare_loose(double a, double b) { return ! (a > b); }


      However this will incorrectly handle NaNs. NaN1 <= NaN2 and NaN1 > NaN2 need to both evaluate to false.






      share|improve this answer


























      • Luckily it doesn't work like this on x86 (x87). fucomip sets ZF and CF.

        – Jonathon Reinhart
        Aug 27 '12 at 20:30






      • 3





        @JonathonReinhart: I think you're misunderstanding what the PowerPC is doing -- the condition register cr is the equivalent to flags like ZF and CF on the x86. (Although the CR is more flexible.) What the poster is talking about is moving the result to a GPR: which takes two instructions on PowerPC, but x86 has a conditional move instruction.

        – Dietrich Epp
        Aug 28 '12 at 6:19











      • @DietrichEpp What I meant to add after my statement was: Which you can then immediately jump based upon the value of EFLAGS. Sorry for not being clear.

        – Jonathon Reinhart
        Aug 28 '12 at 7:16






      • 1





        @JonathonReinhart: Yes, and you can also immediately jump based on the value of the CR. The answer is not talking about jumping, which is where the extra instructions come from.

        – Dietrich Epp
        Aug 28 '12 at 7:38














      50












      50








      50







      For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:



      int compare_strict(double a, double b) { return a < b; }


      On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.



      Now consider this function instead:



      int compare_loose(double a, double b) { return a <= b; }


      This requires the same work as compare_strict above, but now there's two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.



      You might think that the compiler could optimize the second function like so:



      int compare_loose(double a, double b) { return ! (a > b); }


      However this will incorrectly handle NaNs. NaN1 <= NaN2 and NaN1 > NaN2 need to both evaluate to false.






      share|improve this answer















      For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:



      int compare_strict(double a, double b) { return a < b; }


      On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.



      Now consider this function instead:



      int compare_loose(double a, double b) { return a <= b; }


      This requires the same work as compare_strict above, but now there's two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.



      You might think that the compiler could optimize the second function like so:



      int compare_loose(double a, double b) { return ! (a > b); }


      However this will incorrectly handle NaNs. NaN1 <= NaN2 and NaN1 > NaN2 need to both evaluate to false.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Aug 27 '12 at 18:38

























      answered Aug 27 '12 at 18:32









      ridiculous_fishridiculous_fish

      9,1773036




      9,1773036













      • Luckily it doesn't work like this on x86 (x87). fucomip sets ZF and CF.

        – Jonathon Reinhart
        Aug 27 '12 at 20:30






      • 3





        @JonathonReinhart: I think you're misunderstanding what the PowerPC is doing -- the condition register cr is the equivalent to flags like ZF and CF on the x86. (Although the CR is more flexible.) What the poster is talking about is moving the result to a GPR: which takes two instructions on PowerPC, but x86 has a conditional move instruction.

        – Dietrich Epp
        Aug 28 '12 at 6:19











      • @DietrichEpp What I meant to add after my statement was: Which you can then immediately jump based upon the value of EFLAGS. Sorry for not being clear.

        – Jonathon Reinhart
        Aug 28 '12 at 7:16






      • 1





        @JonathonReinhart: Yes, and you can also immediately jump based on the value of the CR. The answer is not talking about jumping, which is where the extra instructions come from.

        – Dietrich Epp
        Aug 28 '12 at 7:38



















      • Luckily it doesn't work like this on x86 (x87). fucomip sets ZF and CF.

        – Jonathon Reinhart
        Aug 27 '12 at 20:30






      • 3





        @JonathonReinhart: I think you're misunderstanding what the PowerPC is doing -- the condition register cr is the equivalent to flags like ZF and CF on the x86. (Although the CR is more flexible.) What the poster is talking about is moving the result to a GPR: which takes two instructions on PowerPC, but x86 has a conditional move instruction.

        – Dietrich Epp
        Aug 28 '12 at 6:19











      • @DietrichEpp What I meant to add after my statement was: Which you can then immediately jump based upon the value of EFLAGS. Sorry for not being clear.

        – Jonathon Reinhart
        Aug 28 '12 at 7:16






      • 1





        @JonathonReinhart: Yes, and you can also immediately jump based on the value of the CR. The answer is not talking about jumping, which is where the extra instructions come from.

        – Dietrich Epp
        Aug 28 '12 at 7:38

















      Luckily it doesn't work like this on x86 (x87). fucomip sets ZF and CF.

      – Jonathon Reinhart
      Aug 27 '12 at 20:30





      Luckily it doesn't work like this on x86 (x87). fucomip sets ZF and CF.

      – Jonathon Reinhart
      Aug 27 '12 at 20:30




      3




      3





      @JonathonReinhart: I think you're misunderstanding what the PowerPC is doing -- the condition register cr is the equivalent to flags like ZF and CF on the x86. (Although the CR is more flexible.) What the poster is talking about is moving the result to a GPR: which takes two instructions on PowerPC, but x86 has a conditional move instruction.

      – Dietrich Epp
      Aug 28 '12 at 6:19





      @JonathonReinhart: I think you're misunderstanding what the PowerPC is doing -- the condition register cr is the equivalent to flags like ZF and CF on the x86. (Although the CR is more flexible.) What the poster is talking about is moving the result to a GPR: which takes two instructions on PowerPC, but x86 has a conditional move instruction.

      – Dietrich Epp
      Aug 28 '12 at 6:19













      @DietrichEpp What I meant to add after my statement was: Which you can then immediately jump based upon the value of EFLAGS. Sorry for not being clear.

      – Jonathon Reinhart
      Aug 28 '12 at 7:16





      @DietrichEpp What I meant to add after my statement was: Which you can then immediately jump based upon the value of EFLAGS. Sorry for not being clear.

      – Jonathon Reinhart
      Aug 28 '12 at 7:16




      1




      1





      @JonathonReinhart: Yes, and you can also immediately jump based on the value of the CR. The answer is not talking about jumping, which is where the extra instructions come from.

      – Dietrich Epp
      Aug 28 '12 at 7:38





      @JonathonReinhart: Yes, and you can also immediately jump based on the value of the CR. The answer is not talking about jumping, which is where the extra instructions come from.

      – Dietrich Epp
      Aug 28 '12 at 7:38











      34














      Maybe the author of that unnamed book has read that a > 0 runs faster than a >= 1 and thinks that is true universally.



      But it is because a 0 is involved (because CMP can, depending on the architecture, replaced e.g. with OR) and not because of the <.






      share|improve this answer



















      • 1





        Sure, in a "debug" build, but it would take a bad compiler for (a >= 1) to run slower than (a > 0), since the former can be trivially transformed to the latter by the optimizer..

        – BeeOnRope
        Jun 16 '16 at 2:22








      • 1





        @BeeOnRope Sometimes I am surprised what complicated things an optimizer can optimize and on what easy things it fails to do so.

        – glglgl
        Jun 16 '16 at 7:31






      • 1





        Indeed, and it's always worth checking the asm output for the very few functions where it would matter. That said, the above transformation is very basic and has been performed even in simple compilers for decades.

        – BeeOnRope
        Jun 16 '16 at 20:27
















      34














      Maybe the author of that unnamed book has read that a > 0 runs faster than a >= 1 and thinks that is true universally.



      But it is because a 0 is involved (because CMP can, depending on the architecture, replaced e.g. with OR) and not because of the <.






      share|improve this answer



















      • 1





        Sure, in a "debug" build, but it would take a bad compiler for (a >= 1) to run slower than (a > 0), since the former can be trivially transformed to the latter by the optimizer..

        – BeeOnRope
        Jun 16 '16 at 2:22








      • 1





        @BeeOnRope Sometimes I am surprised what complicated things an optimizer can optimize and on what easy things it fails to do so.

        – glglgl
        Jun 16 '16 at 7:31






      • 1





        Indeed, and it's always worth checking the asm output for the very few functions where it would matter. That said, the above transformation is very basic and has been performed even in simple compilers for decades.

        – BeeOnRope
        Jun 16 '16 at 20:27














      34












      34








      34







      Maybe the author of that unnamed book has read that a > 0 runs faster than a >= 1 and thinks that is true universally.



      But it is because a 0 is involved (because CMP can, depending on the architecture, replaced e.g. with OR) and not because of the <.






      share|improve this answer













      Maybe the author of that unnamed book has read that a > 0 runs faster than a >= 1 and thinks that is true universally.



      But it is because a 0 is involved (because CMP can, depending on the architecture, replaced e.g. with OR) and not because of the <.







      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered Aug 27 '12 at 13:05









      glglglglglgl

      66.8k792165




      66.8k792165








      • 1





        Sure, in a "debug" build, but it would take a bad compiler for (a >= 1) to run slower than (a > 0), since the former can be trivially transformed to the latter by the optimizer..

        – BeeOnRope
        Jun 16 '16 at 2:22








      • 1





        @BeeOnRope Sometimes I am surprised what complicated things an optimizer can optimize and on what easy things it fails to do so.

        – glglgl
        Jun 16 '16 at 7:31






      • 1





        Indeed, and it's always worth checking the asm output for the very few functions where it would matter. That said, the above transformation is very basic and has been performed even in simple compilers for decades.

        – BeeOnRope
        Jun 16 '16 at 20:27














      • 1





        Sure, in a "debug" build, but it would take a bad compiler for (a >= 1) to run slower than (a > 0), since the former can be trivially transformed to the latter by the optimizer..

        – BeeOnRope
        Jun 16 '16 at 2:22








      • 1





        @BeeOnRope Sometimes I am surprised what complicated things an optimizer can optimize and on what easy things it fails to do so.

        – glglgl
        Jun 16 '16 at 7:31






      • 1





        Indeed, and it's always worth checking the asm output for the very few functions where it would matter. That said, the above transformation is very basic and has been performed even in simple compilers for decades.

        – BeeOnRope
        Jun 16 '16 at 20:27








      1




      1





      Sure, in a "debug" build, but it would take a bad compiler for (a >= 1) to run slower than (a > 0), since the former can be trivially transformed to the latter by the optimizer..

      – BeeOnRope
      Jun 16 '16 at 2:22







      Sure, in a "debug" build, but it would take a bad compiler for (a >= 1) to run slower than (a > 0), since the former can be trivially transformed to the latter by the optimizer..

      – BeeOnRope
      Jun 16 '16 at 2:22






      1




      1





      @BeeOnRope Sometimes I am surprised what complicated things an optimizer can optimize and on what easy things it fails to do so.

      – glglgl
      Jun 16 '16 at 7:31





      @BeeOnRope Sometimes I am surprised what complicated things an optimizer can optimize and on what easy things it fails to do so.

      – glglgl
      Jun 16 '16 at 7:31




      1




      1





      Indeed, and it's always worth checking the asm output for the very few functions where it would matter. That said, the above transformation is very basic and has been performed even in simple compilers for decades.

      – BeeOnRope
      Jun 16 '16 at 20:27





      Indeed, and it's always worth checking the asm output for the very few functions where it would matter. That said, the above transformation is very basic and has been performed even in simple compilers for decades.

      – BeeOnRope
      Jun 16 '16 at 20:27











      30














      At the very least, if this were true a compiler could trivially optimise a <= b to !(a > b), and so even if the comparison itself were actually slower, with all but the most naive compiler you would not notice a difference.






      share|improve this answer


























      • Why !(a>b) is optimised version of a <=b . Isnt !(a>b) 2 operation in one ?

        – Abhishek Singh
        Apr 7 '15 at 11:33






      • 3





        @AbhishekSingh NOT is just made by other instruction (je vs. jne)

        – Pavel Gatnar
        Apr 14 '15 at 16:03
















      30














      At the very least, if this were true a compiler could trivially optimise a <= b to !(a > b), and so even if the comparison itself were actually slower, with all but the most naive compiler you would not notice a difference.






      share|improve this answer


























      • Why !(a>b) is optimised version of a <=b . Isnt !(a>b) 2 operation in one ?

        – Abhishek Singh
        Apr 7 '15 at 11:33






      • 3





        @AbhishekSingh NOT is just made by other instruction (je vs. jne)

        – Pavel Gatnar
        Apr 14 '15 at 16:03














      30












      30








      30







      At the very least, if this were true a compiler could trivially optimise a <= b to !(a > b), and so even if the comparison itself were actually slower, with all but the most naive compiler you would not notice a difference.






      share|improve this answer















      At the very least, if this were true a compiler could trivially optimise a <= b to !(a > b), and so even if the comparison itself were actually slower, with all but the most naive compiler you would not notice a difference.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Nov 30 '12 at 0:22

























      answered Aug 27 '12 at 9:23









      Eliot BallEliot Ball

      652511




      652511













      • Why !(a>b) is optimised version of a <=b . Isnt !(a>b) 2 operation in one ?

        – Abhishek Singh
        Apr 7 '15 at 11:33






      • 3





        @AbhishekSingh NOT is just made by other instruction (je vs. jne)

        – Pavel Gatnar
        Apr 14 '15 at 16:03



















      • Why !(a>b) is optimised version of a <=b . Isnt !(a>b) 2 operation in one ?

        – Abhishek Singh
        Apr 7 '15 at 11:33






      • 3





        @AbhishekSingh NOT is just made by other instruction (je vs. jne)

        – Pavel Gatnar
        Apr 14 '15 at 16:03

















      Why !(a>b) is optimised version of a <=b . Isnt !(a>b) 2 operation in one ?

      – Abhishek Singh
      Apr 7 '15 at 11:33





      Why !(a>b) is optimised version of a <=b . Isnt !(a>b) 2 operation in one ?

      – Abhishek Singh
      Apr 7 '15 at 11:33




      3




      3





      @AbhishekSingh NOT is just made by other instruction (je vs. jne)

      – Pavel Gatnar
      Apr 14 '15 at 16:03





      @AbhishekSingh NOT is just made by other instruction (je vs. jne)

      – Pavel Gatnar
      Apr 14 '15 at 16:03











      15














      They have the same speed. Maybe in some special architecture what he/she said is right, but in the x86 family at least I know they are the same. Because for doing this the CPU will do a substraction (a - b) and then check the flags of the flag register. Two bits of that register are called ZF (zero Flag) and SF (sign flag), and it is done in one cycle, because it will do it with one mask operation.






      share|improve this answer






























        15














        They have the same speed. Maybe in some special architecture what he/she said is right, but in the x86 family at least I know they are the same. Because for doing this the CPU will do a substraction (a - b) and then check the flags of the flag register. Two bits of that register are called ZF (zero Flag) and SF (sign flag), and it is done in one cycle, because it will do it with one mask operation.






        share|improve this answer




























          15












          15








          15







          They have the same speed. Maybe in some special architecture what he/she said is right, but in the x86 family at least I know they are the same. Because for doing this the CPU will do a substraction (a - b) and then check the flags of the flag register. Two bits of that register are called ZF (zero Flag) and SF (sign flag), and it is done in one cycle, because it will do it with one mask operation.






          share|improve this answer















          They have the same speed. Maybe in some special architecture what he/she said is right, but in the x86 family at least I know they are the same. Because for doing this the CPU will do a substraction (a - b) and then check the flags of the flag register. Two bits of that register are called ZF (zero Flag) and SF (sign flag), and it is done in one cycle, because it will do it with one mask operation.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 22 '12 at 5:52









          Peter Mortensen

          13.6k1985111




          13.6k1985111










          answered Aug 27 '12 at 8:25









          MasoudMasoud

          64741127




          64741127























              13














              This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.



              That would be pretty unusual though, as the compiler could work around it, making it irrelevant.






              share|improve this answer



















              • 1





                IF there was a difference in the cyles. 1) it would not be detectable. 2) Any compiler worth its salt would already be making the transformation from the slow form to the faster form without changing the meaning of the code. So the resulting instruction planted would be identical.

                – Martin York
                Aug 27 '12 at 7:00











              • Agreed completely, it would be a pretty trivial and silly difference in any case. Certainly nothing to mention in a book that should be platform agnostic.

                – Telgin
                Aug 28 '12 at 3:46











              • @lttlrck: I get it. Took me a while (silly me). No they are not detectable because there are so many other things happening that make their measurement imposable. Processor stalls/ cache misses/ signals/ process swapping. Thus in a normal OS situation things on the single cycle level can not be physically measurable. If you can eliminate all that interference from you measurements (run it on a chip with on-board memory and no OS) then you still have granularity of your timers to worry about but theoretically if you run it long enough you could see something.

                – Martin York
                Aug 29 '12 at 6:57
















              13














              This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.



              That would be pretty unusual though, as the compiler could work around it, making it irrelevant.






              share|improve this answer



















              • 1





                IF there was a difference in the cyles. 1) it would not be detectable. 2) Any compiler worth its salt would already be making the transformation from the slow form to the faster form without changing the meaning of the code. So the resulting instruction planted would be identical.

                – Martin York
                Aug 27 '12 at 7:00











              • Agreed completely, it would be a pretty trivial and silly difference in any case. Certainly nothing to mention in a book that should be platform agnostic.

                – Telgin
                Aug 28 '12 at 3:46











              • @lttlrck: I get it. Took me a while (silly me). No they are not detectable because there are so many other things happening that make their measurement imposable. Processor stalls/ cache misses/ signals/ process swapping. Thus in a normal OS situation things on the single cycle level can not be physically measurable. If you can eliminate all that interference from you measurements (run it on a chip with on-board memory and no OS) then you still have granularity of your timers to worry about but theoretically if you run it long enough you could see something.

                – Martin York
                Aug 29 '12 at 6:57














              13












              13








              13







              This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.



              That would be pretty unusual though, as the compiler could work around it, making it irrelevant.






              share|improve this answer













              This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.



              That would be pretty unusual though, as the compiler could work around it, making it irrelevant.







              share|improve this answer












              share|improve this answer



              share|improve this answer










              answered Aug 27 '12 at 2:15









              TelginTelgin

              1,464810




              1,464810








              • 1





                IF there was a difference in the cyles. 1) it would not be detectable. 2) Any compiler worth its salt would already be making the transformation from the slow form to the faster form without changing the meaning of the code. So the resulting instruction planted would be identical.

                – Martin York
                Aug 27 '12 at 7:00











              • Agreed completely, it would be a pretty trivial and silly difference in any case. Certainly nothing to mention in a book that should be platform agnostic.

                – Telgin
                Aug 28 '12 at 3:46











              • @lttlrck: I get it. Took me a while (silly me). No they are not detectable because there are so many other things happening that make their measurement imposable. Processor stalls/ cache misses/ signals/ process swapping. Thus in a normal OS situation things on the single cycle level can not be physically measurable. If you can eliminate all that interference from you measurements (run it on a chip with on-board memory and no OS) then you still have granularity of your timers to worry about but theoretically if you run it long enough you could see something.

                – Martin York
                Aug 29 '12 at 6:57














              • 1





                IF there was a difference in the cyles. 1) it would not be detectable. 2) Any compiler worth its salt would already be making the transformation from the slow form to the faster form without changing the meaning of the code. So the resulting instruction planted would be identical.

                – Martin York
                Aug 27 '12 at 7:00











              • Agreed completely, it would be a pretty trivial and silly difference in any case. Certainly nothing to mention in a book that should be platform agnostic.

                – Telgin
                Aug 28 '12 at 3:46











              • @lttlrck: I get it. Took me a while (silly me). No they are not detectable because there are so many other things happening that make their measurement imposable. Processor stalls/ cache misses/ signals/ process swapping. Thus in a normal OS situation things on the single cycle level can not be physically measurable. If you can eliminate all that interference from you measurements (run it on a chip with on-board memory and no OS) then you still have granularity of your timers to worry about but theoretically if you run it long enough you could see something.

                – Martin York
                Aug 29 '12 at 6:57








              1




              1





              IF there was a difference in the cyles. 1) it would not be detectable. 2) Any compiler worth its salt would already be making the transformation from the slow form to the faster form without changing the meaning of the code. So the resulting instruction planted would be identical.

              – Martin York
              Aug 27 '12 at 7:00





              IF there was a difference in the cyles. 1) it would not be detectable. 2) Any compiler worth its salt would already be making the transformation from the slow form to the faster form without changing the meaning of the code. So the resulting instruction planted would be identical.

              – Martin York
              Aug 27 '12 at 7:00













              Agreed completely, it would be a pretty trivial and silly difference in any case. Certainly nothing to mention in a book that should be platform agnostic.

              – Telgin
              Aug 28 '12 at 3:46





              Agreed completely, it would be a pretty trivial and silly difference in any case. Certainly nothing to mention in a book that should be platform agnostic.

              – Telgin
              Aug 28 '12 at 3:46













              @lttlrck: I get it. Took me a while (silly me). No they are not detectable because there are so many other things happening that make their measurement imposable. Processor stalls/ cache misses/ signals/ process swapping. Thus in a normal OS situation things on the single cycle level can not be physically measurable. If you can eliminate all that interference from you measurements (run it on a chip with on-board memory and no OS) then you still have granularity of your timers to worry about but theoretically if you run it long enough you could see something.

              – Martin York
              Aug 29 '12 at 6:57





              @lttlrck: I get it. Took me a while (silly me). No they are not detectable because there are so many other things happening that make their measurement imposable. Processor stalls/ cache misses/ signals/ process swapping. Thus in a normal OS situation things on the single cycle level can not be physically measurable. If you can eliminate all that interference from you measurements (run it on a chip with on-board memory and no OS) then you still have granularity of your timers to worry about but theoretically if you run it long enough you could see something.

              – Martin York
              Aug 29 '12 at 6:57











              11















              TL;DR answer



              For most combinations of architecture, compiler and language it will not be quicker.



              Full answer



              Other answers have concentrated on x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture specific, and is as likely to be an anti-optimisation as it is to be an optimisation.



              As such, I would suggest that this sort of micro-optimisation is an example of cargo cult programming rather than best software engineering practice.



              There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.



              Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that in practice, no comparison had any advantage over any other. Worst case though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single byte instruction which took a single cycle to run, so had the smallest overhead possible.



              Whether or not a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture specific micro-optimisations, otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.






              share|improve this answer






























                11















                TL;DR answer



                For most combinations of architecture, compiler and language it will not be quicker.



                Full answer



                Other answers have concentrated on x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture specific, and is as likely to be an anti-optimisation as it is to be an optimisation.



                As such, I would suggest that this sort of micro-optimisation is an example of cargo cult programming rather than best software engineering practice.



                There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.



                Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that in practice, no comparison had any advantage over any other. Worst case though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single byte instruction which took a single cycle to run, so had the smallest overhead possible.



                Whether or not a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture specific micro-optimisations, otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.






                share|improve this answer




























                  11












                  11








                  11








                  TL;DR answer



                  For most combinations of architecture, compiler and language it will not be quicker.



                  Full answer



                  Other answers have concentrated on x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture specific, and is as likely to be an anti-optimisation as it is to be an optimisation.



                  As such, I would suggest that this sort of micro-optimisation is an example of cargo cult programming rather than best software engineering practice.



                  There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.



                  Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that in practice, no comparison had any advantage over any other. Worst case though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single byte instruction which took a single cycle to run, so had the smallest overhead possible.



                  Whether or not a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture specific micro-optimisations, otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.






                  share|improve this answer
















                  TL;DR answer



                  For most combinations of architecture, compiler and language it will not be quicker.



                  Full answer



                  Other answers have concentrated on x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture specific, and is as likely to be an anti-optimisation as it is to be an optimisation.



                  As such, I would suggest that this sort of micro-optimisation is an example of cargo cult programming rather than best software engineering practice.



                  There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.



                  Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that in practice, no comparison had any advantage over any other. Worst case though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single byte instruction which took a single cycle to run, so had the smallest overhead possible.



                  Whether or not a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture specific micro-optimisations, otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.







                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Sep 14 '18 at 11:28

























                  answered Aug 31 '12 at 18:33









                  Mark BoothMark Booth

                  4,8704780




                  4,8704780























                      6














                      You should not be able to notice the difference even if there is any. Besides, in practice, you'll have to do an additional a + 1 or a - 1 to make the condition stand unless you're going to use some magic constants, which is a very bad practice by all means.






                      share|improve this answer



















                      • 1





                        What's the bad practice? Incrementing or decrementing a counter? How do you store index notation then?

                        – jcolebrand
                        Aug 27 '12 at 14:22








                      • 5





                        He means if you're doing comparison of 2 variable types. Of course it's trivial if you're setting the value for a loop or something. But if you have x <= y, and y is unknown, it would be slower to 'optimize' it to x < y + 1

                        – JustinDanielson
                        Aug 27 '12 at 21:48













                      • @JustinDanielson agreed. Not to mention ugly, confusing, etc.

                        – Jonathon Reinhart
                        Aug 27 '12 at 23:49
















                      6














                      You should not be able to notice the difference even if there is any. Besides, in practice, you'll have to do an additional a + 1 or a - 1 to make the condition stand unless you're going to use some magic constants, which is a very bad practice by all means.






                      share|improve this answer



















                      • 1





                        What's the bad practice? Incrementing or decrementing a counter? How do you store index notation then?

                        – jcolebrand
                        Aug 27 '12 at 14:22








                      • 5





                        He means if you're doing comparison of 2 variable types. Of course it's trivial if you're setting the value for a loop or something. But if you have x <= y, and y is unknown, it would be slower to 'optimize' it to x < y + 1

                        – JustinDanielson
                        Aug 27 '12 at 21:48













                      • @JustinDanielson agreed. Not to mention ugly, confusing, etc.

                        – Jonathon Reinhart
                        Aug 27 '12 at 23:49














                      6












                      6








                      6







                      You should not be able to notice the difference even if there is any. Besides, in practice, you'll have to do an additional a + 1 or a - 1 to make the condition stand unless you're going to use some magic constants, which is a very bad practice by all means.






                      share|improve this answer













                      You should not be able to notice the difference even if there is any. Besides, in practice, you'll have to do an additional a + 1 or a - 1 to make the condition stand unless you're going to use some magic constants, which is a very bad practice by all means.







                      share|improve this answer












                      share|improve this answer



                      share|improve this answer










                      answered Aug 27 '12 at 2:17









                      shinkoushinkou

                      4,57811527




                      4,57811527








                      • 1





                        What's the bad practice? Incrementing or decrementing a counter? How do you store index notation then?

                        – jcolebrand
                        Aug 27 '12 at 14:22








                      • 5





                        He means if you're doing comparison of 2 variable types. Of course it's trivial if you're setting the value for a loop or something. But if you have x <= y, and y is unknown, it would be slower to 'optimize' it to x < y + 1

                        – JustinDanielson
                        Aug 27 '12 at 21:48













                      • @JustinDanielson agreed. Not to mention ugly, confusing, etc.

                        – Jonathon Reinhart
                        Aug 27 '12 at 23:49














                      • 1





                        What's the bad practice? Incrementing or decrementing a counter? How do you store index notation then?

                        – jcolebrand
                        Aug 27 '12 at 14:22








                      • 5





                        He means if you're doing comparison of 2 variable types. Of course it's trivial if you're setting the value for a loop or something. But if you have x <= y, and y is unknown, it would be slower to 'optimize' it to x < y + 1

                        – JustinDanielson
                        Aug 27 '12 at 21:48













                      • @JustinDanielson agreed. Not to mention ugly, confusing, etc.

                        – Jonathon Reinhart
                        Aug 27 '12 at 23:49








                      1




                      1





                      What's the bad practice? Incrementing or decrementing a counter? How do you store index notation then?

                      – jcolebrand
                      Aug 27 '12 at 14:22







                      What's the bad practice? Incrementing or decrementing a counter? How do you store index notation then?

                      – jcolebrand
                      Aug 27 '12 at 14:22






                      5




                      5





                      He means if you're doing comparison of 2 variable types. Of course it's trivial if you're setting the value for a loop or something. But if you have x <= y, and y is unknown, it would be slower to 'optimize' it to x < y + 1

                      – JustinDanielson
                      Aug 27 '12 at 21:48







                      He means if you're doing comparison of 2 variable types. Of course it's trivial if you're setting the value for a loop or something. But if you have x <= y, and y is unknown, it would be slower to 'optimize' it to x < y + 1

                      – JustinDanielson
                      Aug 27 '12 at 21:48















                      @JustinDanielson agreed. Not to mention ugly, confusing, etc.

                      – Jonathon Reinhart
                      Aug 27 '12 at 23:49





                      @JustinDanielson agreed. Not to mention ugly, confusing, etc.

                      – Jonathon Reinhart
                      Aug 27 '12 at 23:49











                      3














                      You could say that line is correct in most scripting languages, since the extra character results in slightly slower code processing.
                      However, as the top answer pointed out, it should have no effect in C++, and anything being done with a scripting language probably isn't that concerned about optimization.






                      share|improve this answer
























                      • I somewhat disagree. In competitive programming, scripting languages often offer the quickest solution to a problem, but correct techniques (read: optimization) must be applied to get a correct solution.

                        – Tyler Crompton
                        Sep 5 '12 at 0:59
















                      3














                      You could say that line is correct in most scripting languages, since the extra character results in slightly slower code processing.
                      However, as the top answer pointed out, it should have no effect in C++, and anything being done with a scripting language probably isn't that concerned about optimization.






                      share|improve this answer
























                      • I somewhat disagree. In competitive programming, scripting languages often offer the quickest solution to a problem, but correct techniques (read: optimization) must be applied to get a correct solution.

                        – Tyler Crompton
                        Sep 5 '12 at 0:59














                      3












                      3








                      3







                      You could say that line is correct in most scripting languages, since the extra character results in slightly slower code processing.
                      However, as the top answer pointed out, it should have no effect in C++, and anything being done with a scripting language probably isn't that concerned about optimization.






                      share|improve this answer













                      You could say that line is correct in most scripting languages, since the extra character results in slightly slower code processing.
                      However, as the top answer pointed out, it should have no effect in C++, and anything being done with a scripting language probably isn't that concerned about optimization.







                      share|improve this answer












                      share|improve this answer



                      share|improve this answer










                      answered Aug 29 '12 at 2:47









                      EckstersEcksters

                      548620




                      548620













                      • I somewhat disagree. In competitive programming, scripting languages often offer the quickest solution to a problem, but correct techniques (read: optimization) must be applied to get a correct solution.

                        – Tyler Crompton
                        Sep 5 '12 at 0:59



















                      • I somewhat disagree. In competitive programming, scripting languages often offer the quickest solution to a problem, but correct techniques (read: optimization) must be applied to get a correct solution.

                        – Tyler Crompton
                        Sep 5 '12 at 0:59

















                      I somewhat disagree. In competitive programming, scripting languages often offer the quickest solution to a problem, but correct techniques (read: optimization) must be applied to get a correct solution.

                      – Tyler Crompton
                      Sep 5 '12 at 0:59





                      I somewhat disagree. In competitive programming, scripting languages often offer the quickest solution to a problem, but correct techniques (read: optimization) must be applied to get a correct solution.

                      – Tyler Crompton
                      Sep 5 '12 at 0:59











                      1














                      When I wrote this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operand have a shorter 1-byte encoding for -128..127.



                      For ARM and especially AArch64, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp w0, #0x00f000 would be encodeable, while cmp w0, #0x00effff might not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for AArch64.





                      < vs. <= in general, including for runtime-variable conditions



                      In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.



                      But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.



                      As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.



                      Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)



                      void foo(unsigned size) {
                      unsigned upper_bound = size - 1; // or any calculation that could produce UINT_MAX
                      for(unsigned i=0 ; i <= upper_bound ; i++)
                      ...


                      Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.



                      (A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)



                      In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.



                      Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).



                      But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (Iterating over all unsigned integers in a for loop C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)



                      With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.



                      Example: sum of integers from 1 to n



                      Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.



                      unsigned sum_1_to_n_finite(unsigned n) {
                      unsigned total = 0;
                      for (unsigned i = 0 ; i < n+1 ; ++i)
                      total += i;
                      return total;
                      }


                      x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer



                       # clang7.0 -O3 closed-form
                      cmp edi, -1 # n passed in EDI: x86-64 System V calling convention
                      je .LBB1_1 # if (n == UINT_MAX) return 0; // C++ loop runs 0 times
                      # else fall through into the closed-form calc
                      mov ecx, edi # zero-extend n into RCX
                      lea eax, [rdi - 1] # n-1
                      imul rax, rcx # n * (n-1) # 64-bit
                      shr rax # n * (n-1) / 2
                      add eax, edi # n + (stuff / 2) = n * (n+1) / 2 # truncated to 32-bit
                      ret # computed without possible overflow of the product before right shifting
                      .LBB1_1:
                      xor eax, eax
                      ret


                      But for the naive version, we just get a dumb loop from clang.



                      unsigned sum_1_to_n_naive(unsigned n) {
                      unsigned total = 0;
                      for (unsigned i = 0 ; i<=n ; ++i)
                      total += i;
                      return total;
                      }


                      # clang7.0 -O3
                      sum_1_to_n(unsigned int):
                      xor ecx, ecx # i = 0
                      xor eax, eax # retval = 0
                      .LBB0_1: # do {
                      add eax, ecx # retval += i
                      add ecx, 1 # ++1
                      cmp ecx, edi
                      jbe .LBB0_1 # } while( i<n );
                      ret




                      GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.



                      # "naive" inner loop
                      .L3:
                      add eax, 1 # do {
                      paddd xmm0, xmm1 # vect_total_4.6, vect_vec_iv_.5
                      paddd xmm1, xmm2 # vect_vec_iv_.5, tmp114
                      cmp edx, eax # bnd.1, ivtmp.14 # bound and induction-variable tmp, I think.
                      ja .L3 #, # }while( n > i )

                      "finite" inner loop
                      # before the loop:
                      # xmm0 = 0 = totals
                      # xmm1 = {0,1,2,3} = i
                      # xmm2 = set1_epi32(4)
                      .L13: # do {
                      add eax, 1 # i++
                      paddd xmm0, xmm1 # total[0..3] += i[0..3]
                      paddd xmm1, xmm2 # i[0..3] += 4
                      cmp eax, edx
                      jne .L13 # }while( i != upper_limit );

                      then horizontal sum xmm0
                      and peeled cleanup for the last n%3 iterations, or something.


                      It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.



                      BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1/jnz instead of add eax,1/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.






                      share|improve this answer


























                      • Nice contrived example. What about your other comment about a potential effect on out of order execution due to EFLAGS use? Is it purely theoretical or can it actually happen that a JB leads to a better pipeline than a JBE?

                        – rustyx
                        Jan 20 at 12:05











                      • @rustyx: did I comment that somewhere under another answer? Compilers aren't going to emit code that causes partial-flag stalls, and certainly not for a C < or <=. But sure, test ecx,ecx / bt eax, 3 / jbe will jump if ZF is set (ecx==0), or if CF is set (bit 3 of EAX==1), causing a partial flag stall on most CPUs because the flags it reads don't all come from the last instruction to write any flags. On Sandybridge-family, it doesn't actually stall, just needs to insert a merging uop. cmp/test write all flags, but bt leaves ZF unmodified. felixcloutier.com/x86/bt

                        – Peter Cordes
                        Jan 20 at 12:23
















                      1














                      When I wrote this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operand have a shorter 1-byte encoding for -128..127.



                      For ARM and especially AArch64, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp w0, #0x00f000 would be encodeable, while cmp w0, #0x00effff might not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for AArch64.





                      < vs. <= in general, including for runtime-variable conditions



                      In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.



                      But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.



                      As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.



                      Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)



                      void foo(unsigned size) {
                      unsigned upper_bound = size - 1; // or any calculation that could produce UINT_MAX
                      for(unsigned i=0 ; i <= upper_bound ; i++)
                      ...


                      Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.



                      (A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)



                      In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.



                      Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).



                      But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (Iterating over all unsigned integers in a for loop C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)



                      With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.



                      Example: sum of integers from 1 to n



                      Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.



                      unsigned sum_1_to_n_finite(unsigned n) {
                      unsigned total = 0;
                      for (unsigned i = 0 ; i < n+1 ; ++i)
                      total += i;
                      return total;
                      }


                      x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer



                       # clang7.0 -O3 closed-form
                      cmp edi, -1 # n passed in EDI: x86-64 System V calling convention
                      je .LBB1_1 # if (n == UINT_MAX) return 0; // C++ loop runs 0 times
                      # else fall through into the closed-form calc
                      mov ecx, edi # zero-extend n into RCX
                      lea eax, [rdi - 1] # n-1
                      imul rax, rcx # n * (n-1) # 64-bit
                      shr rax # n * (n-1) / 2
                      add eax, edi # n + (stuff / 2) = n * (n+1) / 2 # truncated to 32-bit
                      ret # computed without possible overflow of the product before right shifting
                      .LBB1_1:
                      xor eax, eax
                      ret


                      But for the naive version, we just get a dumb loop from clang.



                      unsigned sum_1_to_n_naive(unsigned n) {
                      unsigned total = 0;
                      for (unsigned i = 0 ; i<=n ; ++i)
                      total += i;
                      return total;
                      }


                      # clang7.0 -O3
                      sum_1_to_n(unsigned int):
                      xor ecx, ecx # i = 0
                      xor eax, eax # retval = 0
                      .LBB0_1: # do {
                      add eax, ecx # retval += i
                      add ecx, 1 # ++1
                      cmp ecx, edi
                      jbe .LBB0_1 # } while( i<n );
                      ret




                      GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.



                      # "naive" inner loop
                      .L3:
                      add eax, 1 # do {
                      paddd xmm0, xmm1 # vect_total_4.6, vect_vec_iv_.5
                      paddd xmm1, xmm2 # vect_vec_iv_.5, tmp114
                      cmp edx, eax # bnd.1, ivtmp.14 # bound and induction-variable tmp, I think.
                      ja .L3 #, # }while( n > i )

                      "finite" inner loop
                      # before the loop:
                      # xmm0 = 0 = totals
                      # xmm1 = {0,1,2,3} = i
                      # xmm2 = set1_epi32(4)
                      .L13: # do {
                      add eax, 1 # i++
                      paddd xmm0, xmm1 # total[0..3] += i[0..3]
                      paddd xmm1, xmm2 # i[0..3] += 4
                      cmp eax, edx
                      jne .L13 # }while( i != upper_limit );

                      then horizontal sum xmm0
                      and peeled cleanup for the last n%3 iterations, or something.


                      It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.



                      BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1/jnz instead of add eax,1/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.






                      share|improve this answer


























                      • Nice contrived example. What about your other comment about a potential effect on out of order execution due to EFLAGS use? Is it purely theoretical or can it actually happen that a JB leads to a better pipeline than a JBE?

                        – rustyx
                        Jan 20 at 12:05











                      • @rustyx: did I comment that somewhere under another answer? Compilers aren't going to emit code that causes partial-flag stalls, and certainly not for a C < or <=. But sure, test ecx,ecx / bt eax, 3 / jbe will jump if ZF is set (ecx==0), or if CF is set (bit 3 of EAX==1), causing a partial flag stall on most CPUs because the flags it reads don't all come from the last instruction to write any flags. On Sandybridge-family, it doesn't actually stall, just needs to insert a merging uop. cmp/test write all flags, but bt leaves ZF unmodified. felixcloutier.com/x86/bt

                        – Peter Cordes
                        Jan 20 at 12:23














                      1












                      1








                      1







                      When I wrote this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operand have a shorter 1-byte encoding for -128..127.



                      For ARM and especially AArch64, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp w0, #0x00f000 would be encodeable, while cmp w0, #0x00effff might not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for AArch64.





                      < vs. <= in general, including for runtime-variable conditions



                      In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.



                      But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.



                      As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.



                      Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)



                      void foo(unsigned size) {
                      unsigned upper_bound = size - 1; // or any calculation that could produce UINT_MAX
                      for(unsigned i=0 ; i <= upper_bound ; i++)
                      ...


                      Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.



                      (A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)



                      In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.



                      Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).



                      But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (Iterating over all unsigned integers in a for loop C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)



                      With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.



                      Example: sum of integers from 1 to n



                      Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.



                      unsigned sum_1_to_n_finite(unsigned n) {
                      unsigned total = 0;
                      for (unsigned i = 0 ; i < n+1 ; ++i)
                      total += i;
                      return total;
                      }


                      x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer



                       # clang7.0 -O3 closed-form
                      cmp edi, -1 # n passed in EDI: x86-64 System V calling convention
                      je .LBB1_1 # if (n == UINT_MAX) return 0; // C++ loop runs 0 times
                      # else fall through into the closed-form calc
                      mov ecx, edi # zero-extend n into RCX
                      lea eax, [rdi - 1] # n-1
                      imul rax, rcx # n * (n-1) # 64-bit
                      shr rax # n * (n-1) / 2
                      add eax, edi # n + (stuff / 2) = n * (n+1) / 2 # truncated to 32-bit
                      ret # computed without possible overflow of the product before right shifting
                      .LBB1_1:
                      xor eax, eax
                      ret


                      But for the naive version, we just get a dumb loop from clang.



                      unsigned sum_1_to_n_naive(unsigned n) {
                      unsigned total = 0;
                      for (unsigned i = 0 ; i<=n ; ++i)
                      total += i;
                      return total;
                      }


                      # clang7.0 -O3
                      sum_1_to_n(unsigned int):
                      xor ecx, ecx # i = 0
                      xor eax, eax # retval = 0
                      .LBB0_1: # do {
                      add eax, ecx # retval += i
                      add ecx, 1 # ++1
                      cmp ecx, edi
                      jbe .LBB0_1 # } while( i<n );
                      ret




                      GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.



                      # "naive" inner loop
                      .L3:
                      add eax, 1 # do {
                      paddd xmm0, xmm1 # vect_total_4.6, vect_vec_iv_.5
                      paddd xmm1, xmm2 # vect_vec_iv_.5, tmp114
                      cmp edx, eax # bnd.1, ivtmp.14 # bound and induction-variable tmp, I think.
                      ja .L3 #, # }while( n > i )

                      "finite" inner loop
                      # before the loop:
                      # xmm0 = 0 = totals
                      # xmm1 = {0,1,2,3} = i
                      # xmm2 = set1_epi32(4)
                      .L13: # do {
                      add eax, 1 # i++
                      paddd xmm0, xmm1 # total[0..3] += i[0..3]
                      paddd xmm1, xmm2 # i[0..3] += 4
                      cmp eax, edx
                      jne .L13 # }while( i != upper_limit );

                      then horizontal sum xmm0
                      and peeled cleanup for the last n%3 iterations, or something.


                      It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.



                      BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1/jnz instead of add eax,1/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.






                      share|improve this answer















                      When I wrote this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operand have a shorter 1-byte encoding for -128..127.



                      For ARM and especially AArch64, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp w0, #0x00f000 would be encodeable, while cmp w0, #0x00effff might not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for AArch64.





                      < vs. <= in general, including for runtime-variable conditions



                      In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.



                      But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.



                      As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.



                      Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)



                      void foo(unsigned size) {
                      unsigned upper_bound = size - 1; // or any calculation that could produce UINT_MAX
                      for(unsigned i=0 ; i <= upper_bound ; i++)
                      ...


                      Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.



                      (A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)



                      In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.



                      Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).



                      But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (Iterating over all unsigned integers in a for loop C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)



                      With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.



                      Example: sum of integers from 1 to n



                      Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.



                      unsigned sum_1_to_n_finite(unsigned n) {
                      unsigned total = 0;
                      for (unsigned i = 0 ; i < n+1 ; ++i)
                      total += i;
                      return total;
                      }


                      x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer



                       # clang7.0 -O3 closed-form
                      cmp edi, -1 # n passed in EDI: x86-64 System V calling convention
                      je .LBB1_1 # if (n == UINT_MAX) return 0; // C++ loop runs 0 times
                      # else fall through into the closed-form calc
                      mov ecx, edi # zero-extend n into RCX
                      lea eax, [rdi - 1] # n-1
                      imul rax, rcx # n * (n-1) # 64-bit
                      shr rax # n * (n-1) / 2
                      add eax, edi # n + (stuff / 2) = n * (n+1) / 2 # truncated to 32-bit
                      ret # computed without possible overflow of the product before right shifting
                      .LBB1_1:
                      xor eax, eax
                      ret


                      But for the naive version, we just get a dumb loop from clang.



                      unsigned sum_1_to_n_naive(unsigned n) {
                      unsigned total = 0;
                      for (unsigned i = 0 ; i<=n ; ++i)
                      total += i;
                      return total;
                      }


                      # clang7.0 -O3
                      sum_1_to_n(unsigned int):
                      xor ecx, ecx # i = 0
                      xor eax, eax # retval = 0
                      .LBB0_1: # do {
                      add eax, ecx # retval += i
                      add ecx, 1 # ++1
                      cmp ecx, edi
                      jbe .LBB0_1 # } while( i<n );
                      ret




                      GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.



                      # "naive" inner loop
                      .L3:
                      add eax, 1 # do {
                      paddd xmm0, xmm1 # vect_total_4.6, vect_vec_iv_.5
                      paddd xmm1, xmm2 # vect_vec_iv_.5, tmp114
                      cmp edx, eax # bnd.1, ivtmp.14 # bound and induction-variable tmp, I think.
                      ja .L3 #, # }while( n > i )

                      "finite" inner loop
                      # before the loop:
                      # xmm0 = 0 = totals
                      # xmm1 = {0,1,2,3} = i
                      # xmm2 = set1_epi32(4)
                      .L13: # do {
                      add eax, 1 # i++
                      paddd xmm0, xmm1 # total[0..3] += i[0..3]
                      paddd xmm1, xmm2 # i[0..3] += 4
                      cmp eax, edx
                      jne .L13 # }while( i != upper_limit );

                      then horizontal sum xmm0
                      and peeled cleanup for the last n%3 iterations, or something.


                      It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.



                      BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1/jnz instead of add eax,1/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.







                      share|improve this answer














                      share|improve this answer



                      share|improve this answer








                      edited Jan 20 at 11:36

























                      answered Jan 20 at 11:30









                      Peter CordesPeter Cordes

                      126k18189321




                      126k18189321













                      • Nice contrived example. What about your other comment about a potential effect on out of order execution due to EFLAGS use? Is it purely theoretical or can it actually happen that a JB leads to a better pipeline than a JBE?

                        – rustyx
                        Jan 20 at 12:05











                      • @rustyx: did I comment that somewhere under another answer? Compilers aren't going to emit code that causes partial-flag stalls, and certainly not for a C < or <=. But sure, test ecx,ecx / bt eax, 3 / jbe will jump if ZF is set (ecx==0), or if CF is set (bit 3 of EAX==1), causing a partial flag stall on most CPUs because the flags it reads don't all come from the last instruction to write any flags. On Sandybridge-family, it doesn't actually stall, just needs to insert a merging uop. cmp/test write all flags, but bt leaves ZF unmodified. felixcloutier.com/x86/bt

                        – Peter Cordes
                        Jan 20 at 12:23



















                      • Nice contrived example. What about your other comment about a potential effect on out of order execution due to EFLAGS use? Is it purely theoretical or can it actually happen that a JB leads to a better pipeline than a JBE?

                        – rustyx
                        Jan 20 at 12:05











                      • @rustyx: did I comment that somewhere under another answer? Compilers aren't going to emit code that causes partial-flag stalls, and certainly not for a C < or <=. But sure, test ecx,ecx / bt eax, 3 / jbe will jump if ZF is set (ecx==0), or if CF is set (bit 3 of EAX==1), causing a partial flag stall on most CPUs because the flags it reads don't all come from the last instruction to write any flags. On Sandybridge-family, it doesn't actually stall, just needs to insert a merging uop. cmp/test write all flags, but bt leaves ZF unmodified. felixcloutier.com/x86/bt

                        – Peter Cordes
                        Jan 20 at 12:23

















                      Nice contrived example. What about your other comment about a potential effect on out of order execution due to EFLAGS use? Is it purely theoretical or can it actually happen that a JB leads to a better pipeline than a JBE?

                      – rustyx
                      Jan 20 at 12:05





                      Nice contrived example. What about your other comment about a potential effect on out of order execution due to EFLAGS use? Is it purely theoretical or can it actually happen that a JB leads to a better pipeline than a JBE?

                      – rustyx
                      Jan 20 at 12:05













                      @rustyx: did I comment that somewhere under another answer? Compilers aren't going to emit code that causes partial-flag stalls, and certainly not for a C < or <=. But sure, test ecx,ecx / bt eax, 3 / jbe will jump if ZF is set (ecx==0), or if CF is set (bit 3 of EAX==1), causing a partial flag stall on most CPUs because the flags it reads don't all come from the last instruction to write any flags. On Sandybridge-family, it doesn't actually stall, just needs to insert a merging uop. cmp/test write all flags, but bt leaves ZF unmodified. felixcloutier.com/x86/bt

                      – Peter Cordes
                      Jan 20 at 12:23





                      @rustyx: did I comment that somewhere under another answer? Compilers aren't going to emit code that causes partial-flag stalls, and certainly not for a C < or <=. But sure, test ecx,ecx / bt eax, 3 / jbe will jump if ZF is set (ecx==0), or if CF is set (bit 3 of EAX==1), causing a partial flag stall on most CPUs because the flags it reads don't all come from the last instruction to write any flags. On Sandybridge-family, it doesn't actually stall, just needs to insert a merging uop. cmp/test write all flags, but bt leaves ZF unmodified. felixcloutier.com/x86/bt

                      – Peter Cordes
                      Jan 20 at 12:23


















                      draft saved

                      draft discarded




















































                      Thanks for contributing an answer to Stack Overflow!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f12135518%2fis-faster-than%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Callistus III

                      Plistias Cous

                      Index Sanctorum