874

The following Java program takes on average between 0.50 secs and 0.55 secs to run:

public static void main(String[] args) {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
    System.out.println("n = " + n);
}

If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65 secs to run. How come?

I ran each version of the program 15 times, alternating between the two. Here are the results:

 2*(i*i)  |  2*i*i
----------+----------
0.5183738 | 0.6246434
0.5298337 | 0.6049722
0.5308647 | 0.6603363
0.5133458 | 0.6243328
0.5003011 | 0.6541802
0.5366181 | 0.6312638
0.515149  | 0.6241105
0.5237389 | 0.627815
0.5249942 | 0.6114252
0.5641624 | 0.6781033
0.538412  | 0.6393969
0.5466744 | 0.6608845
0.531159  | 0.6201077
0.5048032 | 0.6511559
0.5232789 | 0.6544526

The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they had the same efficiency, the probability of this happening would be less than 1/2^15 * 100% = 0.00305%.

new QOpenGLWidget
Stefan

  • I get similar results (slightly different numbers, but a definitely noticeable and consistent gap, definitely more than sampling error) – Krease Nov 23 '18 at 20:47
  • Also please see: https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – lexicore Nov 23 '18 at 20:56
  • @Krease Good that you caught my mistake. According to the new benchmark I ran, `2 * i * i` is slower. I'll try running with Graal as well. – Jorn Vernee Nov 23 '18 at 21:07
  • I was able to confirm the poster's result: https://pastebin.com/KG0YRdnU When running that code, the 2 * (i * i) option consistently runs 10-20% faster. If I let both options do the same thing, the deviation is always < 5%. – Philipp Jahoda Nov 23 '18 at 21:08
  • It's not the case on my run. I have two versions of this calculation in two methods, calc1() and calc2(). In main, I call calc1() followed by calc2(). Then I check with calc2() followed by calc1(). In both cases, the first call (whether calc1() or calc2()) always takes longer. You can try. – elyor Nov 23 '18 at 21:20
  • @JornVernee If it could help you add further information to your answer: the JVM instruction set for [`iload`](https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-6.html#jvms-6.5.iload) and [`imul`](https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-6.html#jvms-6.5.imul) – Naman Nov 23 '18 at 21:21
  • I cannot reliably reproduce your result; for me the bracketed version is slower in 95% of cases. – Caius Jard Nov 23 '18 at 21:25
  • @nullpointer To find out for real why one is faster than the other, we'd have to get the disassembly or Ideal graphs for those methods. The assembler is very annoying to try and figure out, so I'm trying to get an OpenJDK debug build which can output nice graphs. – Jorn Vernee Nov 23 '18 at 21:29
  • It depends on many factors, such as the JVM version, and so on. You can't always reproduce this. – elyor Nov 23 '18 at 21:33
  • It's clear that performing the `i * i` multiplication first has something to do with the better performance; `2 * (i * i)` seems to be equally fast as `i * i * 2`. – Joakim Danielson Nov 23 '18 at 21:44
  • Is it supposed to be some kind of bug in the JIT? – JAAAY Nov 24 '18 at 22:37
  • You could rename your question to "_Why is `i * i * 2` faster than `2 * i * i`?_" for improved clarity that the issue is in the order of the operations. – Cœur Nov 28 '18 at 04:02
  • Please add the JDK version used. – Mikhail Kholodkov Nov 30 '18 at 16:51
  • @Cœur but this is not what happens in bytecode. – DSchmidt Nov 30 '18 at 16:56
  • I've done the same test and the results are totally opposite: https://stackoverflow.com/questions/53570864/why-is-2-x-x-faster-than-2-x-x-in-python – Waqas Gondal Dec 01 '18 at 13:34

10 Answers

1218

There is a slight difference in the ordering of the bytecode.

2 * (i * i):

     iconst_2
     iload0
     iload0
     imul
     imul
     iadd

vs 2 * i * i:

     iconst_2
     iload0
     imul
     iload0
     imul
     iadd

At first sight this should not make a difference; if anything, the second version is more optimal, since it uses one operand-stack slot less.

So we need to dig deeper into the lower level: the JIT compiler1.

Remember that the JIT compiler tends to unroll small loops very aggressively. Indeed, we observe a 16x unrolling for the 2 * (i * i) case:

030   B2: # B2 B3 <- B1 B2  Loop: B2-B2 inner main of N18 Freq: 1e+006
030     addl    R11, RBP    # int
033     movl    RBP, R13    # spill
036     addl    RBP, #14    # int
039     imull   RBP, RBP    # int
03c     movl    R9, R13 # spill
03f     addl    R9, #13 # int
043     imull   R9, R9  # int
047     sall    RBP, #1
049     sall    R9, #1
04c     movl    R8, R13 # spill
04f     addl    R8, #15 # int
053     movl    R10, R8 # spill
056     movdl   XMM1, R8    # spill
05b     imull   R10, R8 # int
05f     movl    R8, R13 # spill
062     addl    R8, #12 # int
066     imull   R8, R8  # int
06a     sall    R10, #1
06d     movl    [rsp + #32], R10    # spill
072     sall    R8, #1
075     movl    RBX, R13    # spill
078     addl    RBX, #11    # int
07b     imull   RBX, RBX    # int
07e     movl    RCX, R13    # spill
081     addl    RCX, #10    # int
084     imull   RCX, RCX    # int
087     sall    RBX, #1
089     sall    RCX, #1
08b     movl    RDX, R13    # spill
08e     addl    RDX, #8 # int
091     imull   RDX, RDX    # int
094     movl    RDI, R13    # spill
097     addl    RDI, #7 # int
09a     imull   RDI, RDI    # int
09d     sall    RDX, #1
09f     sall    RDI, #1
0a1     movl    RAX, R13    # spill
0a4     addl    RAX, #6 # int
0a7     imull   RAX, RAX    # int
0aa     movl    RSI, R13    # spill
0ad     addl    RSI, #4 # int
0b0     imull   RSI, RSI    # int
0b3     sall    RAX, #1
0b5     sall    RSI, #1
0b7     movl    R10, R13    # spill
0ba     addl    R10, #2 # int
0be     imull   R10, R10    # int
0c2     movl    R14, R13    # spill
0c5     incl    R14 # int
0c8     imull   R14, R14    # int
0cc     sall    R10, #1
0cf     sall    R14, #1
0d2     addl    R14, R11    # int
0d5     addl    R14, R10    # int
0d8     movl    R10, R13    # spill
0db     addl    R10, #3 # int
0df     imull   R10, R10    # int
0e3     movl    R11, R13    # spill
0e6     addl    R11, #5 # int
0ea     imull   R11, R11    # int
0ee     sall    R10, #1
0f1     addl    R10, R14    # int
0f4     addl    R10, RSI    # int
0f7     sall    R11, #1
0fa     addl    R11, R10    # int
0fd     addl    R11, RAX    # int
100     addl    R11, RDI    # int
103     addl    R11, RDX    # int
106     movl    R10, R13    # spill
109     addl    R10, #9 # int
10d     imull   R10, R10    # int
111     sall    R10, #1
114     addl    R10, R11    # int
117     addl    R10, RCX    # int
11a     addl    R10, RBX    # int
11d     addl    R10, R8 # int
120     addl    R9, R10 # int
123     addl    RBP, R9 # int
126     addl    RBP, [RSP + #32 (32-bit)]   # int
12a     addl    R13, #16    # int
12e     movl    R11, R13    # spill
131     imull   R11, R13    # int
135     sall    R11, #1
138     cmpl    R13, #999999985
13f     jl     B2   # loop end  P=1.000000 C=6554623.000000

We see that there is 1 register that is "spilled" onto the stack.

And for the 2 * i * i version:

05a   B3: # B2 B4 <- B1 B2  Loop: B3-B2 inner main of N18 Freq: 1e+006
05a     addl    RBX, R11    # int
05d     movl    [rsp + #32], RBX    # spill
061     movl    R11, R8 # spill
064     addl    R11, #15    # int
068     movl    [rsp + #36], R11    # spill
06d     movl    R11, R8 # spill
070     addl    R11, #14    # int
074     movl    R10, R9 # spill
077     addl    R10, #16    # int
07b     movdl   XMM2, R10   # spill
080     movl    RCX, R9 # spill
083     addl    RCX, #14    # int
086     movdl   XMM1, RCX   # spill
08a     movl    R10, R9 # spill
08d     addl    R10, #12    # int
091     movdl   XMM4, R10   # spill
096     movl    RCX, R9 # spill
099     addl    RCX, #10    # int
09c     movdl   XMM6, RCX   # spill
0a0     movl    RBX, R9 # spill
0a3     addl    RBX, #8 # int
0a6     movl    RCX, R9 # spill
0a9     addl    RCX, #6 # int
0ac     movl    RDX, R9 # spill
0af     addl    RDX, #4 # int
0b2     addl    R9, #2  # int
0b6     movl    R10, R14    # spill
0b9     addl    R10, #22    # int
0bd     movdl   XMM3, R10   # spill
0c2     movl    RDI, R14    # spill
0c5     addl    RDI, #20    # int
0c8     movl    RAX, R14    # spill
0cb     addl    RAX, #32    # int
0ce     movl    RSI, R14    # spill
0d1     addl    RSI, #18    # int
0d4     movl    R13, R14    # spill
0d7     addl    R13, #24    # int
0db     movl    R10, R14    # spill
0de     addl    R10, #26    # int
0e2     movl    [rsp + #40], R10    # spill
0e7     movl    RBP, R14    # spill
0ea     addl    RBP, #28    # int
0ed     imull   RBP, R11    # int
0f1     addl    R14, #30    # int
0f5     imull   R14, [RSP + #36 (32-bit)]   # int
0fb     movl    R10, R8 # spill
0fe     addl    R10, #11    # int
102     movdl   R11, XMM3   # spill
107     imull   R11, R10    # int
10b     movl    [rsp + #44], R11    # spill
110     movl    R10, R8 # spill
113     addl    R10, #10    # int
117     imull   RDI, R10    # int
11b     movl    R11, R8 # spill
11e     addl    R11, #8 # int
122     movdl   R10, XMM2   # spill
127     imull   R10, R11    # int
12b     movl    [rsp + #48], R10    # spill
130     movl    R10, R8 # spill
133     addl    R10, #7 # int
137     movdl   R11, XMM1   # spill
13c     imull   R11, R10    # int
140     movl    [rsp + #52], R11    # spill
145     movl    R11, R8 # spill
148     addl    R11, #6 # int
14c     movdl   R10, XMM4   # spill
151     imull   R10, R11    # int
155     movl    [rsp + #56], R10    # spill
15a     movl    R10, R8 # spill
15d     addl    R10, #5 # int
161     movdl   R11, XMM6   # spill
166     imull   R11, R10    # int
16a     movl    [rsp + #60], R11    # spill
16f     movl    R11, R8 # spill
172     addl    R11, #4 # int
176     imull   RBX, R11    # int
17a     movl    R11, R8 # spill
17d     addl    R11, #3 # int
181     imull   RCX, R11    # int
185     movl    R10, R8 # spill
188     addl    R10, #2 # int
18c     imull   RDX, R10    # int
190     movl    R11, R8 # spill
193     incl    R11 # int
196     imull   R9, R11 # int
19a     addl    R9, [RSP + #32 (32-bit)]    # int
19f     addl    R9, RDX # int
1a2     addl    R9, RCX # int
1a5     addl    R9, RBX # int
1a8     addl    R9, [RSP + #60 (32-bit)]    # int
1ad     addl    R9, [RSP + #56 (32-bit)]    # int
1b2     addl    R9, [RSP + #52 (32-bit)]    # int
1b7     addl    R9, [RSP + #48 (32-bit)]    # int
1bc     movl    R10, R8 # spill
1bf     addl    R10, #9 # int
1c3     imull   R10, RSI    # int
1c7     addl    R10, R9 # int
1ca     addl    R10, RDI    # int
1cd     addl    R10, [RSP + #44 (32-bit)]   # int
1d2     movl    R11, R8 # spill
1d5     addl    R11, #12    # int
1d9     imull   R13, R11    # int
1dd     addl    R13, R10    # int
1e0     movl    R10, R8 # spill
1e3     addl    R10, #13    # int
1e7     imull   R10, [RSP + #40 (32-bit)]   # int
1ed     addl    R10, R13    # int
1f0     addl    RBP, R10    # int
1f3     addl    R14, RBP    # int
1f6     movl    R10, R8 # spill
1f9     addl    R10, #16    # int
1fd     cmpl    R10, #999999985
204     jl     B2   # loop end  P=1.000000 C=7419903.000000

Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved.

Thus the answer to the question is simple: 2 * (i * i) is faster than 2 * i * i because the JIT generates more optimal assembly code for the first case.


But of course it is obvious that neither the first nor the second version is any good; the loop could really benefit from vectorization, since any x86-64 CPU has at least SSE2 support.

So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.

In fact, modern x86-64 CPUs break instructions down further into micro-ops (µops), and with features like register renaming, µop caches and loop buffers, loop optimization takes a lot more finesse than simple unrolling for optimal performance. According to Agner Fog's optimization guide:

The gain in performance due to the µop cache can be quite considerable if the average instruction length is more than 4 bytes. The following methods of optimizing the use of the µop cache may be considered:

  • Make sure that critical loops are small enough to fit into the µop cache.
  • Align the most critical loop entries and function entries by 32.
  • Avoid unnecessary loop unrolling.
  • Avoid instructions that have extra load time
    . . .

Regarding those load times - even the fastest L1D hit costs 4 cycles, an extra register and µop, so yes, even a few accesses to memory will hurt performance in tight loops.

But back to the vectorization opportunity - to see how fast it can be, we can compile a similar C application with GCC, which outright vectorizes it (AVX2 is shown, SSE2 is similar)2.
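
The answer doesn't include the C source itself, so here is a minimal sketch of what such a counterpart might look like (the loop mirrors the Java version; the compiler flags are an assumption apart from -fwrapv, which footnote 2 mentions - something like gcc -O3 -mavx2 -fwrapv):

#include <stdio.h>

int main(void) {
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);   /* wraps around on overflow thanks to -fwrapv */
    }
    printf("n = %d\n", n);  /* keeps the loop from being discarded as dead code */
    return 0;
}

GCC's vectorized inner loop for this then looks as follows: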

  vmovdqa ymm0, YMMWORD PTR .LC0[rip]
  vmovdqa ymm3, YMMWORD PTR .LC1[rip]
  xor eax, eax
  vpxor xmm2, xmm2, xmm2
.L2:
  vpmulld ymm1, ymm0, ymm0
  inc eax
  vpaddd ymm0, ymm0, ymm3
  vpslld ymm1, ymm1, 1
  vpaddd ymm2, ymm2, ymm1
  cmp eax, 125000000      ; 8 calculations per iteration
  jne .L2
  vmovdqa xmm0, xmm2
  vextracti128 xmm2, ymm2, 1
  vpaddd xmm2, xmm0, xmm2
  vpsrldq xmm0, xmm2, 8
  vpaddd xmm0, xmm2, xmm0
  vpsrldq xmm1, xmm0, 4
  vpaddd xmm0, xmm0, xmm1
  vmovd eax, xmm0
  vzeroupper

With run times:

  • SSE: 0.24 s, or 2 times as fast.
  • AVX: 0.15 s, or 3 times as fast.
  • AVX2: 0.08 s, or 5 times as fast.

1 To get the JIT-generated assembly output, get a debug JVM and run with -XX:+PrintOptoAssembly.
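For example, a hypothetical invocation (Main standing in for whichever class contains the loop; the flag is only available in debug builds):

java -XX:+PrintOptoAssembly Main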

2 The C version is compiled with the -fwrapv flag, which makes GCC treat signed integer overflow as a two's-complement wrap-around.

Arthur Tacca
rustyx
  • The Java and C examples use quite different register names. Are both examples using the AMD64 ISA? – kasperd Nov 25 '18 at 17:11
  • The single biggest problem the optimizer encounters in the C example is the undefined behavior invoked by signed integer overflow. Which, otherwise, would probably result in simply loading a constant, as the whole loop can be calculated at compile time. – Damon Nov 25 '18 at 18:28
  • @Damon Why would undefined behavior be a problem for the optimizer? If the optimizer sees it overflows when trying to calculate the result, that just means it can optimize it however it wants, because the behavior is undefined. –  Nov 25 '18 at 18:51
  • SSE4 would be similar (using `pmulld`), but SSE2 has to use `pmuludq`, which only does two 32x32 => 64-bit multiplies per instruction. (And unfortunately it does a bunch of shifting and shuffling to pack it back down, instead of simply accumulating a vector of 2 results.) https://gcc.godbolt.org/z/buYR9o Even more unfortunately, it doesn't sink the `* 2` out of the loop for scalar or vector! – Peter Cordes Nov 25 '18 at 22:17
  • @Runemoro: if the optimizer proves that calling the function will inevitably result in undefined behaviour, it could choose to assume that the function will never be called, and emit no body for it. Or emit just a `ret` instruction, or emit a label and *no* ret instruction so execution just falls through. GCC does in fact behave this way sometimes when it encounters UB. For example: [why ret disappear with optimization?](https://stackoverflow.com/a/52967037). You definitely want to compile well-formed code to be sure the asm is sane. – Peter Cordes Nov 25 '18 at 22:26
  • 4c L1d load-use latency is not a factor here. RSP is constant the whole time, so out-of-order execution can run the load early enough to have the data ready. The cost of spill/reload is all in the extra uops it costs. Store/reload store-forwarding latency (3 to 5 cycles) is separate from L1d cache hit latency, and is a possible problem, but I don't think that's happening here. The loop takes more than 5 cycles per iteration, so it's not a bottleneck. And I don't think store throughput is a bottleneck either. – Peter Cordes Nov 25 '18 at 22:32
  • It's probably just a front-end uop throughput bottleneck because of the inefficient code-gen. It's not even using LEA as a peephole for `mov` / `add-immediate`. e.g. `movl RBX, R9` / `addl RBX, #8` should be `leal ebx, [r9 + 8]`, 1 uop to copy-and-add. Or `leal ebx, [r9 + r9 + 16]` to do `ebx = 2*(r9+8)`. So yeah, unrolling to the point of spilling is dumb, and so is naive braindead codegen that doesn't take advantage of integer identities and associative integer math. – Peter Cordes Nov 25 '18 at 22:38
  • @kasperd The register names are different because the vector instructions use *an entirely different set of registers* to the normal instructions. – Martin Bonner supports Monica Nov 26 '18 at 15:54
  • @MartinBonner At the time I wrote my comment there were no vector instructions in this answer. [This](https://stackoverflow.com/revisions/53453588/2) is the version of the answer I was commenting on. – kasperd Nov 26 '18 at 15:58
  • @kasperd - the answer is yes for that version as well. – rustyx Nov 26 '18 at 16:45
  • @Runemoro, the `-fwrapv` option is about *restricting* what optimizations can be performed. Since signed integer overflow is undefined behavior, and the optimizer can prove that execution of the program will inevitably encounter a signed integer overflow, it's free to optimize in any way it feels fit, up to and including emitting an empty program, or making demons fly out of your nose. The `-fwrapv` option turns it into implementation-defined behavior, and restricts the optimizer to producing code that actually does the math, or at least produces the same results as if the math had been done. – Mark Nov 28 '18 at 00:17
  • Vectorization for sequential reduction was disabled in C2 (https://bugs.openjdk.java.net/browse/JDK-8078563), but is now being considered for re-enabling (https://bugs.openjdk.java.net/browse/JDK-8188313). – pron Nov 30 '18 at 14:31
  • This is a really interesting answer, but it does make you wonder: *why* does it generate different instructions for each case? – user20349 Dec 02 '18 at 10:52
  • Note that [Clang does something miraculous here](https://godbolt.org/z/6ANRgO) either way you put the parentheses: it **completely optimizes out the loop** using the closed form 1² + 2² + ... + n² = n(n+1)(2n+1)/6, turning O(n) code into O(1) code! – David Zhang Dec 03 '18 at 05:56
  • I've got some interesting [results](https://stackoverflow.com/questions/53452713/why-is-2-i-i-faster-than-2-i-i-in-java/53576845#53576845) after switching off loop unrolling entirely. – Oleksandr Pyrohov Dec 04 '18 at 00:37
  • The worst part of all of this is that at no point do any of these components realize that multiplying by `2` can be performed much faster by performing a `left-shift`. I copied the example from the question and performed all 3 versions at once, and it's clear from my testing that `2 * i * i` runs fastest (in opposition to the original test case) and that the `(i * i) << 1` case runs the *slowest*, which is shocking to me, honestly. I happen to be using OpenJDK 11 on MacOS. I didn't analyze the JIT's output. – Christopher Schultz Dec 04 '18 at 15:01
  • I also noticed that performing the test in the `main` method pretty much avoids the JIT compiler entirely. I moved the code from `main` into a `doit` method and called that method twice from `main`, then a short `Thread.sleep`, and then called it one more time, and the difference is *six orders of magnitude*. First time through the `doit` method: `v1: 2755623 ns v2: 3066809 ns v3: 3701626 ns`. After the JIT does its job: `v1: 46 ns v2: 44 ns v3: 46 ns`. At this point, I think my CPU is too fast to draw any useful conclusions about the differences between these techniques. – Christopher Schultz Dec 04 '18 at 15:05
  • @ChristopherSchultz: the `2 * (i*i)` version has lots of `sall R14, #1` in the asm. That's a shift left by 1 of the 32-bit register `r14d`. (Arithmetic and logical left shift are synonyms - different mnemonics for the same opcode: http://felixcloutier.com/x86/SAL:SAR:SHL:SHR.html). It's potentially slower than `addl r14,r14`, though, because it can't run on as many execution ports on typical x86 CPUs. (e.g. 1 or 2 per clock throughput on Skylake vs. 4 per clock). Anyway, like I commented earlier, the dumb thing is that the JIT compiler doesn't use x86's shift-and-add instruction (LEA). – Peter Cordes Dec 06 '18 at 10:18
  • I think the `2*i*i` version is using `add` somehow, because there are still "only" 16 `imul` instructions and no `sal` or `shl`. I don't think it's sinking the `2*` out of the loop. – Peter Cordes Dec 06 '18 at 10:23
  • @PeterCordes Oh, I didn't realize that `sall` was a shift, and it's interesting to note the (lack of) parallelizability due to the nuances of the x86-64 microarchitecture. Thanks for the details! – Christopher Schultz Dec 06 '18 at 17:52
  • @rustyx I'm curious... how did you dump the native code coming from the JIT? I've never dug in any deeper in Java than the bytecode level. – Christopher Schultz Dec 06 '18 at 17:53
  • Just noticed that on Haswell, `vpslld ymm1, ymm1, 1` can only run on the same port as `vpmulld`, port 0. Left-shift by 1 is the same operation as `vpaddd ymm1, ymm1, ymm1` (`x += x` is the same as `x *= 2` or `x <<= 1`). `padd` runs on port 1 or port 5 on Haswell. So this loop is slower than it needs to be on Haswell. (Skylake can run SIMD integer multiply and shift on more ports, so probably no difference.) Also, `inc eax` / `cmp/jcc` wastes a uop vs. `dec eax/jnz`, which can macro-fuse on Intel CPUs into a dec-and-branch uop. (AMD, and older Intel, can only fuse test or cmp/jcc.) – Peter Cordes Dec 08 '18 at 06:36
129

(Editor's note: this answer is contradicted by evidence from looking at the asm, as shown by another answer. This was a guess backed up by some experiments, but it turned out not to be correct.)


When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:

int n = 0;
for (int i = 0; i < 1000000000; i++) {
    n += i * i;
}
n *= 2;

but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the n += addition.

Here are a few reasons why I think this is the case:

  • Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being equally efficient, since factoring out the multiplication no longer guarantees that the result will be the same
  • The optimized version (with the multiplication by 2 factored out) is exactly as fast as the 2 * (i * i) version

Here is the test code that I used to draw these conclusions:

public static void main(String[] args) {
    long fastVersion = 0;
    long slowVersion = 0;
    long optimizedVersion = 0;
    long modifiedFastVersion = 0;
    long modifiedSlowVersion = 0;

    for (int i = 0; i < 10; i++) {
        fastVersion += fastVersion();
        slowVersion += slowVersion();
        optimizedVersion += optimizedVersion();
        modifiedFastVersion += modifiedFastVersion();
        modifiedSlowVersion += modifiedSlowVersion();
    }

    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
}

private static long fastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
}

private static long slowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
}

private static long optimizedVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += i * i;
    }
    n *= 2;
    return System.nanoTime() - startTime;
}

private static long modifiedFastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        if (n == 0) n = 1;
        n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
}

private static long modifiedSlowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        if (n == 0) n = 1;
        n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
}

And here are the results:

Fast version: 5.7274411 s
Slow version: 7.6190804 s
Optimized version: 5.1348007 s
Modified fast version: 7.1492705 s
Modified slow version: 7.2952668 s
Peter Cordes
  • I think on the optimizedVersion, it should be `n *= 2000000000;` – StefansArya Nov 24 '18 at 01:19
  • @StefansArya - No. Consider the case where the limit is 4, and we are trying to calculate `2*1*1 + 2*2*2 + 2*3*3`. It is obvious that calculating `1*1 + 2*2 + 3*3` and multiplying by 2 is correct, whereas multiplying by 8 would not be. – Martin Bonner supports Monica Nov 26 '18 at 15:57
  • @MartinBonner - Agreed. I think someone explained it a few days ago, but now their comment is gone. Thanks anyway for explaining. – StefansArya Nov 26 '18 at 17:12
  • The math equation was just like this: `2(1²) + 2(2²) + 2(3²) = 2(1² + 2² + 3²)`. That was very simple and I just forgot it because of the loop increment. – StefansArya Nov 26 '18 at 17:22
  • If you print out the assembly using a debug JVM, this does not appear to be correct. You will see a bunch of `sall ..., #1`, which are multiplies by 2, in the loop. Interestingly, the *slower* version does not appear to have multiplies in the loop. – Daniel Berlin Dec 01 '18 at 04:53
  • Why can the JVM factor out the 2 from `2 * (i * i)` but not from `(2 * i) * i`? I would think they are equivalent (that may be my bad assumption). If so, wouldn't the JVM canonicalize the expression before optimising? – RedSpikeyThing Dec 01 '18 at 14:50
40

Bytecode reference: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html
Bytecode viewer: https://github.com/Konloch/bytecode-viewer

On my JDK (Windows 10 64 bit, 1.8.0_65-b17) I can reproduce and explain:

public static void main(String[] args) {
    int repeat = 10;
    long A = 0;
    long B = 0;
    for (int i = 0; i < repeat; i++) {
        A += test();
        B += testB();
    }

    System.out.println(A / repeat + " ms");
    System.out.println(B / repeat + " ms");
}


private static long test() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
        n += multi(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
        n += multi(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms A " + n);
    return ms;
}


private static long testB() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
        n += multiB(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
        n += multiB(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms B " + n);
    return ms;
}

private static int multiB(int i) {
    return 2 * (i * i);
}

private static int multi(int i) {
    return 2 * i * i;
}

Output:

...
405 ms A 785527736
327 ms B 785527736
404 ms A 785527736
329 ms B 785527736
404 ms A 785527736
328 ms B 785527736
404 ms A 785527736
328 ms B 785527736
410 ms
333 ms

So why? The byte code is this:

 private static multiB(int arg0) { // 2 * (i * i)
     <localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>

     L1 {
         iconst_2
         iload0
         iload0
         imul
         imul
         ireturn
     }
     L2 {
     }
 }

 private static multi(int arg0) { // 2 * i * i
     <localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>

     L1 {
         iconst_2
         iload0
         imul
         iload0
         imul
         ireturn
     }
     L2 {
     }
 }

The difference being: With brackets (2 * (i * i)):

  • push const stack
  • push local on stack
  • push local on stack
  • multiply top of stack
  • multiply top of stack

Without brackets (2 * i * i):

  • push const stack
  • push local on stack
  • multiply top of stack
  • push local on stack
  • multiply top of stack

Loading all on the stack and then working back down is faster than switching between putting on the stack and operating on it.

Peter Mortensen
DSchmidt
  • But why is push-push-multiply-multiply faster than push-multiply-push-multiply? – m0skit0 Dec 01 '18 at 12:04
  • @m0skit0: Indeed, the answer isn't explained by bytecode, only by looking at actual JITed x86-64 asm. JITing with the same 16x unroll for a machine with more registers (like AArch64 or PowerPC) would probably show no difference on those other ISAs, unlike x86-64 or probably 32-bit ARM. It's not inherently faster to push everything and work back down in Java bytecode, or at least this Q&A doesn't prove that. It happens to be slower in this case where the JIT compiler trips over itself worse in one case than another. – Peter Cordes Jul 03 '20 at 08:24
35

Kasperd asked in a comment of the accepted answer:

The Java and C examples use quite different register names. Are both examples using the AMD64 ISA?

xor edx, edx
xor eax, eax
.L2:
mov ecx, edx
imul ecx, edx
add edx, 1
lea eax, [rax+rcx*2]
cmp edx, 1000000000
jne .L2

I don't have enough reputation to answer this in the comments, but these are the same ISA. It's worth pointing out that the GCC version uses 32-bit integer logic and the JVM-compiled version uses 64-bit integer logic internally.

R8 to R15 are just new X86_64 registers. EAX to EDX are the lower parts of the RAX to RDX general-purpose registers. The important part in the answer is that the GCC version is not unrolled - it simply executes one round of the loop per machine-code loop iteration. The JVM version, by contrast, has 16 rounds of the loop in one physical loop (based on rustyx's answer; I did not reinterpret the assembly). That is one of the reasons why more registers are used: the loop body is actually 16 times longer.

Puzzled
  • Too bad gcc doesn't notice it can sink the `*2` out of the loop. Although in this case, it's not even a win to do that, because it's doing it for free with LEA. On Intel CPUs, `lea eax, [rax+rcx*2]` has the same 1c latency as `add eax,ecx`. However, on AMD CPUs any scaled index increases LEA latency to 2 cycles. So the loop-carried dependency chain lengthens to 2 cycles, becoming the bottleneck on Ryzen. (`imul ecx,edx` throughput is 1 per clock on Ryzen, and on Intel). – Peter Cordes Nov 30 '18 at 05:46
31

While not directly related to the question's environment, just out of curiosity, I did the same test on .NET Core 2.1, x64, release mode.

Here is the interesting result, confirming that similar phenomena (the other way around) happen over on the dark side of the force. Code:

static void Main(string[] args)
{
    Stopwatch watch = new Stopwatch();

    Console.WriteLine("2 * (i * i)");

    for (int a = 0; a < 10; a++)
    {
        int n = 0;

        watch.Restart();

        for (int i = 0; i < 1000000000; i++)
        {
            n += 2 * (i * i);
        }

        watch.Stop();

        Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds} ms");
    }

    Console.WriteLine();
    Console.WriteLine("2 * i * i");

    for (int a = 0; a < 10; a++)
    {
        int n = 0;

        watch.Restart();

        for (int i = 0; i < 1000000000; i++)
        {
            n += 2 * i * i;
        }

        watch.Stop();

        Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");
    }
}

Result:

2 * (i * i)

  • result:119860736, 438 ms
  • result:119860736, 433 ms
  • result:119860736, 437 ms
  • result:119860736, 435 ms
  • result:119860736, 436 ms
  • result:119860736, 435 ms
  • result:119860736, 435 ms
  • result:119860736, 439 ms
  • result:119860736, 436 ms
  • result:119860736, 437 ms

2 * i * i

  • result:119860736, 417 ms
  • result:119860736, 417 ms
  • result:119860736, 417 ms
  • result:119860736, 418 ms
  • result:119860736, 418 ms
  • result:119860736, 417 ms
  • result:119860736, 418 ms
  • result:119860736, 416 ms
  • result:119860736, 417 ms
  • result:119860736, 418 ms
Peter Mortensen
Ünsal Ersöz
  • While this isn't an answer to the question, it does add value. That being said, if something is vital to your post, *please in-line it in the post* rather than linking to an off-site resource. Links go dead. – Jared Smith Nov 28 '18 at 13:54
  • @JaredSmith Thanks for the feedback. Considering the link you mention is the "result" link, that image is not an off-site source. I uploaded it to Stack Overflow via its own panel. – Ünsal Ersöz Nov 28 '18 at 14:32
  • It's a link to imgur, so yes, it is, it doesn't matter how you added the link. I fail to see what's so difficult about copy-pasting some console output. – Jared Smith Nov 28 '18 at 14:49
  • @JaredSmith Done. – Ünsal Ersöz Nov 28 '18 at 15:02
  • Except this is the other way around – leppie Nov 30 '18 at 14:55
  • Above it uses lea for case 1 and shl for case 2 for multiplication. – Onur Gumus Dec 01 '18 at 04:51
  • @JaredSmith: but ... stack.imgur.com links refer to Stack Exchange/Stack Overflow's own private instance of imgur, which should survive at least as long as Stack Overflow itself, since that's where the image uploader puts all images. (Naturally, that *doesn't* make it an appropriate replacement for text. Only that "links go dead" seems inapplicable.) – SamB Dec 01 '18 at 06:11
  • @SamB it's still on the imgur.com domain, which means it'll survive only for as long as imgur. – p91paul Dec 01 '18 at 10:22
  • @p91paul: Not really. Stack Overflow/Stack Exchange has already migrated all the imgur images once (from the main imgur instance to their own, SO/SE-exclusive instance), and they can do it again if that becomes necessary. (I mean, there might be an issue if somehow all of the stack.imgur.com servers explode at the same time, but that's probably not really all that likely.) They're not really considered "off-site" links. – SamB Dec 14 '18 at 02:14
21

I got similar results:

2 * (i * i): 0.458765943 s, n=119860736
2 * i * i: 0.580255126 s, n=119860736

I got the SAME results whether both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.

Finally, here is a javap -c -v <.class> disassembly of each:

     3: ldc           #3                  // String 2 * (i * i):
     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
    11: lstore_1
    12: iconst_0
    13: istore_3
    14: iconst_0
    15: istore        4
    17: iload         4
    19: ldc           #6                  // int 1000000000
    21: if_icmpge     40
    24: iload_3
    25: iconst_2
    26: iload         4
    28: iload         4
    30: imul
    31: imul
    32: iadd
    33: istore_3
    34: iinc          4, 1
    37: goto          17

vs.

     3: ldc           #3                  // String 2 * i * i:
     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
    11: lstore_1
    12: iconst_0
    13: istore_3
    14: iconst_0
    15: istore        4
    17: iload         4
    19: ldc           #6                  // int 1000000000
    21: if_icmpge     40
    24: iload_3
    25: iconst_2
    26: iload         4
    28: imul
    29: iload         4
    31: imul
    32: iadd
    33: istore_3
    34: iinc          4, 1
    37: goto          17

FYI -

java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
paulsm4
  • A better answer, and maybe you can vote to undelete - https://stackoverflow.com/a/53452836/1746118 ... Side note - I am not the downvoter anyway. – Naman Nov 23 '18 at 21:11
  • @nullpointer - I agree. I'd definitely vote to undelete, if I could. I'd also like to "double upvote" stefan for giving a quantitative definition of "significant". – paulsm4 Nov 23 '18 at 21:14
  • That one was self-deleted since it measured the wrong thing - see that author's comment on the question above – Krease Nov 23 '18 at 21:16
  • With JIT, the bytecode doesn't matter much. Show the JIT code. – rustyx Nov 23 '18 at 21:58
  • Get a [debug jre](https://github.com/ojdkbuild/ojdkbuild/releases) and run with `-XX:+PrintOptoAssembly`. Or just use vtune or alike. – rustyx Nov 23 '18 at 22:42
  • @rustyx - If the problem is the JIT implementation ... then "getting a debug version" OF A COMPLETELY DIFFERENT JRE isn't necessarily going to help. Nevertheless: it sounds like what you found [above](https://stackoverflow.com/a/53453588/421195) with your JIT disassembly on your JRE also explains the behavior on the OP's JRE and mine. And it also explains why other JREs behave "differently". +1: thank you for the excellent detective work! – paulsm4 Nov 24 '18 at 08:06
18

Interesting observation using Java 11 and switching off loop unrolling with the following VM option:

-XX:LoopUnrollLimit=0

The loop with the 2 * (i * i) expression results in more compact native code1:

L0001: add    eax,r11d
       inc    r8d
       mov    r11d,r8d
       imul   r11d,r8d
       shl    r11d,1h
       cmp    r8d,r10d
       jl     L0001

in comparison with the 2 * i * i version:

L0001: add    eax,r11d
       mov    r11d,r8d
       shl    r11d,1h
       add    r11d,2h
       inc    r8d
       imul   r11d,r8d
       cmp    r8d,r10d
       jl     L0001

Java version:

java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)

Benchmark results:

Benchmark          (size)  Mode  Cnt    Score     Error  Units
LoopTest.fast  1000000000  avgt    5  694,868 ±  36,470  ms/op
LoopTest.slow  1000000000  avgt    5  769,840 ± 135,006  ms/op

Benchmark source code:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(1)
public class LoopTest {

    @Param("1000000000") private int size;

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
            .include(LoopTest.class.getSimpleName())
            .jvmArgs("-XX:LoopUnrollLimit=0")
            .build();
        new Runner(opt).run();
    }

    @Benchmark
    public int slow() {
        int n = 0;
        for (int i = 0; i < size; i++)
            n += 2 * i * i;
        return n;
    }

    @Benchmark
    public int fast() {
        int n = 0;
        for (int i = 0; i < size; i++)
            n += 2 * (i * i);
        return n;
    }
}

1 - VM options used: -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:LoopUnrollLimit=0

Oleksandr Pyrohov
  • Wow, that's some braindead asm. Instead of incrementing `i` *before* copying it to calculate `2*i`, it does it after, so it needs an extra `add r11d,2` instruction. Plus it misses the `add same,same` peephole instead of `shl` by 1 (add runs on more ports). It also misses an LEA peephole for `x*2 + 2` (`lea r11d, [r8*2 + 2]`) if it really wants to do things in that order for some crazy instruction-scheduling reason. We could already see from the unrolled version that missing out on LEA was costing it a *lot* of uops, same as both loops here. – Peter Cordes Dec 02 '18 at 02:50
  • `lea eax, [rax + r11 * 2]` would replace 2 instructions (in both loops) if the JIT compiler had time to look for that optimization in long-running loops. Any decent ahead-of-time compiler would find it. (Unless maybe tuning only for AMD, where scaled-index LEA has 2 cycle latency, so maybe not worth it.) – Peter Cordes Dec 02 '18 at 02:51
15

I tried a JMH benchmark using the default archetype, and I also added an optimized version based on Runemoro's explanation.

@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
//@BenchmarkMode({ Mode.All })
@BenchmarkMode(Mode.AverageTime)
public class MyBenchmark {
  @Param({ "100", "1000", "1000000000" })
  private int size;

  @Benchmark
  public int two_square_i() {
    int n = 0;
    for (int i = 0; i < size; i++) {
      n += 2 * (i * i);
    }
    return n;
  }

  @Benchmark
  public int square_i_two() {
    int n = 0;
    for (int i = 0; i < size; i++) {
      n += i * i;
    }
    return 2*n;
  }

  @Benchmark
  public int two_i_() {
    int n = 0;
    for (int i = 0; i < size; i++) {
      n += 2 * i * i;
    }
    return n;
  }
}

The results are here:

Benchmark                           (size)  Mode  Samples          Score   Score error  Units
o.s.MyBenchmark.square_i_two           100  avgt       10         58,062         1,410  ns/op
o.s.MyBenchmark.square_i_two          1000  avgt       10        547,393        12,851  ns/op
o.s.MyBenchmark.square_i_two    1000000000  avgt       10  540343681,267  16795210,324  ns/op
o.s.MyBenchmark.two_i_                 100  avgt       10         87,491         2,004  ns/op
o.s.MyBenchmark.two_i_                1000  avgt       10       1015,388        30,313  ns/op
o.s.MyBenchmark.two_i_          1000000000  avgt       10  967100076,600  24929570,556  ns/op
o.s.MyBenchmark.two_square_i           100  avgt       10         70,715         2,107  ns/op
o.s.MyBenchmark.two_square_i          1000  avgt       10        686,977        24,613  ns/op
o.s.MyBenchmark.two_square_i    1000000000  avgt       10  652736811,450  27015580,488  ns/op

On my PC (Core i7 860 - it is doing nothing much while I read on my smartphone):

  • n += i*i then n*2 is first
  • 2 * (i * i) is second.

The JVM is clearly not optimizing the same way a human would (based on Runemoro's answer).

Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class

I am no expert on bytecode, but we iload_2 before we imul: that's probably where the difference comes from. I suppose that the JVM can optimize reading i twice (i is already on the stack, and there is no need to load it again) in the 2 * (i * i) version, whilst in the 2 * i * i version it can't.

Peter Mortensen
NoDataFound
  • AFAICT bytecode is pretty irrelevant for performance, and I wouldn't try to estimate what's faster based on it. It's just the source code for the JIT compiler... sure, meaning-preserving reordering of source code lines can change the resulting code and its efficiency, but that's all pretty unpredictable. – maaartinus Nov 26 '18 at 02:33
13

More of an addendum: I reproduced the experiment using the latest Java 8 JVM from IBM:

java version "1.8.0_191"
Java(TM) 2 Runtime Environment, Standard Edition (IBM build 1.8.0_191-b12 26_Oct_2018_18_45 Mac OS X x64(SR5 FP25))
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

And this shows very similar results:

0.374653912 s
n = 119860736
0.447778698 s
n = 119860736

(second results using 2 * i * i).

Interestingly enough, when running on the same machine, but using Oracle Java:

Java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

results are on average a bit slower:

0.414331815 s
n = 119860736
0.491430656 s
n = 119860736

Long story short: even the minor version number of HotSpot matters here, as subtle differences within the JIT implementation can have notable effects.

Peter Mortensen
GhostCat
6

The two methods of adding do generate slightly different byte code:

  17: iconst_2
  18: iload         4
  20: iload         4
  22: imul
  23: imul
  24: iadd

For 2 * (i * i) vs:

  17: iconst_2
  18: iload         4
  20: imul
  21: iload         4
  23: imul
  24: iadd

For 2 * i * i.

And when using a JMH benchmark like this:

@Warmup(iterations = 5, batchSize = 1)
@Measurement(iterations = 5, batchSize = 1)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MyBenchmark {

    @Benchmark
    public int noBrackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * i * i;
        }
        return n;
    }

    @Benchmark
    public int brackets() {
        int n = 0;
        for (int i = 0; i < 1000000000; i++) {
            n += 2 * (i * i);
        }
        return n;
    }

}

The difference is clear:

# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: <none>

Benchmark                      (n)  Mode  Cnt    Score    Error  Units
MyBenchmark.brackets    1000000000  avgt    5  380.889 ± 58.011  ms/op
MyBenchmark.noBrackets  1000000000  avgt    5  512.464 ± 11.098  ms/op

What you observe is correct, and not just an anomaly of your benchmarking style (i.e. no warmup, see How do I write a correct micro-benchmark in Java?)

Running again with Graal:

# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler

Benchmark                      (n)  Mode  Cnt    Score    Error  Units
MyBenchmark.brackets    1000000000  avgt    5  335.100 ± 23.085  ms/op
MyBenchmark.noBrackets  1000000000  avgt    5  331.163 ± 50.670  ms/op

You can see that the results are much closer, which makes sense, since Graal is a better-performing, more modern compiler overall.

So this really just comes down to how well the JIT compiler is able to optimize a particular piece of code; there isn't necessarily a logical reason for it.

Jorn Vernee