
When I try to optimize my code, I've long just used a rule of thumb: addition and subtraction cost 1, multiplication and division cost 3, squaring costs 3 (I rarely use the more general pow function, so I have no rule of thumb for it), and square roots cost 10. (I assume squaring a number is just a multiplication, hence the cost of 3.)

Here's an example from a 2D orbital simulation. To calculate and apply the acceleration from gravity, I first get the distance from the ship to the center of the Earth, then calculate the acceleration.

D = sqrt( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 19
A = G*Earth.mass/sqr(D);                                   // this is worth 9, total is 28

However, notice that calculating D takes a square root, and then the very next calculation squares it again. Therefore you can just do this:

A = G*Earth.mass/( sqr(Ship.x - Earth.x) + sqr(Ship.y - Earth.y) ); // this is worth 15

So if my rule of thumb is true, I've almost cut the cycle count in half.
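(For concreteness, here is roughly what the two versions might look like in C; the Body struct, G, ship and earth below are just placeholder names for illustration, not my actual simulation code.)

#include <math.h>

typedef struct { double x, y, mass; } Body;

static const double G = 6.674e-11;

/* Original: compute the distance with sqrt, then square it again in the division. */
double accel_with_sqrt(const Body *ship, const Body *earth) {
    double dx = ship->x - earth->x;
    double dy = ship->y - earth->y;
    double d  = sqrt(dx * dx + dy * dy);
    return G * earth->mass / (d * d);
}

/* Optimized: the sqrt and the squaring cancel, so divide by dx*dx + dy*dy directly. */
double accel_no_sqrt(const Body *ship, const Body *earth) {
    double dx = ship->x - earth->x;
    double dy = ship->y - earth->y;
    return G * earth->mass / (dx * dx + dy * dy);
}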

However, I cannot even remember where I heard that rule. What are the actual cycle times for these basic arithmetic operations?

Assumptions:

  • everything is a 64-bit floating-point number on the x64 architecture
  • everything is already loaded into registers, so there is no need to worry about cache or memory hits and misses
  • no interrupts to the CPU
  • no if/branch logic, so nothing like branch prediction is involved

Edit: I suppose what I'm really trying to do is look inside the ALU and count only the cycle time of its logic for the six operations. If there is still variance within that, please explain what and why.

Note: I did not see any tag for machine code, so I chose the next closest thing, assembly. To be clear, I am talking about actual machine-code operations on the x64 architecture. Thus it doesn't matter whether the lines of code I wrote above are in C#, C, JavaScript, or whatever; each high-level language will have its own varying times, so I don't want to get into an argument over that. I think it's a shame there's no machine-code tag, because when talking about performance and/or operations you really need to get down to that level.

    It's trickier than you're assuming (cost is not 1-dimensional, and varies), but division is worse than you think (about as bad as square root). You can look up recent details [here](http://users.atw.hu/instlatx64/) or through other links that you can find in the x86 tag wiki – harold Sep 30 '17 at 18:35
  • This 'rule' of yours is very rough and in the case of division, outright wrong. 'cycle time' is also quite tricky to nail down on a modern CPU. Your best bet is to measure while keeping in mind the basic outline - addition and subtraction are cheapest, etc. – pvg Sep 30 '17 at 18:38
  • @harold Ah, yes I forgot about memory hits and misses. I forgot to include the assumption that the values are already loaded into two registers. Will edit. – DrZ214 Sep 30 '17 at 18:39
  • https://youtu.be/hiVZCNs6Jzs here he talks about addition and multiplication. Here he talks about memory and the n-body problem specifically: https://youtu.be/3gzqChavjk4 – PrincePolka Sep 30 '17 at 19:15
  • Execution time is not as linear as you are trying to make it. Even if you knew how many cycles each instruction takes, execution on an x86(-64) is heavily dependent on many factors: the instructions around the instruction under test, the cache state, memory, etc. It has been a very long time since one could count the operations (assuming the high-level language maps directly to the machine code) and get a rough idea of execution time, or of how two solutions compare. – old_timer Sep 30 '17 at 19:25
  • (sure some other processors like PIC and a few others are more deterministic) – old_timer Sep 30 '17 at 19:27
  • @old_timer Well my assumption is that the registers are already loaded. Can you name some other factors that cause variance? I hope you're not thinking about interrupts to the CPU or if/branch logic, but I guess I'll explicitly add those assumptions to my bullet points. I suppose what I'm really trying to do is figure out how long it takes the ALU to do its thing, and not worry about what the CPU has to do to manage it. I would think the ALU logic and cycles are countable/measurable, at least. – DrZ214 Sep 30 '17 at 19:35
  • You mention 6 operations, but list only add, sub, div, mul and sqrt. What's the 6th? – BeeOnRope Sep 30 '17 at 19:52
  • @BeeOnRope Square, or I suppose the general `pow` function for any exponent. Will edit. – DrZ214 Sep 30 '17 at 19:59
  • @DrZ214 - well there is no `pow`-type instruction in x86, but you can look at software solutions [here](https://stackoverflow.com/q/4638473/149138). – BeeOnRope Sep 30 '17 at 20:14
  • @DrZ214 I guess I don't see the value in that; the number of clocks for the ALU operation is the least interesting part of the problem and doesn't affect your performance. Assume one clock; divide, for example, might be a few stages in the pipe, but being a pipe, and being superscalar, you won't see that. You are using optimization and performance tags as if you are interested in both, but don't really appear to be interested in either. – old_timer Oct 01 '17 at 00:08
  • Division has low throughput, but an occasional division can be as cheap as a mul or add if out-of-order execution can hide the extra latency: https://stackoverflow.com/questions/4125033/floating-point-division-vs-floating-point-multiplication/45899202#45899202. The key is that division has low impact on the throughput of *other* non-divide instructions, but the divide unit itself isn't fully pipelined. – Peter Cordes Oct 01 '17 at 18:14
  • 1
    @old_timer - plenty of FP heavy algorithms are directly affected by FPU latency and throughout. The remainder are largely affected by memory (including cache) latency and throughout. – BeeOnRope Oct 01 '17 at 18:15
  • I posted a probably ill-advised [answer on gamedev.SE](https://gamedev.stackexchange.com/questions/27196/which-opcodes-are-faster-at-the-cpu-level/104534#104534) which tries to assign a single "cost" number to different operations on modern x86 CPUs. More than half of the answer explains why this is bogus and only a *very* rough approximation, especially for floating point where latency is much worse than throughput. – Peter Cordes Oct 01 '17 at 18:18

1 Answer


At a minimum, one must understand that an operation has at least two interesting timings: the latency and the throughput.

Latency

The latency is how long any particular operation takes, from its inputs to its output. If you had a long series of operations where the output of one operation is fed into the input of the next, the latency would determine the total time. For example, an integer multiplication on most recent x86 hardware has a latency of 3 cycles: it takes 3 cycles to complete a single multiplication operation. Integer addition has a latency of 1 cycle: the result is available the cycle after the addition executes. Latencies are generally positive integers.
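As a rough illustration (a generic sketch; sum_serial below is just an illustrative function, not anything from the question): in a loop like this one, every addition depends on the previous one, so the whole reduction runs at roughly one add latency per element no matter how many adders the CPU has.

#include <stddef.h>

/* A single dependency chain: each s += a[i] must wait for the previous add,
   so the loop is bound by add *latency*. */
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}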

Throughput

The throughput is the number of independent operations that can be performed per unit time. Since CPUs are pipelined and superscalar, this is often more than the inverse of the latency. For example, on most recent x86 chips, 4 integer addition operations can execute per cycle, even though the latency is 1 cycle. Similarly, on average 1 integer multiplication can start per cycle, even though any particular multiplication takes 3 cycles to complete (which means you must have multiple independent multiplications in flight at once to achieve this).
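A hedged sketch of how code exposes throughput rather than latency: splitting the same reduction across several independent accumulators keeps multiple additions in flight at once (assuming the compiler doesn't already transform the loop for you; the names here are again just illustrative).

#include <stddef.h>

/* Four independent accumulators -> four add chains in flight,
   so the loop approaches the add *throughput* limit instead. */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}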

Inverse Throughput

When discussing instruction performance, it is common to give throughput numbers as "inverse throughput", which is simply 1 / throughput. This makes it easy to compare directly with latency figures without doing a division in your head. For example, the inverse throughput of addition is 0.25 cycles, versus a latency of 1 cycle, so you can immediately see that if you have sufficient independent additions, they cost only something like 0.25 cycles each.

Below I'll use inverse throughput.

Variable Timings

Most simple instructions have fixed timings, at least in their reg-reg form. Some more complex mathematical operations, however, may have input-dependent timings. For example, addition, subtraction and multiplication usually have fixed timings in their integer and floating point forms, but on many platforms division has variable timings in integer, floating point or both. Agner's numbers often show a range to indicate this, but you shouldn't assume the operand space has been tested extensively, especially for floating point.

The Skylake numbers below, for example, show a small range, but it isn't clear if that's due to operand dependency (which would likely be larger) or something else.

Passing denormal inputs, or producing results that are themselves denormal, may incur significant additional cost, depending on the denormal mode. The numbers you'll see in the guides generally assume no denormals, but you may be able to find a discussion of denormal costs per operation elsewhere.
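As an aside, and only as a hedged sketch: on x86 the usual way to sidestep the denormal penalty is to set the flush-to-zero and denormals-are-zero bits in MXCSR, at the cost of strict IEEE behaviour. With SSE intrinsics that looks roughly like this:

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

/* Flush denormal results to zero and treat denormal inputs as zero,
   so FP operations stay on the fast path. */
void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}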

More Details

The above is necessary but often not sufficient information to fully qualify performance, since there are other factors to consider, such as execution port contention, front-end bottlenecks, and so on. It's enough to start with, though, and you are only asking for "rule of thumb" numbers, if I understand correctly.

Agner Fog

My recommended source for measured latency and inverse throughput numbers is Agner Fog's set of guides. You want the files under 4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, which list fairly exhaustive timings for a huge variety of AMD and Intel CPUs. You can also get the numbers for some CPUs directly from Intel's guides, but I find them less complete and more difficult to use than Agner's.

Below I'll pull out the numbers for a couple of modern CPUs, for the basic operations you are interested in.

Intel Skylake

                         Lat  Inv Tpt
add/sub (addsd, subsd)     4      0.5
multiply (mulsd)           4      0.5
divide (divsd)         13-14        4
sqrt (sqrtpd)          15-16      4-6

So a "rule of thumb" for latency would be add/sub/mul all cost 1, and division and sqrt are about 3 and 4, respectively. For throughput, the rule would be 1, 8, 8-12 respectively. Note also that the latency is much larger than the inverse throughput, especially for add, sub and mul: you'd need 8 parallel chains of operations if you wanted to hit the max throughput.

AMD Ryzen

                         Lat  Inv Tpt
add/sub (addsd, subsd)     3      0.5
multiply (mulsd)           4      0.5
divide (divsd)          8-13      4-5
sqrt (sqrtpd)          14-15      4-8

The Ryzen numbers are broadly similar to recent Intel: addition and subtraction have slightly lower latency, and multiplication is the same. Latency-wise, the rule of thumb could still be summarized, with some loss of precision, as 1 for add/sub/mul, 3 for div, and 4 for sqrt.

Here, the latency range for divide is fairly large, so I expect it is data dependent.

BeeOnRope
  • Thanks, this is a pretty darn good answer. It appears there is no `sqr` equivalent in the x86 instruction set, and certainly not a `pow`, so I will just assume it is a multiplication. – DrZ214 Sep 30 '17 at 20:25
  • instlatx64 has the latency of sqrtpd at 13 to 18 on [skylake](http://users.atw.hu/instlatx64/GenuineIntel00506E3_Skylake_InstLatX64.txt), odd that it's different.. it's also slightly more explicit and lists the 13 with the input being 0.0 or 1.0 (but possibly applies to other numbers with 0 significand?). Perhaps someone who has the actual chip can confirm – harold Sep 30 '17 at 20:27
  • To do `sqr` you'd just use the same register as both operands to the `mulsd`, like `mulsd xmm0, xmm0`. – BeeOnRope Sep 30 '17 at 20:27
  • @harold - yes as soon as you get into timing "ranges" you need a lot more details to fully specify everything. Clearly neither approach searches the entire input domain (code isn't available for either implementation, AFAIK), but perhaps the AIDA64 one searches a larger range of inputs so gets a larger range of timings. I can test this on my Skylake if someone tells me what are likely to be the "interesting" inputs. – BeeOnRope Sep 30 '17 at 20:30
  • 2
    Maybe it depends on the popcnt of the significand, or the index of the lowest set bit, maybe something else, that sounds reasonably testable – harold Sep 30 '17 at 20:49
  • 1
    @DrZ214: note that if you compile for a target with FMA instructions (like Haswell or Bulldozer), you'll get asm output that looks like `tmp1 = Ship.x - Earth.x`, `tmp2 = Ship.y - Earth.y`, `tmp3 = tmp2*tmp2`, result = `fma(tmp1, tmp1, tmp3)`, where the last step computes `tmp1*tmp1 + tmp3` in a single instruction that runs about as fast as a multiply. And of course the costs of things are 2x or 4x cheaper if the compiler can auto-vectorize with SIMD. (SSE/AVX). With SSE but not AVX, you also sometimes have front-end bottlenecks from MOV instructions to copy registers. – Peter Cordes Oct 01 '17 at 18:37