
High-level programming languages often provide a function to determine the absolute-value of a floating-point value. For example, in the C standard library, there is the fabs(double) function.

How is this library function actually implemented for x86 targets? What would actually be happening "under the hood" when I call a high-level function like this?

Is it an expensive operation (a combination of multiplication and taking the square root)? Or is the result found just by removing a negative sign in memory?

Cody Gray
AlexG
  • That's not really a matter of the language, but the target platform. – too honest for this site Jun 19 '17 at 12:00
  • Nothing says that the `fabs` implementation has to be the same for both C and C++. Even if they share the same roots and even some basic syntax, they are two *very* different languages. – Some programmer dude Jun 19 '17 at 12:02
  • Thanks. OK, let's assume it is C and the platform is x86. – AlexG Jun 19 '17 at 12:03
  • FYI: gcc and clang are both open source and you can find and look at the source code yourself quite easily. – NathanOliver Jun 19 '17 at 12:03
  • Then I suggest you start by examining the generated code for an `fabs` call, to see what the compiler does with it. If it does call a function, then download the compiler's standard library or the compiler itself (if available; in the case of GCC or Clang it will be) to see what the function does. – Some programmer dude Jun 19 '17 at 12:05
  • @NathanOliver OK. Let's assume I don't know how to read C or assembler, which is certainly what is used to implement math functions, but I know that the PostScript interpreter I am using is implemented in C and that I am working on x86. – AlexG Jun 19 '17 at 12:07
  • @AlexG: We are not a tutoring service. If you ask about a programming problem, you are expected to understand the language well enough! So, what did you find out yourself? What **specifically** is unclear? Which x86 platform is it? x386? 486? i7? – too honest for this site Jun 19 '17 at 12:10
  • Could you suggest a better place for asking? I am implementing time-critical stuff in PostScript, the target interpreter is written in C and I would like to know the consequences of using `abs`. That's all and my question is clear enough, I think. – AlexG Jun 19 '17 at 12:14
  • "(combination of multiplication and taking the square root)?" -- I would be quite stunned if any implementation of `fabs()`, or even `abs()`, tried this. – ad absurdum Jun 19 '17 at 12:38
  • Thank you. That is a satisfying answer. Would you mind adding it? I would accept it. Next time I will ask elsewhere. – AlexG Jun 19 '17 at 12:44
  • @AlexG -- I am not sure that this qualifies as a complete answer, but implementations of `abs` that multiply and then take the square root would risk overflow issues and loss of precision. IAC, the question is now closed, so no answers can be added. – ad absurdum Jun 19 '17 at 12:58
  • Seriously? How is this too broad? This seems like a perfectly reasonable question. – Simon Byrne Jun 19 '17 at 14:57
  • And to answer the question, on x86 it is typically implemented by the `ANDPD` instruction (basically a bitwise AND on a floating-point number). This is a fairly fast instruction, typically 1 clock cycle. – Simon Byrne Jun 19 '17 at 15:06
  • @SimonByrne: Thank you for this convincing answer and for your backing comment before, very much appreciated. Thus, there *is* a typical implementation on x86 platforms, in contrast to what most of the other comments claim. – AlexG Jun 20 '17 at 15:41
  • Also many thanks to @CodyGray for improving the question. – AlexG Jun 20 '17 at 15:44

1 Answer


In general, computing the absolute-value of a floating-point quantity is an extremely cheap and fast operation.

In practically all cases, you can simply treat the fabs function from the standard library as a black box, sprinkling it in your algorithms where necessary, without any need to worry about how it will affect execution speed.

If you want to understand why this is such a cheap operation, then you need to know a little bit about how floating-point values are represented. Although the C and C++ language standards do not actually mandate it, most implementations follow the IEEE-754 standard. In that standard, each floating-point value's representation contains a bit known as the sign bit, and this marks whether the value is positive or negative. For example, consider a double, which is a 64-bit double-precision floating-point value:

Bit-level representation of a double-precision floating-point value
     (Image courtesy of Codekaizen, via Wikipedia, licensed under CC-bySA.)

You can see the sign bit over there on the far left, in light blue. This is true for all precisions of floating-point values in IEEE-754. Therefore, taking the absolute value basically just amounts to clearing a single bit in the value's representation in memory. In particular, you just need to mask off the sign bit (a bitwise AND), forcing it to 0 and thus making the value non-negative.
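
To make this concrete, here is a minimal sketch in C (purely illustrative, not the actual library source; the name my_fabs is made up for the example) that clears the sign bit of a double by hand:

#include <stdint.h>
#include <string.h>

/* Illustrative only: clear bit 63 (the IEEE-754 sign bit) of a double.
   memcpy is used to move the bits around without violating C's aliasing rules. */
double my_fabs(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the 64 bits of the double */
    bits &= 0x7FFFFFFFFFFFFFFFULL;    /* force the sign bit to 0 */
    memcpy(&x, &bits, sizeof x);
    return x;
}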

Assuming that your target architecture has hardware support for floating-point operations, this is generally a single, one-cycle instruction—basically, as fast as can possibly be. An optimizing compiler will inline a call to the fabs library function, emitting that single hardware instruction in its place.
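
For example, a trivial wrapper like the following (the name magnitude is just for illustration) is typically compiled down to that single instruction when optimizations are enabled; no call into the library remains in the generated code:

#include <math.h>

/* With optimization enabled, x86-64 compilers typically inline this call and
   emit a single bitwise-AND of the value with a constant mask, rather than
   an actual call into the C library. */
double magnitude(double x)
{
    return fabs(x);
}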

If your target architecture doesn't have hardware support for floating-point (which is pretty rare nowadays), then there will be a library that emulates these semantics in software, thus providing floating-point support. Typically, floating-point emulation is slow, but finding the absolute value is one of the fastest things you can do, since it is literally just manipulating a bit. You'll pay the overhead of a function call to fabs, but at worst, the implementation of that function will just involve reading the bytes from memory, masking off the sign bit, and storing the result back to memory.

Looking specifically at x86, which does implement IEEE-754 in hardware, there are two main ways that your C compiler will transform a call to fabs into machine code.

In 32-bit builds, where the legacy x87 FPU is being used for floating-point operations, it will emit an fabs instruction. (Yep, same name as the C function.) This clears the sign bit, if set, of the floating-point value at the top of the x87 register stack. On older AMD processors and the Intel Pentium 4, fabs has a throughput of one instruction per cycle with a 2-cycle latency. On AMD Ryzen and all other Intel processors, it is a 1-cycle instruction with a 1-cycle latency.

In 32-bit builds that can assume SSE support, and on all 64-bit builds (where SSE is always supported), the compiler will emit an ANDPS instruction* that does exactly what I described above: it bitwise-ANDs the floating-point value with a constant mask, masking out the sign bit. Notice that SSE doesn't have a dedicated absolute-value instruction like x87 does, but it doesn't need one, because the general-purpose bitwise instructions do the job just fine. The execution characteristics (cycles, latency, etc.) vary a bit more widely from one processor microarchitecture to another, but the instruction generally has a throughput of 1–3 cycles, with similar latency. If you like, you can look it up in Agner Fog's instruction tables for the processors of interest.
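
If you want to spell out the same operation yourself, a rough sketch using SSE intrinsics (not the library's code; the helper name abs_ps is made up) looks like this:

#include <xmmintrin.h>

/* Clear the sign bit of four packed single-precision floats, which is
   essentially what ANDPS with a 0x7FFFFFFF mask does. */
static inline __m128 abs_ps(__m128 x)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* only the sign bit set */
    return _mm_andnot_ps(sign_mask, x);           /* computes (~sign_mask) & x */
}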

If you're really interested in digging into this, you might take a look at this answer on the fastest way to compute an absolute value using SSE: https://stackoverflow.com/questions/32408665/fastest-way-to-compute-absolute-value-using-sse (hat tip to Peter Cordes). It explores a variety of different ways to implement an absolute-value function using SSE instructions, comparing their performance and discussing how you can get a compiler to generate the appropriate code. As you can see, since you're just manipulating bits, there are a variety of possible solutions! In practice, though, the current crop of compilers does exactly as I've described for the C library function fabs, which makes sense, because this is the best general-purpose solution.

__
* Technically, this might also be ANDPD, where the D means "double" (and the S means "single"), but ANDPD requires SSE2 support. SSE supports only single-precision floating-point operations and was available all the way back to the Pentium III. SSE2 is required for double-precision floating-point operations and was introduced with the Pentium 4. SSE2 is always supported on x86-64 CPUs. Whether ANDPS or ANDPD is used is a decision made by the compiler's optimizer; sometimes you will see ANDPS being used on a double-precision floating-point value, since it just requires writing the mask the right way.
Also, on CPUs that support AVX instructions, you'll generally see a VEX-prefix on the ANDPS/ANDPD instruction, so that it becomes VANDPS/VANDPD. Details on how this works and what its purpose is can be found elsewhere online; suffice it to say that mixing VEX and non-VEX instructions can result in a performance penalty, so compilers try to avoid it. Again, though, both of these versions have the same effect and virtually identical execution speeds.

Oh, and because SSE is a SIMD instruction set, it is possible to compute the absolute value of multiple floating-point values at once. This, as you might imagine, is especially efficient. Compilers with auto-vectorization capabilities will generate code like this where possible. Example (mask can either be generated on-the-fly, as shown here, or loaded as a constant):

pcmpeqd xmm1, xmm1    ; generate the mask (all 1s) in a temporary register
psrld   xmm1, 1       ; put 1s in all but the left-most bit of each packed dword
andps   xmm0, xmm1    ; mask off sign bit in each packed floating-point value
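
For reference, the same on-the-fly approach expressed with intrinsics might look roughly like this (again just a sketch; the integer operations require SSE2, and the helper name abs4_ps is made up):

#include <emmintrin.h>

/* Generate the 0x7FFFFFFF mask in a register, then AND it with four packed
   floats at once, mirroring the assembly above. */
static inline __m128 abs4_ps(__m128 x)
{
    __m128i zero = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi32(zero, zero);    /* all bits set */
    __m128i mask = _mm_srli_epi32(ones, 1);        /* 0x7FFFFFFF in each dword */
    return _mm_and_ps(x, _mm_castsi128_ps(mask));  /* clear each sign bit */
}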

Cody Gray
  • Dear @CodyGray, thank you so much for this thorough and comprehensible answer! It is so well formulated that it sounds like it was copied from a textbook :) – AlexG Jun 23 '17 at 10:25
  • You are most welcome. I thought the question deserved an answer, despite people's concerns with its initial formulation. The topic can be made very complicated, but I tried to keep it as simple as possible without *over*-simplifying. I'm glad you found it helpful, @Alex. Not copied from a textbook, though; just banged out in a few minutes on my keyboard. :-) I haven't even found textbooks that cover these types of things. – Cody Gray Jun 23 '17 at 10:33
  • If you want an over-complicated version, see my [SSE absolute-value answer](https://stackoverflow.com/questions/32408665/fastest-way-to-compute-absolute-value-using-sse) with C intrinsics. I was trying to get the compiler to generate the mask on the fly instead of loading it, which is probably silly (and very hard to do with some compilers). I should go and simplify that answer, since part of the problem was looking for `_mm_uninitialized_ps();` instead of the actual `_mm_undefined_ps();` that compilers do actually support. – Peter Cordes Jun 28 '17 at 04:36
  • Also, that question suggests some novel ideas for an fabs() implementation: e.g. subtract from 0 and then `maxps`, which would work but has a much longer critical path. – Peter Cordes Jun 28 '17 at 04:48
  • @Peter Thanks for the pointer. I had upvoted your answer a long time ago, but since forgotten about it. I included a link here, so it doesn't get lost in the comments. If I have some time, I'll try to remember to dig in and see if I can get MSVC to generate the code you want there. Generating the mask on-the-fly is probably a better solution in one-off situations because it avoids the cache miss of loading the constant from memory, but you seem to be assuming in several places that the constant will be duplicated. It isn't; the compiler emits a single global constant when you call `fabs`. – Cody Gray Jun 28 '17 at 10:49
  • As such, I can't imagine what the benefit would be to `const __m128 absmask = _mm_castsi128_ps(_mm_set1_epi32(~(1<<31)));`. All that does is create extra bloat because of the duplicate symbol, and increase the chances of a miss. I'm skeptical of whether the theoretical optimizations (based on cycle counts, expected latencies, etc.) will actually bear fruit in real code. Sure, if you've got an inner loop where you're repeatedly finding the abs value, it may be worth it to optimize, but then the whole landscape changes because you can just load the constant into a register at the top of the loop. – Cody Gray Jun 28 '17 at 10:52
  • `const __m128 absmask = _mm_castsi128_ps(_mm_set1_epi32(~(1<<31)));` is supposed to be a local variable. It's just another way of writing `_mm_set1_ps(-0.0f)`, in case you'd rather think about bit patterns and don't want to worry about what a compiler will do with negative zero when compiling with `-ffast-math`. It should still share the same literal constant in memory. I wrote that answer almost 2 years ago, when I didn't know as much. Totally agreed that this only matters much in a weird case where the compiler can't / doesn't inline something, but you can gen on the fly before a loop. – Peter Cordes Jun 28 '17 at 18:34
  • @CodyGray really love this answer! Especially that you included how to create the constant mask quickly. If I can add anything useful, it's the fact that even the x87 fabs opcode was specified as simply clearing the sign bit. It's worth pointing out that this is the right thing to do, and not just a hack. http://c9x.me/x86/html/file_module_x86_id_80.html has a good table about the corner cases. – starmole Jun 30 '17 at 03:40
  • Right, it's no hack, @starmole. But it *is* tied to the IEEE 754 representation of floating-point numbers. It makes good sense that `fabs` would be doing it this way. The 8087 was the first CPU to implement IEEE 754 (actually, it was an early draft; Intel was instrumental in drafting and pushing this standard). But it is possible for a CPU to represent FP values differently, and that would require a different implementation for `fabs`. That's why the initial formulation of the question was problematic (too broad), and I narrowed the scope to x86 in hopes of getting it reopened (success!). – Cody Gray Jun 30 '17 at 09:41