15

Yeah, I meant to say 80-bit. That's not a typo...

My experience with floating point variables has always involved 4-byte multiples, like singles (32 bit), doubles (64 bit), and long doubles (which I've seen referred to as either 96-bit or 128-bit). That's why I was a bit confused when I came across an 80-bit extended precision data type while I was working on some code to read and write to AIFF (Audio Interchange File Format) files: an extended precision variable was chosen to store the sampling rate of the audio track.
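For reference, a rough sketch of how that 10-byte field can be decoded into a double when reading the COMM chunk (the helper name is illustrative, and special values like infinities and subnormals are ignored for brevity):

```c
#include <stdint.h>
#include <math.h>
#include <stdio.h>

/* Decode the 80-bit (10-byte) extended-precision sample rate from an AIFF
   COMM chunk.  AIFF stores it big-endian: 1 sign bit, 15 exponent bits
   (bias 16383), then a 64-bit mantissa with an explicit integer bit. */
static double extended_to_double(const uint8_t b[10])
{
    int      sign     = b[0] >> 7;
    int      exponent = ((b[0] & 0x7F) << 8) | b[1];
    uint64_t mantissa = 0;

    for (int i = 2; i < 10; i++)
        mantissa = (mantissa << 8) | b[i];

    if (exponent == 0 && mantissa == 0)
        return 0.0;

    /* value = (-1)^sign * mantissa * 2^(exponent - 16383 - 63) */
    double value = ldexp((double)mantissa, exponent - 16383 - 63);
    return sign ? -value : value;
}

int main(void)
{
    /* 44100 Hz as stored in an AIFF file: 0x400E AC44 0000 0000 0000 */
    const uint8_t rate[10] = {0x40, 0x0E, 0xAC, 0x44, 0, 0, 0, 0, 0, 0};
    printf("sample rate: %g Hz\n", extended_to_double(rate));
    return 0;
}
```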

When I skimmed through Wikipedia, I found the link above along with a brief mention of 80-bit formats in the IEEE 754-1985 standard summary (but not in the IEEE 754-2008 standard summary). It appears that on certain architectures "extended" and "long double" are synonymous.

One thing I haven't come across is specific applications that make use of extended precision data types (except for, of course, AIFF file sampling rates). This led me to wonder:

  • Has anyone come across a situation where extended precision was necessary/beneficial for some programming application?
  • What are the benefits of an 80-bit floating point number, other than the obvious "it's a little more precision than a double but fewer bytes than most implementations of a long double"?
  • Is its applicability waning?
phuclv
gnovice

5 Answers

24

Intel's FPUs use the 80-bit format internally to get more precision for intermediate results.

That is, you may have 32-bit or 64-bit variables, but when they are loaded into the FPU registers, they are converted to 80 bits; the FPU then (by default) performs all calculations in 80 bits; after the calculation, the result is stored back into a 32-bit or 64-bit variable.

BTW - A somewhat unfortunate consequence of this is that debug and release builds may produce slightly different results: in the release build, the optimizer may keep an intermediate variable in an 80-bit FPU register, while in the debug build, it will be stored in a 64-bit variable, causing loss of precision. You can avoid this by using 80-bit variables, or by using an FPU control setting (or compiler option) to perform all calculations in 64 bits.
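A minimal illustration of the extra precision, assuming a platform where long double maps to the 80-bit x87 format (e.g. gcc on x86; MSVC maps long double to plain 64-bit double, so there it shows nothing):

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("double significand bits:      %d\n", DBL_MANT_DIG);  /* 53 */
    printf("long double significand bits: %d\n", LDBL_MANT_DIG); /* 64 with the 80-bit format */

    /* 1e-19 is far below half an ulp of 1.0 in double, so the sum rounds
       straight back to 1.0; in the 80-bit format a nonzero difference
       (the nearest representable value, about 1.08e-19) survives. */
    double      d  = 1.0  + 1e-19;
    long double ld = 1.0L + 1e-19L;
    printf("double:      (1 + 1e-19) - 1 = %g\n",  d  - 1.0);
    printf("long double: (1 + 1e-19) - 1 = %Lg\n", ld - 1.0L);
    return 0;
}
```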

oefe
  • Sounds like one of those "side effects" involving "subtle differences in the behaviour of the arithmetic" that the Wikipedia page mentions. =) So, since the IEEE 754-2008 specs mention 128-bit "quad" formats, should we expect 80-bit FPUs to get phased out soon? – gnovice Mar 07 '09 at 03:34
  • I don't know where the standard is heading, but I would expect that at least Intel will keep 80-bit support for a long time to come to maintain compatibility, even if they add 128-bit support. – oefe Mar 07 '09 at 20:15
  • @gnovice: Unlikely; the 80-bit format is still a valid IEEE-754 (2008) type. Specifically, it is one of many options for a "binary64 extended" type allowed by the IEEE-754 standard. That said, most platforms either use or are moving toward using SSE (native 32- and 64-bit) for floating-point calculation because it offers better performance. – Stephen Canon Apr 25 '11 at 05:49
  • When using Borland compilers, the easy way to avoid different behavior on debug and release builds was simply to use variables of 80-bit types. Too bad Microsoft never supported them. – supercat May 07 '14 at 19:28
  • @supercat Because MS wants to use the faster SSE instead of x87. – phuclv Jun 15 '14 at 01:38
  • @LưuVĩnhPhúc: Even if SSE is faster than x87, I don't know that it's so much faster that x87 wouldn't be faster in cases where an 80-bit type could compute things directly but a 64-bit type would require a multi-step computation. For example, while Kahan summation with double will accurately record up to 114 bits of significand, it requires multiple computations for each value to be summed. Many summation scenarios which would require Kahan summation in the absence of an extended type could simply be done with one extended-precision ADD for each value. – supercat Jun 15 '14 at 16:53
  • @LưuVĩnhPhúc: Actually, what I'd like to see, given the popularity of 3d graphics, would be types which pack three 21-, 40-, or 80-bit variables into a 64-, 128-, or 256-bit data type, along with (for the larger types) an 8- or 16-bit general-purpose field. That would pack nicely, while offering better precision than 16-, 32-, or 64-bit floats. – supercat Jun 15 '14 at 17:00
9

For me the use of 80 bits is ESSENTIAL. This way I get high-order (30,000) eigenvalues and eigenvectors of symmetric matrices with four more figures when using the GOTO library for vector inner products, viz., 13 instead of 9 significant figures for the kind of matrices that I use in relativistic atomic calculations, which is necessary to avoid falling into the sea of negative-energy states. My other option is using quadruple-precision arithmetic that increases CPU time 60-70 times and also increases RAM requirements. Any calculation relying on inner products of large vectors will benefit. Of course, in order to keep partial inner product results within registers it is necessary to use assembler language, as in the GOTO libraries. This is how I came to love my old Opteron 850 processors, which I will be using as long as they last for that part of my calculations.

The reason 80 bits is fast, whereas greater precision is so much slower, is that the CPU's standard floating-point hardware has 80-bit registers. Therefore, if you want the extra 16 bits (11 extra bits of mantissa, four extra bits of exponent and one extra bit effectively unused), then it doesn't really cost you much to extend from 64 to 80 bits -- whereas to extend beyond 80 bits is extremely costly in terms of run time. So, you might as well use 80-bit precision if you want it. It is not cost-free to use, but it comes pretty cheap.
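For illustration, the core idea in portable C. This is only a sketch of the extended-accumulator trick, not the GOTO library's hand-tuned assembly, and it only helps on platforms where long double is actually the 80-bit format:

```c
#include <stddef.h>

/* Inner product with the running sum kept in an 80-bit register:
   roughly 11 extra significand bits over a plain double accumulator,
   at almost no extra cost on x87 hardware. */
double dot_extended(const double *x, const double *y, size_t n)
{
    long double acc = 0.0L;
    for (size_t i = 0; i < n; i++)
        acc += (long double)x[i] * y[i];
    return (double)acc;   /* round to double once, at the very end */
}
```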

thb
  • In many cases, rewriting your library to take advantage of SIMD instructions would result in a much bigger speedup than using extended precision. Choosing wisely how to store the limbs of a quadruple/extended-precision value in the SSE/AVX registers will allow you to do arithmetic on multiple values at once. – phuclv Mar 28 '17 at 06:34
  • For example, store the 16-bit exponent+sign parts of each value in the AVX2 ymm1 register, let ymm2 store the high 64 bits of values 1 to 4 and ymm3 the low 64 bits of values 1 to 4... and now you can do things on 4-16 values at once: http://stackoverflow.com/a/27978043/995714 – phuclv Mar 28 '17 at 06:48
  • Or just use [double-double arithmetic](https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic). That'll provide slightly less than quadruple precision, but with significantly more speed: [Emulate “double” using 2 “float”s](http://stackoverflow.com/q/6769881/995714), [float128 and double-double arithmetic](http://stackoverflow.com/q/31647409/995714), http://stackoverflow.com/q/9857418/995714. There are tons of better solutions; you just have to do enough research. – phuclv Mar 29 '17 at 16:56 (a rough sketch of the two-sum building block follows below)
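For reference, a minimal sketch of the double-double building block mentioned in the comment above (Knuth's TwoSum). It assumes strict IEEE double evaluation (e.g. SSE math rather than x87 excess precision), and the `dd` type and function names are only illustrative, not from any particular library:

```c
/* Double-double arithmetic: represent a value as an unevaluated sum hi + lo
   of two doubles, giving roughly 106 significand bits. */
typedef struct { double hi, lo; } dd;

/* Error-free transformation: returns s and e with s + e == a + b exactly. */
static dd two_sum(double a, double b)
{
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);
    return (dd){ s, e };
}

/* Add a plain double to a double-double value (a sketch, not a full library). */
static dd dd_add(dd x, double y)
{
    dd t = two_sum(x.hi, y);
    t.lo += x.lo;
    return two_sum(t.hi, t.lo);
}
```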
5

Wikipedia explains that an 80-bit format can represent an entire 64-bit integer without losing information. Thus the floating-point unit of the CPU can be used to implement multiplication and division for integers.
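A quick way to see this property, on a platform where long double has a 64-bit significand (the value chosen here is only illustrative):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t n = (1ULL << 63) + 1;    /* needs all 64 significand bits */

    double      d  = (double)n;       /* only 53 bits: rounds to 2^63  */
    long double ld = (long double)n;  /* exact in the 80-bit format    */

    printf("via double:      %s\n", (uint64_t)d  == n ? "exact" : "rounded");
    printf("via long double: %s\n", (uint64_t)ld == n ? "exact" : "rounded");
    return 0;
}
```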

Nathan Kitchen
2

Another advantage of 80-bit types, not yet mentioned, shows up on 16-bit or 32-bit processors which don't have floating-point units but do have a "multiply" instruction that produces a result twice as long as the operands (16x16->32 or 32x32->64). On such processors, arithmetic on a 64-bit mantissa subdivided into four 16-bit or two 32-bit registers will be faster than arithmetic on a 53-bit mantissa which spans the same number of registers but has to share 12 register bits with the sign and exponent. For applications which don't need anything more precise than float, computations on a 48-bit "extended float" type could likewise be faster than computations on a 32-bit float.
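To make the register arithmetic concrete, here is a sketch (in portable C, with illustrative names) of the kind of mantissa multiply a soft-float library performs. On a 32-bit CPU each product below is a single 32x32->64 multiply instruction, and with an 80-bit format the full 64-bit significand occupies these registers with no sign or exponent bits to mask out:

```c
#include <stdint.h>

/* Multiply two 64-bit mantissas into a 128-bit product using only
   32x32->64 partial products, as a soft-float routine on a 32-bit CPU would. */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    uint64_t mid  = p1 + (p0 >> 32);       /* cannot overflow          */
    uint64_t mid2 = p2 + (uint32_t)mid;    /* carries into the high word */

    *lo = (mid2 << 32) | (uint32_t)p0;
    *hi = p3 + (mid >> 32) + (mid2 >> 32);
}
```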

While some people might bemoan the double-rounding behavior of extended-precision types, that is realistically only an issue in specialized applications requiring full bit-exact cross-platform reproducibility. From an accuracy standpoint, the difference between a worst-case rounding error of 64/128 ulp and 65/128 ulp (or 1024/2048 ulp and 1025/2048 ulp) is a non-issue. In languages with extended-precision variable types and consistent extended-precision semantics, use of extended types on many platforms without floating-point hardware (e.g. embedded systems) will offer both higher accuracy and better speed than single- or double-precision floating-point types.

supercat
1

I used 80-bit for some pure math research. I had to sum terms in an infinite series that grew quite large, outside the range of doubles. Convergence and accuracy weren't concerns, just the ability to handle large exponents like 1E1000. Perhaps some clever algebra could have simplified things, but it was way quicker and easier to just code an algorithm with extended precision, than to spend any time thinking about it.
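For illustration, on a platform where long double is the 80-bit format (its 15-bit exponent reaches about 1.2e4932, versus about 1.8e308 for double):

```c
#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    printf("DBL_MAX  = %g\n",  DBL_MAX);    /* about 1.8e308 */
    printf("LDBL_MAX = %Lg\n", LDBL_MAX);   /* about 1.2e4932 with the 80-bit format */

    /* A term like 1e1000 overflows any double, but is an ordinary value here. */
    long double big = powl(10.0L, 1000.0L);
    printf("10^1000  = %Lg (finite: %d)\n", big, isfinite(big));
    return 0;
}
```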

DarenW