
In someone else's .c file, I saw this:

const float c = 0.70710678118654752440084436210485f;

where the author wants to avoid computing sqrt(1/2) at run time.

Can this really be stored with plain C/C++, i.e. without losing precision? It seems impossible to me.

I am using C++, but I do not believe the precision differences between these two languages are big (if any), which is why I did not test it.

So, I wrote these few lines, to have a look at the behaviour of the code:

std::cout << "Number:    0.70710678118654752440084436210485\n";

const float f = 0.70710678118654752440084436210485f;
std::cout << "float:     " << std::setprecision(32) << f << std::endl;

const double d = 0.70710678118654752440084436210485; // no f suffix
std::cout << "double:    " << std::setprecision(32) << d << std::endl;

const double df = 0.70710678118654752440084436210485f;
std::cout << "doublef:   " << std::setprecision(32) << df << std::endl;

const long double ld = 0.70710678118654752440084436210485;
std::cout << "l double:  " << std::setprecision(32) << ld << std::endl;

const long double ldl = 0.70710678118654752440084436210485l; // l suffix!
std::cout << "l doublel: " << std::setprecision(32) << ldl << std::endl;

The output is this:

                   *       ** ***
                   v        v v
Number:    0.70710678118654752440084436210485    // 32 decimal digits
float:     0.707106769084930419921875            // 24 >>      >>
double:    0.70710678118654757273731092936941
doublef:   0.707106769084930419921875            // same as float
l double:  0.70710678118654757273731092936941    // same as double
l doublel: 0.70710678118654752438189403651592    // suffix l

where * is the last accurate digit of float, ** the last accurate digit of double and *** the last accurate digit of long double.

The output of double has 32 decimal digits, since I have set the precision of std::cout to that value.

The float output has 24, as expected, as stated here:

float has 24 binary bits of precision, and double has 53.

I would expect the last output to be the same as the second-to-last, i.e. that the f suffix would not prevent the number from becoming a double. I think that when I write this:

const double df = 0.70710678118654752440084436210485f;

what happens is that the number first becomes a float and is then stored as a double, so after the 24th decimal digit there is nothing left to store, and that's why the double precision stops there.

Am I correct?

From this answer I found some relevant information:

float x = 0 has an implicit typecast from int to float.
float x = 0.0f does not have such a typecast.
float x = 0.0 has an implicit typecast from double to float.

[EDIT]

About __float128: it is not standard, so it is out of consideration here. See more here.

gsamaras

  • You miss one type: `long double`. – Some programmer dude May 13 '14 at 10:49
  • Once you use a base representation for `sqrt(1/2)`, you lose precision in any case! – barak manos May 13 '14 at 10:51
  • Try `long double`, it is extended double precision (80 to 96 bits wide). – Don't You Worry Child May 13 '14 at 10:51
  • You can easily store them as text. What do you need this for? Maybe you need a high-precision float library? – zch May 13 '14 at 10:53
  • @JoachimPileborg edited. I do not want to use libraries for very high precision. The question is more about understanding, rather than just making it work. – gsamaras May 13 '14 at 10:55
  • Yep, nice analysis. (You might change the title to something like 'Analysis of float/double precision', though.) – May 13 '14 at 10:58
  • @DieterLücking did that, but also appended "for 32 decimal digits". If you think that this does not help improve the question, let me know. – gsamaras May 13 '14 at 12:25
  • @Don'tYouWorryChild long double is 80 bits on x86; after **padding** it becomes 96 bits on x86 and maybe 128 bits on x86_64, so the remaining bits have no significance. – phuclv May 13 '14 at 14:50

3 Answers


From the standard:

There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.

So you can see your issue with this question: the standard doesn't actually say how precise floats are.

In terms of standard implementations, you need to look at IEEE 754, which means the other two answers, from lrineau and Davidmh, are perfectly valid approaches to the problem.

As to suffix letters to indicate type, again looking at the standard:

The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double.

So your attempt to create a long double will just have the same precision as the double literal you are assigning to it unless you use the L suffix.

I understand that some of these answers may not seem satisfactory, but there is a lot of background reading to be done on the relevant standards before you can dismiss answers. This answer is already longer than intended so I won't try and explain everything here.

And as a final note: since the precision is not clearly defined, why not have a constant that's longer than it needs to be? It seems sensible to always define a constant with enough digits that it rounds correctly regardless of type.

Matt
  • I didn't know they were called suffixes. Nice explanation. So you mean that the original author of that long number did that so that he could use as much precision as possible on the corresponding system? As for "before you can dismiss answers": I did not downvote any of the answers. I cannot test the first one and am waiting for a reply to my comment, and the Python one didn't mention that the same holds for C. So I don't think I dismissed any answer. – gsamaras May 13 '14 at 12:50
  • @G.Samaras, yes. It's a good idea for a number of reasons: Firstly (as mentioned above) you don't actually know the precision (though in practice it's nearly always the same). Secondly, if you wanted to change the code in the future to use a different data type, then your constants would still be valid. – Matt May 13 '14 at 13:00
  • I think the answer would improve if you wrote this comment into it. I am upvoting, thanks. – gsamaras May 13 '14 at 13:03
  • +1 for a good answer. Note: I agree in general about "why not have a constant that's longer than it needs to be?", but there is a flaw here. The constant given, though certainly good for the classic 32-bit float (binary32) and 64-bit double (binary64), falls about 2-3 decimal digits short for a binary128. It's easy to look at this code and say "looks good for a long double - it has lots of digits" when it is lacking. – chux - Reinstate Monica May 13 '14 at 14:11
  • (cont) Better to size to the constant's type precision (+ 3 decimal digits) or to the next larger type (+ 3 decimal digits); pasting lots of digits (which happen to be correct to the _next_ typical precision **minus** 3 digits) is a problem lying in the weeds. – chux - Reinstate Monica May 13 '14 at 14:11
  • @chux, I was under the impression that in most cases `long double` would actually be double extended precision, since most processors have 80 bit fpu registers. From a quick test it appears that on my pc, a `long double` is stored as 16 bytes, but only performs maths using double extended precision (80 bits), furthermore float.h agrees with that, giving `LDBL_MAX_10_EXP` as `4932`, rather than `16383` which would be a binary128 float. However, I agree with you in principle that the constant doesn't leave room for potential future types... – Matt May 13 '14 at 15:38
  • ... it's possible that this was the greatest precision (s)he could produce with the tools available. – Matt May 13 '14 at 15:38
  • Many (most?) PCs will have `long double` map to an 80-bit FP format, at least closely matching the double-extended precision specified in the IEEE specs. Various compiler/packing options may affect whether the data is stored in memory as 10 or 16 bytes. Note: be prepared for _decimal_ FP formats. – chux - Reinstate Monica May 13 '14 at 16:36
  • @Matt Regarding "looking at the standard: The type of a floating literal ...", what standard are you quoting? – chux - Reinstate Monica May 14 '14 at 14:19
  • @chux That was from the C++ standard working draft [n3337](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf), it's a little dated (2012), but to the best of my knowledge it's the latest version that is free and open online. – Matt May 14 '14 at 14:55
  • @Matt and OP: I think there is an important consideration here between C and C++. As Matt quotes, in C++ the type of a constant is determined by f, F, l, L or their absence. But in C, "floating constants are evaluated to a format whose range and precision may be greater than required by the type" (§5.2.4.2.2 9). It depends on `FLT_EVAL_METHOD`. I am not certain how this plays out in C++. – chux - Reinstate Monica May 14 '14 at 18:32
  • @chux, C still supports suffixes for types. What that paragraph means is that even though you are doing arithmetic on a 32-bit float, it is still loaded into the 80-bit FPU registers. After the arithmetic, it is then truncated back into a 32-bit float. The C standard has a [similar paragraph](http://port70.net/~nsz/c/c11/n1570.html#6.4.4.2p4) to the C++ one, re float constant suffixes. – Matt May 14 '14 at 19:18

Python's numerical library, numpy, has a very convenient float info function. All its types are equivalent to C's:

For C's float:

print numpy.finfo(numpy.float32)
Machine parameters for float32
---------------------------------------------------------------------
precision=  6   resolution= 1.0000000e-06
machep=   -23   eps=        1.1920929e-07
negep =   -24   epsneg=     5.9604645e-08
minexp=  -126   tiny=       1.1754944e-38
maxexp=   128   max=        3.4028235e+38
nexp  =     8   min=        -max
---------------------------------------------------------------------

For C's double:

print numpy.finfo(numpy.float64)
Machine parameters for float64
---------------------------------------------------------------------
precision= 15   resolution= 1.0000000000000001e-15
machep=   -52   eps=        2.2204460492503131e-16
negep =   -53   epsneg=     1.1102230246251565e-16
minexp= -1022   tiny=       2.2250738585072014e-308
maxexp=  1024   max=        1.7976931348623157e+308
nexp  =    11   min=        -max
---------------------------------------------------------------------

And for C's long float:

print numpy.finfo(numpy.float128)
Machine parameters for float128
---------------------------------------------------------------------
precision= 18   resolution= 1e-18
machep=   -63   eps=        1.08420217249e-19
negep =   -64   epsneg=     5.42101086243e-20
minexp=-16382   tiny=       3.36210314311e-4932
maxexp= 16384   max=        1.18973149536e+4932
nexp  =    15   min=        -max
---------------------------------------------------------------------

So, not even long float (128 bits) will give you the 32 digits you want. But do you really need them all?

Davidmh
  • The decimal precision of quadruple precision is about 34. http://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format – lrineau May 13 '14 at 11:06
  • The question is not about `Python`. Moreover, as I already said, the question is not about whether I need them, but about understanding the behaviour of the code. However, I am not downvoting your answer, because maybe you can improve it. – gsamaras May 13 '14 at 12:28
  • I am using Python because it has this very convenient function. All the types correspond to C types, so what is true for them is true for you too. In fact, they just follow the IEEE standards. – Davidmh May 13 '14 at 12:33
  • My comment on the usefulness is more for the record. – Davidmh May 13 '14 at 12:34
  • Now that you have edited in the correlation between these languages, I will upvote. I do not know `Python`, so I could not have known that fact before you told me. :) – gsamaras May 13 '14 at 13:01
  • @Davidmh: NumPy's `float128` doesn't necessarily correspond to `long double` in C. There is no specification on the number of bits that `long double` uses. On many (not all) x86 platforms, it uses the 80-bit extended precision mode provided by its FPU. There's certainly no specification that states that it provides quadruple precision. So, drawing conclusions based on NumPy's (admittedly useful) analysis of its `float128` type's precision is fruitless. With that said, the parameters reported for its "128-bit" data type look more like what you would expect from 80 bits. – Jason R May 13 '14 at 13:06
  • @JasonR my original answer had typo: it corresponds to long float. You are right, platform dependency can make things messy. – Davidmh May 13 '14 at 13:20
  • @Davidmh: There is no such type as `long float` in C. – Jason R May 13 '14 at 13:22
  • @JasonR strange... Numpy's documentation says `128-bit floating-point number. Character code: 'g'. C long float compatible.` I am not expert in C, so I trust you more than myself. – Davidmh May 13 '14 at 13:27
  • This information titled "For C's float:", "For C's double:", etc. is misleading. This data represents a _sample_ of C's floating-point characteristics. These characteristics differ between C implementations. The C specification does _not_ define these values as shown. – chux - Reinstate Monica May 13 '14 at 13:52

Some compilers implement the binary128 floating-point format, standardized by IEEE 754-2008. With gcc, for example, the type is __float128. That floating-point format has about 34 decimal digits of precision (log(2^113)/log(10)).

You can use the Boost Multiprecision library and its wrapper type float128. That implementation will use the native type if available, or else a drop-in replacement.

Let's extend your experiment with that new, non-standard type __float128, using a recent g++ (4.8):

// Compiled with g++ -Wall -lquadmath essai.cpp
#include <iostream>
#include <iomanip>
#include <quadmath.h>
#include <sstream>

std::ostream& operator<<(std::ostream& out, __float128 f) {
  char buf[200];
  std::ostringstream format;
  format << "%." << (std::min)(190L, out.precision()) << "Qf";
  quadmath_snprintf(buf, 200, format.str().c_str(), f);
  out << buf;
  return out;
}

int main() {
  std::cout.precision(32);
  std::cout << "Number:    0.70710678118654752440084436210485\n";

  const float f = 0.70710678118654752440084436210485f;
  std::cout << "float:     " << std::setprecision(32) << f << std::endl;

  const double d = 0.70710678118654752440084436210485; // no f suffix
  std::cout << "double:    " << std::setprecision(32) << d << std::endl;

  const double df = 0.70710678118654752440084436210485f;
  std::cout << "doublef:   " << std::setprecision(32) << df << std::endl;

  const long double ld = 0.70710678118654752440084436210485;
  std::cout << "l double:  " << std::setprecision(32) << ld << std::endl;

  const long double ldl = 0.70710678118654752440084436210485l; // l suffix!
  std::cout << "l doublel: " << std::setprecision(32) << ldl << std::endl;

  const __float128 f128 = 0.70710678118654752440084436210485;
  const __float128 f128f = 0.70710678118654752440084436210485f; // f suffix
  const __float128 f128l = 0.70710678118654752440084436210485l; // l suffix
  const __float128 f128q = 0.70710678118654752440084436210485q; // q suffix

  std::cout << "f128:      " << f128 << std::endl;
  std::cout << "f f128:    " << f128f << std::endl;
  std::cout << "l f128:    " << f128l << std::endl;
  std::cout << "q f128:    " << f128q << std::endl;
}

The output is:

                   *       ** ***        ****
                   v        v v             v
Number:    0.70710678118654752440084436210485
float:     0.707106769084930419921875
double:    0.70710678118654757273731092936941
doublef:   0.707106769084930419921875
l double:  0.70710678118654757273731092936941
l doublel: 0.70710678118654752438189403651592
f128:      0.70710678118654757273731092936941
f f128:    0.70710676908493041992187500000000
l f128:    0.70710678118654752438189403651592
q f128:    0.70710678118654752440084436210485

where * is the last accurate digit of float, ** the last accurate digit of double, *** the last accurate digit of long double, and **** is the last accurate digit of __float128.

As said in another answer, the C++ standard does not say what the precision of the various floating-point types is (just as it does not say what the size of the integral types is); it only specifies a minimal precision/size for those types. But the IEEE 754 standard does specify all that! The FPUs of a lot of architectures implement IEEE 754, and recent versions of gcc implement the standard's binary128 type with the extension __float128.

As for the explanation of your code, or mine: an expression like 0.70710678118654752440084436210485f is a floating-point literal. It has a type, defined by its suffix, here f for float. The value of the literal is therefore the value of that type nearest to the given number. That explains why, for example, the precision of "doublef" is the same as that of "float" in your code. In recent gcc versions, there is an extension that allows defining floating-point literals of type __float128, with the Q suffix (quadruple precision).

lrineau
  • See my edit. I can not print the `__float128`, since `std::cout` says `ambiguous overload for ‘operator< – gsamaras May 13 '14 at 12:38
  • To be honest, I would really like to print it, so that I can augment my analysis above. If we achieve that, I think that this answer deserves at least a +1. – gsamaras May 13 '14 at 13:05
  • In addition to the storage of the floating-point value itself, you have another problem: how to specify a literal initializer that has enough precision to provide all of the desired digits. If you're using C++11, you can implement this using [user-defined literals](http://akrzemi1.wordpress.com/2012/08/12/user-defined-literals-part-i/). I'm not sure if Boost.Multiprecision has added this capability yet. – Jason R May 13 '14 at 13:12
  • binary128 has 113 bit precision, yet the constant `0.70710678118654752440084436210485f;` only provides ~106 bits. Should this code be used with such FP numbers, it could easily be a difficult bug to catch early on in development. – chux - Reinstate Monica May 13 '14 at 13:48
  • So none of @JasonR, @chux and lrineau know how to print the `__float128`? Refer to my 1st comment here. – gsamaras May 13 '14 at 14:31
  • @G. Samaras: I plan to edit my answer, but no time to make a good edit so far (I am at work). Have a look at https://gcc.gnu.org/onlinedocs/libquadmath/ and particularly the section "I/O Library Routines". – lrineau May 13 '14 at 14:36
  • I didn't know that, sorry. I did search, and had already found what you linked to. It won't find `quadmath_sprintf`, though. Also this didn't help: http://stackoverflow.com/questions/13571621/quadruple-precision-in-g-4-6-3-linux-using-quadmath – gsamaras May 13 '14 at 14:46
  • @chux even if you provide more than enough precision, with the `f` suffix it'll be truncated down to 24 bits like float, not 106 bits. – phuclv May 13 '14 at 14:54
  • The behaviour of my code in the question supports @LưuVĩnhPhúc's comment. – gsamaras May 13 '14 at 15:42
  • @Lưu Vĩnh Phúc the `f` does not necessarily truncate down to 24 bits. It ensures that the number will be treated as a `float`. C does not define a `float` to have 24 bits, but at _least_ 6 decimal digits' worth of precision. Consider a system that treats all FP as the same size, and that size is `binary128`. In that case, no truncation occurs and the constant is not as close to sqrt(0.5) as it should be. – chux - Reinstate Monica May 13 '14 at 16:29