
In the IEEE 754 standard, the minimum strictly positive (subnormal) value is 2^−16493 ≈ 10^−4965 using the quadruple-precision floating-point format. Why does GCC reject anything lower than 10^−4949? I'm looking for an explanation of the different things that could be going on underneath which determine the limit to be 10^−4949 rather than 10^−4965.

#include <stdio.h>

void prt_ldbl(long double decker) {
    /* Reinterpret the long double's storage as raw bytes and dump
       them in memory order (lowest address first). */
    unsigned char *desmond = (unsigned char *) &decker;
    size_t i;

    for (i = 0; i < sizeof decker; i++) {
        printf("%02X ", desmond[i]);
    }
    printf("\n");
}

int main(void)
{
    long double x = 1e-4955L;
    prt_ldbl(x);
    return 0;
}

I'm using GNU GCC version 4.8.1 online - not sure which architecture it's running on (which I realize may be the culprit). Please feel free to post your findings from different architectures.

Harvinder
  • FWIW `clang` gives me: `warning: magnitude of floating-point constant too small for type 'long double'; minimum is 3.64519953188247460253E-4951 [-Wliteral-range]` – Paul R Sep 18 '14 at 15:21
  • Another FWIW: `clang -dM -E - < /dev/null` gives `#define __LDBL_DENORM_MIN__ 3.64519953188247460253e-4951L` and `#define __LDBL_MIN__ 3.36210314311209350626e-4932L`. – Paul R Sep 18 '14 at 15:27

2 Answers


Your `long double` type may not be(*) quadruple-precision. It may simply be the 387 80-bit extended-double format. This format has the same number of exponent bits as quad-precision, but many fewer significand bits, so the minimum value representable in it sounds about right (2^−16445).


(*) Your long double is likely not to be quad-precision, because no processor implements quad-precision in hardware. The compiler can always implement quad-precision in software, but it is much more likely to map long double to double-precision, to extended-double or to double-double.

Pascal Cuoq
  • I thought this too at first, but it seems that later versions of gcc and clang support 128 bit quad precision for long double. Live demo: http://coliru.stacked-crooked.com/a/5e80223e90de9ea6 – Paul R Sep 18 '14 at 15:28
  • @PaulR I have seen GCC's `__float128`, but it is not mapped to `long double`. One of the awkward things about using it is precisely that you have to pass a string to a function when you want a constant that cannot be represented exactly as a `double` or `long double` constant: https://gcc.gnu.org/onlinedocs/libquadmath/strtoflt128.html#strtoflt128 – Pascal Cuoq Sep 18 '14 at 15:30
  • Sure - but evidently gcc and clang both think that `sizeof(long double) == 16` these days. – Paul R Sep 18 '14 at 15:30
  • 2
    @PaulR The size of 16 bytes for `long double` is not indicative of the number of meaningful bits in it. In the case of a compiler that maps `long double` to 80-bit extended-double, the representation is 6/16 padding for alignment purposes. – Pascal Cuoq Sep 18 '14 at 15:32
  • Aha - you may be right then - perhaps it's just 80 bits "under the hood" ? – Paul R Sep 18 '14 at 15:32
  • You mean to say that the questioner is probably not writing code for IBM z/Architecture? – Stephen Canon Sep 18 '14 at 16:07
  • The code actually works with `xlc -qfloat=ieee` on z, and you can go all the way down to 10^−4965. – Harvinder Sep 18 '14 at 16:18

The smallest 80-bit `long double` is around 2^(−16382 − 63) ≈ 10^−4951, not 2^−16493. So the compiler is entirely correct; your number is smaller than the smallest subnormal.

tmyklebu
  • There is something I don't understand: the minimum positive quad-precision value is 2^−16493, since Wikipedia says so. The minimum positive extended-double value should be 2^49 times larger, because this format has effectively 49 fewer significand bits (6 * 8 + 1). So that should make the smallest 80-bit `long double` 2^-16444, should it not? – Pascal Cuoq Sep 18 '14 at 15:52
  • MPFR developers think 2^-16445: http://lists.gforge.inria.fr/pipermail/mpfr-commits/2012-March/007510.html – Pascal Cuoq Sep 18 '14 at 15:56
  • Joseph Myers thinks 2^-16444: https://www.cygwin.com/ml/glibc-cvs/2012-q3/msg00425.html – Pascal Cuoq Sep 18 '14 at 15:59
  • 1
    @PascalCuoq: The simple C program that repeatedly divides `(long double)1` by 2 until it gets zeor and prints out the last nonzero thing it got says the answer is about 3.6452e-4951, which is 2^(-16382 - 63). Remember that 80-bit `long double`s don't have an implied bit. – tmyklebu Sep 18 '14 at 16:38
  • The explicit leading bit is the “1” in “(6 * 8 + 1)”. It turns out that Wikipedia was wrong when it said that the smallest quad-precision denormal was 2^−16493. It should have said 2^-16494. The reasoning that one should be 2^49 times more than the other holds: http://en.wikipedia.org/w/index.php?title=Quadruple-precision_floating-point_format&diff=626209475&oldid=611427180 (Now investigating the presence of 16444 in the GCC source code) – Pascal Cuoq Sep 19 '14 at 15:03
  • For humorous value, state of the 16444 in glibc 2.20: https://pbs.twimg.com/media/Bx6CcVuIAAA5ooC.jpg:large – Pascal Cuoq Sep 19 '14 at 15:52
  • @PascalCuoq: `strtold` in my glibc 2.19 seems to work properly for me. You sure that field means "smallest representable power of two" rather than something that's off-by-one from that? – tmyklebu Sep 19 '14 at 16:47
  • The Intel and Motorola extended formats can represent the same sets of values (the difference is only in the layout of the bits), so at least one entry is off between these two. I think that they are all off by one except the Motorola, but that since this is only in an automatic test generation function, the only consequence is poor testing around the smallest denormals. – Pascal Cuoq Sep 19 '14 at 17:03