Parse double precision IEEE floating-point on a C compiler with no double precision type

Question

I am working with an 8-bit AVR chip. There is no data type for a 64-bit double (double just maps to the 32-bit float). However, I will be receiving 64-bit doubles over Serial and need to output 64-bit doubles over Serial.

How can I convert the 64-bit double to a 32-bit float and back again without casting? The format for both the 32-bit and 64-bit will follow IEEE 754. Of course, I assume a loss of precision when converting to the 32-bit float.

For converting from 64-bit to 32-bit float, I am trying this out:

// Script originally from http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1281990303
float convert(uint8_t *in) {
  union {
    float real;
    uint8_t base[4];
  } u;
  uint16_t expd = ((in[7] & 127) << 4) + ((in[6] & 240) >> 4);
  uint16_t expf = expd ? (expd - 1024) + 128 : 0;
  u.base[3] = (in[7] & 128) + (expf >> 1);
  u.base[2] = ((expf & 1) << 7) + ((in[6] & 15) << 3) + ((in[5] & 0xe0) >> 5);
  u.base[1] = ((in[5] & 0x1f) << 3) + ((in[4] & 0xe0) >> 5);
  u.base[0] = ((in[4] & 0x1f) << 3) + ((in[3] & 0xe0) >> 5);
  return u.real;
}

For numbers like 1.0 and 2.0, the above works, but when I tested with passing in a 1.1 as a 64-bit double, the output was off by a bit (literally, not a pun!), though this could be an issue with my testing. See:

// Comparison of bits for a float in Java and the bits for a float in C after
// converted from a 64-bit double. Last bit is different.
// Java code can be found at https://gist.github.com/912636
JAVA FLOAT:        00111111 10001100 11001100 11001101
C CONVERTED FLOAT: 00111111 10001100 11001100 11001100

Note that there is no exact representation for 1.1, whether double or float. One can imaging that shortening a double to a float can be done by just cutting the less significant bits or by rounding. I can't figure from your code what it is doing. — Ingo, Apr 10 '11 at 20:01
@Gabe - Being off by a bit for the LSB may be an acceptable loss of precision. However, it did cause me to question the code I provided and look for alternatives. — baalexander, Apr 10 '11 at 20:52
this takes me back - one of my first jobs (back in '87) was writing code to convert from PDP-11 float format to IEEE-754 format so I could read data files on a PC. — Alnitak, Apr 10 '11 at 20:54
I thought to myself "Why in the world do you need to do that?" until I read "8-bit AVR chip". :-) — In silico, Apr 10 '11 at 21:36
@Gabe: Being off by a bit is an error. End of story. Floating-point arithmetic has enough problems without the basic rounding operation being flaky. — TonyK, Apr 10 '11 at 23:03
@TonyK: This rounding algorithm isn't "flaky", it's rounding toward 0. Having a different default rounding mode from Java isn't my definition of an error. — Gabe, Apr 11 '11 at 00:20
@Gabe: Rounding towards 0 is itself a flaky method. It introduces cumulative bias into results. It only exists, as far as I can tell, to allow old hardware to call itself IEEE-compatible. — TonyK, Apr 11 '11 at 13:17

TonyK · Accepted Answer · 2011-04-11T11:06:49.347

5

IEEE specifies five different rounding modes, but the one to use by default is Round half to even. So you have a mantissa of the form 10001100 11001100 11001100 11001100... and you have to round it to 24 bits. Numbering the bits from 0 (most significant), bit 24 is 1; but that is not enough to tell you whether to round bit 23 up or not. If all the remaining bits were 0, you would not round up, because bit 23 is 0 (even). But the remaining bits are not zero, so you round up in all cases.

Some examples:

10001100 11001100 11001100 10000000...(all zero) doesn't round up, because bit 23 is already even.

10001100 11001100 11001101 10000000...(all zero) does round up, because bit 23 is odd.

10001100 11001100 1100110x 10000000...0001 always rounds up, because the remaining bits are not all zero.

10001100 11001100 1100110x 0xxxxxxx... never rounds up, because bit 24 is zero.

edited Apr 11 '11 at 11:06

answered Apr 10 '11 at 21:06

TonyK

15,585
4
29
65

Thanks TonyK, for clarification: with 1.1, the last bits that are included from ((in[3] & 0xe0) >> 5) are 100, and the bits following that cut off are 11001001010001111010111000011. The correct form would have the last bits of the float be 101. How does this work with the "round up"? – baalexander Apr 13 '11 at 05:42
I added rounding logic based on the above. It looks to have resolved the issue on my basic testing (1.1, 1.11, 1.2, etc.), though the testing was not extensive yet. The code can be seen at https://github.com/baalexander/avr_bridge/blob/master/scripts/avr_ros/serializer.h. I am working on 32-bit to 64-bit now. Thanks for pointing me in the right direction TonyK! – baalexander Apr 15 '11 at 04:06

score 3 · Answer 2 · answered Apr 10 '11 at 22:11

The following code seems to convert from single precision to double precision. I'll leave it as an exercise to the reader to implement the narrowing version. This should get you started though. The hardest part is getting the bit positions correct in the significand. I includes some comments that include what is going on.

double
extend_float(float f)
{
    unsigned char flt_bits[sizeof(float)];
    unsigned char dbl_bits[sizeof(double)] = {0};
    unsigned char sign_bit;
    unsigned char exponent;
    unsigned int  significand;
    double out;

    memcpy(&flt_bits[0], &f, sizeof(flt_bits));
    /// printf("---------------------------------------\n");
    /// printf("float = %f\n", f);
#if LITTLE_ENDIAN
    reverse_bytes(flt_bits, sizeof(flt_bits));
#endif
    /// dump_bits(&flt_bits[0], sizeof(flt_bits));

    /* IEEE 754 single precision
     *    1 sign bit              flt_bits[0] & 0x80
     *    8 exponent bits         flt_bits[0] & 0x7F | flt_bits[1] & 0x80
     *   23 fractional bits       flt_bits[1] & 0x7F | flt_bits[2] & 0xFF |
     *                            flt_bits[3] & 0xFF
     *
     * E = 0   & F  = 0 -> +/- zero
     * E = 0   & F != 0 -> sub-normal
     * E = 127 & F  = 0 -> +/- INF
     * E = 127 & F != 0 -> NaN
     */
    sign_bit = (flt_bits[0] & 0x80) >> 7;
    exponent = ((flt_bits[0] & 0x7F) << 1) | ((flt_bits[1] & 0x80) >> 7);
    significand = (((flt_bits[1] & 0x7F) << 16) |
                   (flt_bits[2] << 8) |
                   (flt_bits[3]));

    /* IEEE 754 double precision
     *    1 sign bit              dbl_bits[0] & 0x80
     *   11 exponent bits         dbl_bits[0] & 0x7F | dbl_bits[1] & 0xF0
     *   52 fractional bits       dbl_bits[1] & 0x0F | dbl_bits[2] & 0xFF
     *                            dbl_bits[3] & 0xFF | dbl_bits[4] & 0xFF
     *                            dbl_bits[5] & 0xFF | dbl_bits[6] & 0xFF
     *                            dbl_bits[7] & 0xFF
     *
     * E = 0    & F  = 0 -> +/- zero
     * E = 0    & F != 0 -> sub-normal
     * E = x7FF & F  = 0 -> +/- INF
     * E = x7FF & F != 0 -> NaN
     */
    dbl_bits[0] = flt_bits[0] & 0x80; /* pass the sign bit along */

    if (exponent == 0) {
        if (significand  == 0) { /* +/- zero */
            /* nothing left to do for the outgoing double */
        } else { /* sub-normal number */
            /* not sure ... pass on the significand?? */
        }
    } else if (exponent == 0xFF) { /* +/-INF and NaN */
        dbl_bits[0] |= 0x7F;
        dbl_bits[1]  = 0xF0;
        /* pass on the significand */
    } else { /* normal number */
        signed int int_exp = exponent;
        int_exp -= 127;  /* IEEE754 single precision exponent bias */
        int_exp += 1023; /* IEEE754 double precision exponent bias */
        dbl_bits[0] |= (int_exp & 0x7F0) >> 4;  /* 7 bits */
        dbl_bits[1]  = (int_exp & 0x00F) << 4;  /* 4 bits */
    }

    if (significand != 0) {
        /* pass on the significand most-significant-bit first */
        dbl_bits[1] |=  (flt_bits[1] & 0x78) >> 3;    /* 4 bits */
        dbl_bits[2] = (((flt_bits[1] & 0x07) << 5) |  /* 3 bits */
                       ((flt_bits[2] & 0xF8) >> 3));  /* 5 bits */
        dbl_bits[3] = (((flt_bits[2] & 0x07) << 5) |  /* 3 bits */
                       ((flt_bits[3] & 0xF8) >> 3));  /* 5 bits */
        dbl_bits[4] =  ((flt_bits[3] & 0x07) << 5);   /* 3 bits */
    }

    ///dump_bits(&dbl_bits[0], sizeof(dbl_bits));
#if LITTLE_ENDIAN
    reverse_bytes(&dbl_bits[0], sizeof(dbl_bits));
#endif
    memcpy(&out, &dbl_bits[0], sizeof(out));

    return out;
}

I left some printf lines in but commented out in C++ style comments. You will have to provide the appropriate definitions for reverse_bytes, LITTLE_ENDIAN, and dump_bits. I didn't want to spoil all of the fun for you afterall. The Wikipedia entries on single precision and double precision numbers are very good.

If you are going to be tinkering with floating point numbers a lot, you should read "What Every Computer Scientist Should Know About Floating Point Arithmetic" by David Goldberg and "How to Print Floating Point Numbers Accurately" by Steele and White. They are the two most informative articles out there when it comes to understanding how floating point numbers work.

Widening is easy. It's the narrowing that's hard to get right. — TonyK, Apr 10 '11 at 22:26
Once you have the bits pulled apart it isn't that bad ... well ... provided that you have an implementation of `fegetround` or some way to get the active rounding mode. — D.Shawley, Apr 10 '11 at 22:32
Exactly. Also, it gets tricky when the lsb round-up causes overflow. OK, not *that* tricky, but definitely harder than widening. I wouldn't call it an 'exercise for the reader'. — TonyK, Apr 10 '11 at 23:00
Thanks! I just finished the widening code, which was heavily based off your example code. Source can be seen in my comment to the answer. — baalexander, Apr 17 '11 at 04:45

score 1 · Answer 3 · answered Apr 10 '11 at 20:34

http://www.google.com/search?q=c+convert+ieee+754+double+single

one of the first results is this:

http://www.mathworks.com/matlabcentral/fileexchange/23173

The code shows how to convert IEEE-754 double into IEEE-754-like (1,5,10) floating format. This code contains lots of comments, and mentions the typical traps that you might fall into.

It's not exactly what you want, but it's a good starting point.

score 1 · Answer 4 · answered Apr 11 '11 at 08:40

There is only one full implementation of IEEE754 double in GCC for AVR that I am aware of, and you can find it here.

You will need this archive and then replace avr_f64.c from the archive with this one.

Library takes about 21K Flash and 310 bytes of ram.

Original post can be found here. I have extracted all important information from the original post and presented here since I think that you need to have an account to log in the forum.

Parse double precision IEEE floating-point on a C compiler with no double precision type

4 Answers4

Linked