
I'm on a 32-bit embedded platform (Cortex-M4F) which has a hardware FPU. I'd really like to use the FPU, but the platform provides no hardware implementation of 64-bit float operations; any 64-bit operation triggers a hardfault.

But I also need to be able to send a few key 64-bit floats over a serial port to a PC. There's a great question here about upgrading a float32 to a float64 by simply copying over the components of the IEEE 754 float representation, so that's been my starting point.

However, I'd really like to be able to populate this float64 field with the sum of an int32 accumulator and a float32 fractional component. I assume the operation would be something to the effect of:

  • convert the accumulator to float32; copy its exponent to the final float64.
  • determine the difference between the exponent of the accumulator float32 and fractional float32.
  • shift the mantissa of the fractional component according to the exponent difference, add it to the mantissa of the accumulator float, and place this value into the float64.

This will never exceed the range of an int32, but I believe it will give considerably better precision than a straight float32 as the magnitude becomes large.

Are there any particular gotchas I should watch for as I'm implementing this? Any libraries or existing code which could be of help in composing and decomposing these structures? Thanks!

mikepurvis

1 Answer


The bit-wrangling to convert 32 bit float into 64 bit is relatively straightforward. (Though I would worry about byte order.)
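For illustration, here is a minimal sketch of that bit-wrangling using integer operations only, so no double-precision instruction or library call is involved. The names `f32_to_f64_bits` and `put_u64_le` are made up for this example, and it assumes the float is stored as IEEE 754 binary32. The byte-order worry is handled by emitting the bytes explicitly, least significant first, rather than copying memory; whether the PC side expects little-endian is an assumption you would need to check.

```c
#include <stdint.h>
#include <string.h>

/* Widen an IEEE 754 binary32 value to a binary64 bit pattern.
 * Hypothetical helper: integer operations only, no double arithmetic. */
uint64_t f32_to_f64_bits(float f)
{
    uint32_t b;
    memcpy(&b, &f, sizeof b);              /* read the raw bits */

    uint64_t sign = (uint64_t)(b >> 31) << 63;
    int32_t  exp  = (int32_t)((b >> 23) & 0xFFu);
    uint32_t man  = b & 0x7FFFFFu;

    if (exp == 0xFF)                       /* infinity or NaN: all-ones exponent */
        return sign | (UINT64_C(0x7FF) << 52) | ((uint64_t)man << 29);

    if (exp == 0) {
        if (man == 0)
            return sign;                   /* signed zero */
        /* float32 subnormal: normalise it, since it is a normal float64 */
        exp = 1;
        while (!(man & 0x800000u)) {
            man <<= 1;
            exp--;
        }
        man &= 0x7FFFFFu;                  /* drop the now-explicit leading 1 */
    }

    /* Re-bias the exponent (127 -> 1023) and left-align the 23-bit fraction
     * in the 52-bit fraction field. */
    return sign | ((uint64_t)(exp - 127 + 1023) << 52) | ((uint64_t)man << 29);
}

/* Write the 64-bit pattern least significant byte first, independent of
 * the byte order of the machine running this code. */
void put_u64_le(uint8_t out[8], uint64_t v)
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(v >> (8 * i));
}
```

On a little-endian PC the eight bytes can then be copied straight into a `double`.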

Adding a 32-bit integer to the 24/53-bit mantissa is a more interesting problem: it involves aligning the mantissa and the integer (sorting out the sign of the integer and converting to sign and magnitude), doing the guard-bit shuffle, then normalising and rounding as required. You'll need 64-bit unsigned integer arithmetic, shifts etc. to do this conveniently. [I've written software floating point libraries... it's lots of fun, but tricky.]
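One possible way to sketch this, not necessarily how a full soft-float library would do it: align both operands in a 64-bit fixed-point value, then normalise and round once. The function name `sum_to_f64_bits` and the choice of 31 fractional bits are assumptions for this example. It also assumes the fractional part is finite with magnitude below 1, and it truncates any bits of the fraction below 2^-31 before the final round-to-nearest-even, so the result can be off by about 2^-31 when the accumulator is small.

```c
#include <stdint.h>
#include <string.h>

#define Q_BITS 31   /* fractional bits of the intermediate fixed-point sum */

/* Hypothetical helper: compute the float64 bit pattern of acc + frac with
 * integer arithmetic only. Returns 0 on success, or -1 if frac is NaN,
 * infinite, or has magnitude >= 1 (outside this sketch's assumptions). */
int sum_to_f64_bits(int32_t acc, float frac, uint64_t *out)
{
    uint32_t b;
    memcpy(&b, &frac, sizeof b);
    uint32_t e = (b >> 23) & 0xFFu;
    uint32_t m = b & 0x7FFFFFu;
    if (e >= 127u)
        return -1;

    /* Align: |frac| = (m | 2^23) * 2^(e - 150), so in Q31 the magnitude is
     * (m | 2^23) * 2^(e - (150 - Q_BITS)). float32 subnormals are far
     * below 2^-31 and are treated as zero. */
    uint64_t q = 0;
    if (e != 0) {
        uint32_t pivot = 150u - Q_BITS;
        m |= 0x800000u;
        if (e >= pivot)
            q = (uint64_t)m << (e - pivot);
        else if (pivot - e < 24u)
            q = m >> (pivot - e);           /* truncates bits below 2^-31 */
    }

    /* Signed fixed-point sum; |sum| < 2^63, so this cannot overflow. */
    int64_t sum = (int64_t)acc * (INT64_C(1) << Q_BITS);
    sum += (b >> 31) ? -(int64_t)q : (int64_t)q;
    if (sum == 0) {
        *out = 0;
        return 0;
    }

    /* Convert to sign and magnitude. */
    uint64_t sign = sum < 0 ? UINT64_C(1) << 63 : 0;
    uint64_t mag  = sum < 0 ? (uint64_t)0 - (uint64_t)sum : (uint64_t)sum;

    /* Normalise: find the position of the leading 1. */
    int p = 62;
    while (!(mag >> p))
        p--;

    const uint64_t frac_mask = (UINT64_C(1) << 52) - 1;
    uint64_t bits = (uint64_t)(p - Q_BITS + 1023) << 52;
    if (p <= 52) {
        bits += (mag << (52 - p)) & frac_mask;       /* exact, no rounding */
    } else {
        int s = p - 52;                              /* bits to discard */
        uint64_t man  = mag >> s;
        uint64_t rem  = mag & ((UINT64_C(1) << s) - 1);
        uint64_t half = UINT64_C(1) << (s - 1);
        bits += man & frac_mask;
        /* Round to nearest, ties to even; a carry out of the fraction
         * field correctly bumps the exponent. */
        if (rem > half || (rem == half && (man & 1)))
            bits++;
    }
    *out = sign | bits;
    return 0;
}
```

Note that this adds the integer directly rather than converting the accumulator to float32 first, since a float32 cannot hold every int32 above 2^24 exactly.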

I guess you cannot change the PC end to accept two doubles and do the add there? Easier yet, accept the float and the integer!
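If the PC end can be changed, the second option might look roughly like this. The 8-byte little-endian layout and the names `pack_pair` and `unpack_pair` are invented for this sketch, not an existing protocol. The device only moves raw bits around, and the PC does the double-precision add, where it is cheap.

```c
#include <stdint.h>
#include <string.h>

/* Device side (hypothetical wire format): int32 accumulator followed by
 * the raw float32 bits, each least significant byte first. */
void pack_pair(uint8_t out[8], int32_t acc, float frac)
{
    uint32_t a = (uint32_t)acc;
    uint32_t f;
    memcpy(&f, &frac, sizeof f);
    for (int i = 0; i < 4; i++) {
        out[i]     = (uint8_t)(a >> (8 * i));
        out[4 + i] = (uint8_t)(f >> (8 * i));
    }
}

/* PC side: rebuild both values and add them in double precision.
 * Assumes the PC also uses IEEE 754 binary32 for float. */
double unpack_pair(const uint8_t in[8])
{
    uint32_t a = 0, f = 0;
    for (int i = 0; i < 4; i++) {
        a |= (uint32_t)in[i]     << (8 * i);
        f |= (uint32_t)in[4 + i] << (8 * i);
    }
    int32_t acc;
    float frac;
    memcpy(&acc, &a, sizeof acc);
    memcpy(&frac, &f, sizeof frac);
    return (double)acc + (double)frac;
}
```

Both an int32 and a float32 convert to double exactly, so the only rounding is in the final add on the PC.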