I'm on a 32-bit embedded platform (Cortex-M4F) which has a hardware FPU. I'd really like to use the FPU, but the platform provides no hardware implementation of 64-bit float operations: any 64-bit operation triggers a hardfault.
But I also need to be able to send a few key 64-bit floats over a serial port to a PC. There's a great question here about upgrading a float32 to a float64 by simply copying over the components of the IEEE 754 float representation, so that's been my starting point.
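For reference, here is a minimal sketch of that float32-to-float64 widening as I understand it, using integer operations only so that no double-precision instruction is ever issued. The function name `f32_bits_to_f64_bits` is just a placeholder of mine, and it assumes the float32 has already been reinterpreted as a `uint32_t` (for example with `memcpy`):

```c
#include <stdint.h>

/* Hypothetical helper: widen an IEEE 754 binary32 bit pattern to the
 * equivalent binary64 bit pattern using integer operations only. */
uint64_t f32_bits_to_f64_bits(uint32_t f)
{
    uint64_t sign = (uint64_t)(f >> 31) << 63;
    uint32_t exp  = (f >> 23) & 0xFFu;     /* 8-bit biased exponent */
    uint32_t man  = f & 0x7FFFFFu;         /* 23-bit stored mantissa */

    if (exp == 0xFFu) {
        /* Infinity or NaN: all-ones exponent, payload moved to the top. */
        return sign | (0x7FFull << 52) | ((uint64_t)man << 29);
    }

    if (exp == 0) {
        if (man == 0) {
            return sign;                   /* signed zero */
        }
        /* Subnormal float32: normalize it, since every float32
         * subnormal is a normal number in binary64. */
        int shift = 0;
        while ((man & 0x800000u) == 0) {
            man <<= 1;
            shift++;
        }
        man &= 0x7FFFFFu;                  /* drop the now-implicit 1 */
        /* value = 1.xxx * 2^(-126 - shift); rebias with 1023 */
        return sign | ((uint64_t)(897 - shift) << 52) | ((uint64_t)man << 29);
    }

    /* Normal number: rebias the exponent (1023 - 127 = 896) and
     * left-align the 23-bit mantissa in the 52-bit field. */
    return sign | ((uint64_t)(exp + 896u) << 52) | ((uint64_t)man << 29);
}
```

The widening is exact because every float32 value is representable as a float64, so no rounding is involved here.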
However, I'd really like to be able to populate this float64 field with the sum of an int32 accumulator and a float32 fractional component. I assume the operation would be something to the effect of:
- convert the accumulator to float32; copy its exponent to the final float64.
- determine the difference between the exponent of the accumulator float32 and fractional float32.
- shift the mantissa of the fractional component according to the exponent difference, add it to the mantissa of the accumulator float, and place this value into the float64.
The result will never exceed the range of an int32, but I believe it will be considerably more precise than a plain float32 as the magnitude becomes large, since a float32 carries only a 24-bit significand while a float64 carries 53 bits.
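To make the idea concrete, here is a rough sketch of the composition. It is not a literal transcription of the three steps above. Instead of converting the accumulator to float32 and aligning mantissas, it adds both values in 64-bit fixed point (Q31, i.e. scaled by 2^31) and then normalizes the sum into a float64 bit pattern. That is a substitution on my part, chosen because the int32-to-float32 step would itself round for accumulators above 2^24. The name `compose_f64_bits` is hypothetical, the float32 is passed as raw bits, and the sketch assumes the fraction is finite with magnitude below 2^31; fraction bits below 2^-31 are truncated.

```c
#include <stdint.h>

/* Hypothetical helper: IEEE 754 binary64 bit pattern of (acc + frac),
 * built with integer operations only (no double-precision FPU use).
 * Assumption: frac is finite and |frac| < 2^31; anything else yields
 * a quiet NaN pattern. */
uint64_t compose_f64_bits(int32_t acc, uint32_t frac_bits)
{
    uint32_t fexp = (frac_bits >> 23) & 0xFFu;
    uint64_t fman = frac_bits & 0x7FFFFFu;
    int      fneg = (int)(frac_bits >> 31);

    if (fexp >= 127u + 31u) {
        return 0x7FF8000000000000ull;      /* outside the assumed range */
    }

    /* Fraction as Q31 fixed point (value * 2^31), truncated toward zero.
     * float32 subnormals are far below 2^-31 and contribute nothing. */
    int64_t q = 0;
    if (fexp != 0) {
        fman |= 0x800000u;                 /* restore the implicit 1 */
        int shift = (int)fexp - 119;       /* (e - 127 - 23) + 31 */
        if (shift >= 0) {
            q = (int64_t)(fman << shift);
        } else if (shift > -24) {
            q = (int64_t)(fman >> -shift);
        }
    }
    if (fneg) {
        q = -q;
    }

    /* Exact sum in Q31; |total| < 2^63, so it cannot overflow. */
    int64_t total = (int64_t)acc * ((int64_t)1 << 31) + q;
    if (total == 0) {
        return 0;
    }

    uint64_t sign = 0;
    uint64_t mag  = (uint64_t)total;
    if (total < 0) {
        sign = 1ull << 63;
        mag  = 0 - mag;                    /* magnitude, modulo 2^64 */
    }

    /* Locate the most significant set bit (0..62). */
    int msb = 63;
    while (((mag >> msb) & 1u) == 0) {
        msb--;
    }

    /* value = mag * 2^-31, so the unbiased exponent is msb - 31. */
    int exp = msb - 31 + 1023;

    uint64_t man;
    if (msb <= 52) {
        man = mag << (52 - msb);           /* fits exactly */
    } else {
        int drop = msb - 52;               /* 1..10 low bits to round off */
        uint64_t rem  = mag & ((1ull << drop) - 1);
        uint64_t half = 1ull << (drop - 1);
        man = mag >> drop;
        if (rem > half || (rem == half && (man & 1u))) {
            man++;                         /* round to nearest, ties to even */
        }
        if (man >> 53) {                   /* rounding carried into bit 53 */
            man >>= 1;
            exp++;
        }
    }

    return sign | ((uint64_t)exp << 52) | (man & ((1ull << 52) - 1));
}
```

For example, with an accumulator of 16777217 (2^24 + 1) and a fraction of 0.25f, a float32 sum could not hold even the integer part exactly, whereas the pattern built here is the exact float64 for 16777217.25.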
Are there any particular gotchas I should watch for as I'm implementing this? Any libraries or existing code which could be of help in composing and decomposing these structures? Thanks!