How to convert packed integer (16.16) fixed-point to float?

Question

How to convert a "32-bit signed fixed-point number (16.16)" to a float?

Is (fixed >> 16) + (fixed & 0xffff) / 65536.0 ok? What about -2.5? And -0.5?

Or is fixed / 65536.0 the right way?

(PS: How does signed fixed-point "-0.5" looks like in memory anyway?)

CodesInChaos · Accepted Answer · 2011-12-26T20:25:37.683

I assume two's complement 32 bit integers and operators working as in C#.

How to do the conversion?

fixed / 65536.0

is correct and easy to understand.

(fixed >> 16) + (fixed & 0xffff) / 65536.0

Is equivalent to the above for positive integers, but slower, and harder to read. You're basically using the distributive law to separate a single division into two divisions, and write the first one using a bitshift.

For negative integers fixed & 0xffff doesn't give you the fractional bits, so it's not correct for negative numbers.

Look at the raw integer -1 which should map to -1/65536. This code returns 65535/65536 instead.

Depending on your compiler it might be faster to do:

fixed * (1/65536.0)

But I assume most modern compilers already do that optimization.

How does signed fixed-point "-0.5" looks like in memory anyway?

Inverting the conversion gives us:

RoundToInt(float*65536)

Setting float=-0.5 gives us: -32768.

score 6 · Answer 2 · answered Jul 09 '12 at 09:26

class FixedPointUtils {
  public static final int ONE = 0x10000;

  /**
   * Convert an array of floats to 16.16 fixed-point
   * @param arr The array
   * @return A newly allocated array of fixed-point values.
   */
  public static int[] toFixed(float[] arr) {
    int[] res = new int[arr.length];
    toFixed(arr, res);
    return res;
  }

  /**
   * Convert a float to  16.16 fixed-point representation
   * @param val The value to convert
   * @return The resulting fixed-point representation
   */
  public static int toFixed(float val) {
    return (int)(val * 65536F);
  }

  /**
   * Convert an array of floats to 16.16 fixed-point
   * @param arr The array of floats
   * @param storage The location to store the fixed-point values.
   */
  public static void toFixed(float[] arr, int[] storage)
  {
    for (int i=0;i<storage.length;i++) {
      storage[i] = toFixed(arr[i]);
    }
  }

  /**
   * Convert a 16.16 fixed-point value to floating point
   * @param val The fixed-point value
   * @return The equivalent floating-point value.
   */
  public static float toFloat(int val) {
    return ((float)val)/65536.0f;
  }

  /**
   * Convert an array of 16.16 fixed-point values to floating point
   * @param arr The array to convert
   * @return A newly allocated array of floats.
   */
  public static float[] toFloat(int[] arr) {
    float[] res = new float[arr.length];
    toFloat(arr, res);
    return res;
  }

  /**
   * Convert an array of 16.16 fixed-point values to floating point
   * @param arr The array to convert
   * @param storage Pre-allocated storage for the result.
   */
  public static void toFloat(int[] arr, float[] storage)
  {
    for (int i=0;i<storage.length;i++) {
      storage[i] = toFloat(arr[i]);
    }
  }

}

score 0 · Answer 3 · answered Jan 31 '18 at 13:15

After reading an answer by CodesInChaos I wrote a C++ function template, which is very convenient. You can pass the length of fractional part (for example, BMP file format uses 2.30 fixed point numbers). If fractional part length is omitted, the function assumes that fractional and integer parts have the same length

#include <math.h> // for NaN
#include <limits.h> // for CHAR_BIT = 8

template<class T> inline double fixed_point2double(const T& x, int frac_digits = (CHAR_BIT * sizeof(T)) / 2 )
{
  if (frac_digits >= CHAR_BIT * sizeof(T)) return NAN;
  return double(x) / double( T(1) << frac_digits) );
}

And if you want to read such number from memory, I wrote a function template

#include <math.h> // for NaN
#include <limits.h> // for CHAR_BIT = 8

template<class T> inline double read_little_endian_fixed_point(const unsigned char *x, int frac_digits = (CHAR_BIT * sizeof(T)) / 2)
// ! do not use for single byte types 'T'
{
  if (frac_digits >= CHAR_BIT * sizeof(T)) return NAN;

  T res = 0;

  for (int i = 0, shift = 0; i < sizeof(T); ++i, shift += CHAR_BIT)
    res |= ((T)x[i]) << shift;

  return double(res) / double( T(1) << frac_digits) );
}

How to convert packed integer (16.16) fixed-point to float?

3 Answers3

How to do the conversion?

How does signed fixed-point "-0.5" looks like in memory anyway?