Questions tagged [single-precision]

46 questions
33
votes
6 answers

Building a 32-bit float out of its 4 composite bytes

I'm trying to build a 32-bit float out of its 4 composite bytes. Is there a better (or more portable) way to do this than with the following method? #include typedef unsigned char uchar; float bytesToFloat(uchar b0, uchar b1, uchar b2,…
Madgeek
  • 333
  • 1
  • 3
  • 6
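
A minimal sketch of one portable way to assemble a 32-bit float from four bytes, written in Python rather than the question's own language. The function name `bytes_to_float` and the `byteorder` parameter are hypothetical, and the default assumes `b0` is the most significant byte (big-endian), which the truncated excerpt does not confirm.

```python
import struct

def bytes_to_float(b0, b1, b2, b3, byteorder=">"):
    """Reinterpret four byte values (0..255) as one IEEE 754 binary32 number.

    byteorder is a struct prefix: ">" means b0 is the most significant byte
    (an assumption, not something the question states), "<" means b0 is the
    least significant byte.
    """
    raw = bytes([b0, b1, b2, b3])          # raises ValueError outside 0..255
    return struct.unpack(byteorder + "f", raw)[0]

# 0x3F800000 is the binary32 bit pattern of 1.0.
one_big_endian = bytes_to_float(0x3F, 0x80, 0x00, 0x00)
one_little_endian = bytes_to_float(0x00, 0x00, 0x80, 0x3F, byteorder="<")
```

Stating the byte order explicitly, instead of relying on the host's native order, is what keeps this kind of conversion portable.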
13
votes
5 answers

How does C know what type to expect?

If all values are nothing more than one or more bytes, and no byte can contain metadata, how does the system keep track of what sort of number a byte represents? Looking into Two's Complement and Single Point on Wikipedia reveals how these numbers…
Jack Stout
  • 1,100
  • 2
  • 9
  • 24
8
votes
1 answer

CUDA C using single precision flop on doubles

The problem: During a project in CUDA C, I came across unexpected behaviour regarding single precision and double precision floating point operations. In the project, I first fill an array with numbers in one kernel and, in another kernel, I do some…
Frank
  • 352
  • 2
  • 13
7
votes
1 answer

Single-precision arithmetic broken when running x86-compiled code on a 64-bit machine

When you read MSDN on System.Single: Single complies with the IEC 60559:1989 (IEEE 754) standard for binary floating-point arithmetic. and the C# Language Specification: The float and double types are represented using the 32-bit…
Jeppe Stig Nielsen
  • 54,796
  • 9
  • 96
  • 154
4
votes
2 answers

Approximating cosine on [0,pi] using only single precision floating point

I'm currently working on an approximation of the cosine. Since the final target device is self-developed, with a 32-bit floating-point ALU / LU, and there is a specialized compiler for C, I am not able to use the C library math functions…
4
votes
3 answers

How to keep precision on int64_t = int64_t * float?

I would like to perform a correction on an int64_t by a factor in the range [0.01..1.2] with a precision of about 0.01. The naive implementation would be: int64_t apply_correction(int64_t y, float32_t factor) { return y *…
nowox
  • 19,233
  • 18
  • 91
  • 202
3
votes
1 answer

Single precision argument reduction for trigonometric functions in C

I have implemented some approximations for trigonometric functions (sin,cos,arctan) computed with single precision (32 bit floating point) in C. They are accurate to about +/- 2 ulp. My target device does not support any or methods.…
3
votes
1 answer

After converting bits to Double, how to store actual float/double value without using BigDecimal?

According to several floating-point calculators, as well as my code below, the following 32 bits 00111111010000000100000110001001 have an actual floating-point value of 0.750999987125396728515625. Since it is the actual Float value, I should…
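
For the bit pattern quoted in this excerpt, here is a small Python sketch (not the asker's code, which is not shown) that decodes the 32 bits and prints the exact stored value. The helper name `bits_to_float32` is hypothetical. It relies on the fact that every binary32 value is exactly representable as a double, so widening loses nothing.

```python
import struct
from decimal import Decimal

def bits_to_float32(bits):
    """Decode a 32-character string of '0'/'1' as an IEEE 754 binary32 value."""
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    raw = int(bits, 2).to_bytes(4, "big")   # most significant byte first
    return struct.unpack(">f", raw)[0]      # widened to a Python float (double)

value = bits_to_float32("00111111010000000100000110001001")

# Decimal(float) converts the stored binary value exactly, with no rounding.
exact = Decimal(value)
print(exact)  # 0.750999987125396728515625
```

The long decimal expansion is simply 12599689 / 2^24 written out in full, which is why it terminates after 24 digits.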
3
votes
1 answer

C - adding two single-precision floating point normal numbers, can't get result to infinity

I'm playing around with floating-point arithmetic, and I encountered something which needs explaining. When setting the rounding mode to 'towards zero', i.e. fesetround(FE_TOWARDZERO); and adding different kinds of normal positive numbers, I can never…
Shay Golan
  • 89
  • 7
3
votes
2 answers

subtracting double precision from single precision gives me 0. not what I want

I am trying to examine the round-off error associated with sin(x) using Octave. I get these numbers: >> single(sin(10)) ans = -0.544021129608154 >> sin(10) ans = -0.544021110889370 >> (single(sin(10))) - (sin(10)) ans = 0 which should be:…
user35053
  • 33
  • 6
2
votes
1 answer

IEEE 64 and 32 bit float validation in OCaml

I have a string matching the following regex \-?[0-9]*\.[0-9]+ which supposedly represents an IEEE floating point number. It could be single or double precision and I know the type in advance. I need to check if it could be interpreted as a valid…
krokodil
  • 1,286
  • 9
  • 17
2
votes
1 answer

Iterate through single precision floating point numbers between [1,2)

I am working on a program that requires me to iterate through all single precision floating point (23 fraction bits) numbers in the range of [1,2). I am not quite sure how to go about this. I am writing this program in C#. If someone could give me…
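
One common way to approach this, sketched here in Python rather than the asker's C#, is to step through the integer bit patterns instead of the float values. In binary32, 1.0 is 0x3F800000 and 2.0 is 0x40000000, and every pattern in between is a distinct float in [1,2). The generator name and the constant names below are hypothetical.

```python
import struct

ONE_BITS = 0x3F800000   # bit pattern of 1.0f
TWO_BITS = 0x40000000   # bit pattern of 2.0f (excluded from the range)

def floats_in_one_to_two(start=ONE_BITS, stop=TWO_BITS):
    """Yield every binary32 value whose bit pattern lies in [start, stop)."""
    for pattern in range(start, stop):
        # Pack the pattern as a 32-bit unsigned integer, then reinterpret
        # the same four bytes as a float.
        yield struct.unpack(">f", struct.pack(">I", pattern))[0]

# There are 2**23 patterns, one for each setting of the 23 fraction bits.
total = TWO_BITS - ONE_BITS

# Peek at both ends without walking all 8,388,608 values.
first = next(floats_in_one_to_two())
last = next(floats_in_one_to_two(start=TWO_BITS - 1))
```

Because consecutive patterns in this range differ by exactly one unit in the last place, the loop visits the values in increasing order with no gaps and no duplicates.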
1
vote
2 answers

What is the difference between a uint8 and a single image?

I already know uint8 contains intensity values between 0 and 255 (2^8 - 1) and single contains values between 0 and 1, it is used to hold larger values without upsetting the range error. But, apart from that, are there any other differences? What is…
1
vote
0 answers

Can't replace Fortran real variables by double precision variables or more precision

I am using a known code (CAMB) which generates values like this : k(h/Mpc) Pk/s8^2(Mpc/h)^3 5.2781500000e-06 1.9477400000e+01 5.5479700000e-06 2.0432300000e+01 5.8315700000e-06 2.1434000000e+01 6.1296700000e-06 2.2484700000e+01 6.4430100000e-06…
youpilat13
  • 65
  • 3
  • 25
  • 65
1
vote
1 answer

Fixed-point instead of floating point

How many bits does fixed-point number need to be at least as precise as floating point number? If I wanted to carry calculations in fixed-point arithmetic instead of floating-point, how many bits would I need for the calculations to be not less…
Ecir Hana
  • 9,122
  • 13
  • 58
  • 105