
I've been playing around with floating point numbers a little bit, and based on what I've learned about them in the past, the fact that 0.1 + 0.2 ends up being something like 0.30000000000000004 doesn't surprise me.

What does surprise me, however, is that integer arithmetic always seems to work just fine and not have any of these artifacts.

I first noticed this in JavaScript (Chrome V8 in node.js):

0.1 + 0.2 == 0.3 // false, NOT surprising
123456789012 + 18 == 123456789030  // true
22334455667788 + 998877665544 == 23333333333332 // true
1048576 / 1024 == 1024  // true

C++ (gcc on Mac OS X) seems to have the same properties.

The net result seems to be that integer numbers just — for lack of a better word — work. It's only when I start using decimal numbers that things get wonky.

Is this a feature of the design, a mathematical artifact, or some optimisation done by compilers and runtime environments?

MarcWan
  • Try with numbers bigger than `2^53` and you'll see. :) – Mysticial Oct 25 '12 at 08:39
  • Integers work until you make them big enough. You have 53 bits (I think) in the mantissa of a typical floating point implementation. That's enough for some big integers, but make them big enough and you'll get problems. – john Oct 25 '12 at 08:39
  • Is `22334455667788` an integer or a floating-point constant in Javascript? – Andrey Oct 25 '12 at 08:44
  • Lots of great answers to this question, all centre around the fact that the floating point mantissa CAN accurately represent all integers less than 2^53. – MarcWan Oct 26 '12 at 04:42
  • So, sorry for all the answers I *didn't* choose. Can only pick one @_@ – MarcWan Oct 26 '12 at 04:42
  • @Andrey -- JavaScript ONLY has floating point numbers. All numbers are 64bit IEEE 754 double precision floating point. – MarcWan Oct 26 '12 at 04:45

8 Answers


Is this a feature of the design, a mathematical artifact, or some optimisation done by compilers and runtime environments?

It's a feature of the real numbers. A theorem from modern algebra (modern algebra, not high school algebra; math majors take a class in modern algebra after their basic calculus and linear algebra classes) says that for any integer base b > 1, any positive real number r can be expressed as r = a * b^p, where a is in [1, b) and p is some integer. For example, 1024 = 1.024 * 10^3. It is this theorem that justifies our use of scientific notation.

That number a can be classified as terminating (e.g. 1.0), repeating (1/3 = 0.333...), or non-repeating (the representation of pi). There's a minor issue here with terminating numbers: any terminating number can also be represented as a repeating number. For example, 0.999... and 1 are the same number. This ambiguity in representation can be resolved by specifying that numbers which can be represented as terminating numbers are represented as such.

What you have discovered is a consequence of the fact that all integers have a terminating representation in any base.

There is an issue here with how the reals are represented in a computer. Just as int and long long int don't represent all of the integers, float and double don't represent all of the reals. The scheme used on most computers to represent a real number r is to write it in the form r = a * 2^p, but with the mantissa (or significand) a truncated to a certain number of bits and the exponent p limited to some finite range. This means that some integers cannot be represented exactly. For example, even though a googol (10^100) is an integer, its floating-point representation is not exact. The base-2 representation of a googol is a 333-bit number, and that 333-bit mantissa is truncated to 52+1 bits.

One consequence of this is that double-precision arithmetic is no longer exact, even for integers, if the integers in question are greater than 2^53. Try your experiment using the type unsigned long long int on values between 2^53 and 2^64. You'll find that double-precision arithmetic is no longer exact for these large integers.
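You can watch that boundary directly in node.js (the environment from the question). This is a quick sketch of my own, not code from the answer:

// Doubles carry a 53-bit significand, so consecutive integers stop being
// distinguishable at 2^53.
var limit = Math.pow(2, 53);          // 9007199254740992, itself exact
console.log(limit - 1 === limit - 2); // false: below 2^53 every integer is distinct
console.log(limit === limit + 1);     // true: 2^53 + 1 rounds back down to 2^53
console.log(1e100 === 1e100 + 1);     // true: a googol's neighbours collapse too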

David Hammen

I'm writing this under the assumption that Javascript uses double-precision floating-point representation for all numbers.

Some numbers have an exact representation in the floating-point format: in particular, all integers x such that |x| < 2^53. Some numbers don't: in particular, fractions such as 0.1 or 0.2, which become infinitely repeating fractions in binary representation.

If all operands and the result of an operation have an exact representation, then it is safe to compare the result using ==.
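Applying that to the numbers from the question (a quick node.js check of my own):

// Every operand and result here is an integer below 2^53, so each one is
// stored exactly and == behaves as expected.
console.log(123456789012 + 18 === 123456789030); // true
// 0.1, 0.2 and 0.3 are all rounded on storage, so == compares the roundings.
console.log(0.1 + 0.2 === 0.3); // false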

Related questions:

What number in binary can only be represented as an approximation?

Why can't decimal numbers be represented exactly in binary?

Andrey
  • You can extend it to |x| <= 2^53: for obvious reasons, 2^53 is also represented exactly in IEEE 754 double precision floating point format. The smallest-magnitude integer that cannot be is 2^53+1. – aka.nice Oct 25 '12 at 20:17

Integers within the representable range are exactly representable by the machine; floats are not (well, most of them aren't).

If by "feature" you mean "basic integer math working", then yes, you can assume that correctly implemented arithmetic is a feature.

Luchian Grigore
  • OP is apparently assuming that all numbers in his example are of a floating-point type – Andrey Oct 25 '12 at 08:43
  • @Andrey he seems fully aware that they're integers - "integer arithmetic always seems to work just fine" – Luchian Grigore Oct 25 '12 at 08:44
  • By reading the title, I would think he might mean "floating-point arithmetic always seems to work just fine with whole numbers". I don't know what the datatype of those constants is in Javascript – Andrey Oct 25 '12 at 08:47
  • @Andrey: There is only one numeric type in Javascript, and most other languages would describe it as `double`. So there's no difference in Javascript between "integer arithmetic" and "floating point arithmetic with whole numbers". That's "logically", of course -- I don't know whether Javascript optimizers detect when they can use integer ops – Steve Jessop Oct 25 '12 at 09:00
  • @SteveJessop `2+2` is executed as `(float)2+(float)2`, right? – Andrey Oct 25 '12 at 09:04
  • @Andrey: if by `float` you mean "IEEE double precision floating-point", then that is its meaning. How it's executed is up to the optimizer. – Steve Jessop Oct 25 '12 at 09:05

The reason is that every whole number (1, 2, 3, ...) can be represented exactly in binary (0001, 0010, 0011, ...).

That is why integer arithmetic is always correct: 0011 - 0001 is always 0010. The problem with floating-point numbers is that the part after the point often cannot be converted exactly to binary.
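You can see both halves of that claim in node.js (my own illustration; toString(2) prints a number's binary form):

// Integer subtraction is exact bit arithmetic: 0011 - 0001 = 0010.
console.log((3 - 1).toString(2)); // "10"
// The fractional part of 0.1 never terminates in base 2; what gets stored
// (and printed here) is the rounded 53-bit approximation.
console.log((0.1).toString(2));   // "0.000110011001100110011..." (truncated)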

LuigiEdlCarno

All of the cases that you say "work" are ones where the numbers you have given can be represented exactly in the floating-point format. You'll find that adding 0.25, 0.5, and 0.125 works exactly too, because they can also be represented exactly in binary floating point.

It's only with values that can't be, such as 0.1, that you'll get what appear to be inexact results.
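Checking that suggestion directly (my own quick test in node.js):

// 0.25, 0.5 and 0.125 are 2^-2, 2^-1 and 2^-3: all exact binary fractions.
console.log(0.25 + 0.5 + 0.125 === 0.875); // true
console.log((0.875).toString(2));          // "0.111", a finite binary expansion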

jcoder

Integers are exact because the imprecision results mainly from the way we write decimal fractions, and secondarily because many rational numbers simply don't have terminating representations in any given base.

See: https://stackoverflow.com/a/9650037/140740 for the full explanation.

DigitalRoss

That method only works when you are adding a small enough integer to a very large integer, and even in that case you are not representing both of the integers exactly in the floating-point format.

Aki Suihkonen

Not all numbers can be represented exactly in floating point; it's due to the way they are encoded. The wiki page explains it better than I can: http://en.wikipedia.org/wiki/IEEE_754-1985. So when you want to compare floating point numbers, you should use a delta:

Math.abs(myFloat - expectedFloat) < delta

You can use the smallest representable floating point number as delta.
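A minimal sketch of that idea in node.js (the function name and the delta value are my own choices; see the comments below for why a fixed absolute delta is often the wrong tool):

// Absolute-tolerance comparison: adequate for values near 1.0, misleading
// for values that are much larger or much smaller.
function nearlyEqual(a, b, delta) {
  return Math.abs(a - b) < delta;
}
console.log(nearlyEqual(0.1 + 0.2, 0.3, 1e-9)); // true
console.log(0.1 + 0.2 === 0.3);                 // false: the problem being worked around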

Patrice Bernassola
  • This is not the way to do it. The comparison should be done in proportion to the values tested. It would mean subtracting the floats or doubles as *integers* and testing if the delta <= N, where N is also a small integer – Aki Suihkonen Oct 25 '12 at 08:46
  • "Nearly equal" is an advanced technique. Don't use it unless you thoroughly understand its implications. In particular, when `a` is nearly equal to `b` and `b` is nearly equal to `c`, it does not follow that `a` is nearly equal to `c`. – Pete Becker Oct 25 '12 at 14:30