
I've made a BOMDAS calculator in C++ that uses doubles. Whenever I input an expression like

1000000000000000000000*1000000000000000000000

I get a result like 1000000000000000000004341624882808674582528.000000. I suspect it has something to do with floating-point numbers.

phuclv
  • @mc110 thanks for the edit, bro :) – Sky Lightna Jun 14 '14 at 21:34
  • 4
    http://stackoverflow.com/questions/872544/precision-of-floating-point has more information on floating point precision limitations which will be relevant - if you use FP representation in your calculator, you will have to expect that you will see these sort of problems. – mc110 Jun 14 '14 at 21:38
  • Or http://stackoverflow.com/questions/9999221/double-precision-decimal-places – Drew Dormann Jun 14 '14 at 21:43
  • @mc110 So basically, I must just learn to live with this? – Sky Lightna Jun 14 '14 at 21:49
  • Yes, with floating point you extend the range of values you can hold in a specific number of bits at the cost of limited precision. That is why you would not expect to use FP for something like financial transactions where you need the answer to be completely accurate. – mc110 Jun 14 '14 at 21:50
  • @mc110 If you'd like to put that into answer form, I'd be happy to mark it as the answer – Sky Lightna Jun 14 '14 at 21:52
  • 2
    @SkyLightna you only have to live with that if you have to live with using `double`s. You can get "infinite" precision with other types. – Drew Dormann Jun 14 '14 at 21:59
  • Every data type is a trade-off. FP goes for space and time efficient representation and arithmetic for close approximations to a wide range of values. If that is not the right trade-off for your application, use a different data type. – Patricia Shanahan Jun 15 '14 at 14:51

3 Answers


Floating-point numbers represent values with a fixed-size representation. A double can hold about 16 decimal digits in a form where those digits can be restored (internally, it normally stores the value in base 2, which means it cannot accurately represent most fractional decimal values). If the number of digits is exceeded, the value is rounded appropriately. The upshot is that you won't necessarily get back the digits you're hoping for: if you ask for more than 16 decimal digits, either explicitly or implicitly (e.g. by setting the format to std::ios_base::fixed with numbers bigger than 1e16), the formatting will conjure up more digits. It accurately renders the internally held binary value, which may produce up to, I think, 54 non-zero digits.

If you want to compute with large values accurately, you'll need some variable-sized representation. Since your values are integers, a big-integer representation might work. These will typically be a lot slower to compute with than double.

Dietmar Kühl

A double stores 53 bits of mantissa, which is about 15 decimal digits. Your problem is that a double cannot store as many digits as you are trying to store: digits beyond the 15th significant digit will not be accurate.

Louis Newstrom
  • So if I somehow limit the size of the double to 15 digits, I'd get a more accurate looking result? – Sky Lightna Jun 14 '14 at 23:01
  • @SkyLightna no, multiplying two 15-digit numbers results in a 30-digit one. You'll only get the first 15 digits of the result (or more precisely 53 bits) correctly – phuclv Jun 16 '14 at 02:33
  • @LưuVĩnhPhúc So I could convert the result to a string and truncate it to make it fit within the accurate limits and then convert it back to a double? – Sky Lightna Jun 21 '14 at 01:06
no, no matter how much precision the string can store, after converting back to double it can only keep the 15 or so most significant digits. Like I said, use int64_t or __int128_t if that fits your needs; otherwise you'll need arbitrary-precision arithmetic. Not to mention that strings are extremely inefficient: libraries use base 2^32 (or 2^64 on 64-bit systems) instead of base 10 – phuclv Jun 21 '14 at 02:14

That's not an error. It's a consequence of how floating-point types are represented: the result is correct to double precision.

Floating-point types in computers are represented in the form (-1)^sign × mantissa × 2^exp, so they trade a wider range for limited precision. They're only accurate to the precision of the mantissa, and the result of every operation is rounded accordingly. The double type is most commonly implemented as IEEE-754 64-bit double precision with a 53-bit mantissa, so it is correct to log10(2^53) ≈ 15.955 decimal digits. Computing 1e21*1e21 produces 1e42, which, rounded to the closest value representable in double precision, gives the value that you saw. If you round that result to 16 significant digits, it is exactly 1e42.

If you need more range, use long double (though on many platforms it is no wider than double). If you only work with integers, then int64_t (or __int128 with gcc and many other compilers on 64-bit platforms) offers more precision (64/128 bits compared to 53 bits). If you need even more precision, use an arbitrary-precision arithmetic library such as GMP.

phuclv
  • 1
    Actually, "error" is the mathematical term for the difference between approximated value and actual value. It **is** an error. – SOFe Jun 13 '20 at 08:52