
This question is very similar to this post - but not exactly

I have some data in a .csv file. The data has precision to the 4th digit (#.####).

Calculating the mean in Excel or SAS gives a result with precision to the 5th digit (#.#####), but using numpy:

import numpy as np
data = np.recfromcsv(path2file, delimiter=';', names=['measurements'], dtype=np.float64)
rawD = data['measurements']
print np.average(rawD)

gives a number like this

#.#####999999999994

Clearly something is wrong.

using

from math import fsum
print fsum(rawD.ravel())/rawD.size

gives

#.#####

Is there anything in np.average that I set wrong?


BONUS info:

I'm only working with 200 data points in the array


UPDATE

I thought I should make my case more clear.

I have numbers like 4.2730 in my csv (giving a 4 decimal precision - even though the 4th always is zero [not part of the subject so don't mind that])

Calculating an average/mean by numpy gives me this

4.2516499999999994

Formatted to four decimals, it prints as

>>>print "%.4f" % np.average(rawD)
4.2516

Doing the same thing in Excel or SAS gives me this:

4.2517

I believe this is the true average value, because Excel finds the mean to be 4.25165. This code also illustrates it:

answer = 0
for number in rawD:
    answer += int(number*1000)
print answer/2
425165

So how do I tell np.average() to calculate this value?


I'm a bit surprised that numpy did this to me. I thought I only needed to worry if I was dealing with 16-digit numbers. I didn't expect that rounding at the 4th decimal place would be affected by this.

I know I could use

fsum(rawD.ravel())/rawD.size

But I also have other things (like std) I want to calculate with the same precision

UPDATE 2

I thought I could make a temp solution by

>>>print "%.4f" % np.float64("%.5f" % np.mean(rawD))
4.2416

Which did not solve the case. Then I tried

>>>print "%.4f" % float("4.24165")
4.2416

AHA! There is a bug in the formatter: Issue 5118

To be honest I don't care if python stores 4.24165 as 4.241649999... It's still a round off error - NO MATTER WHAT.

If the interpreter can figure out how to display the number

>>>print float("4.24165")
4.24165

then so should the formatter, and it should work with that number when rounding.

It still doesn't change the fact that I have a round off problem (now both with the formatter and numpy)
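
A minimal sketch of why `print` and the `"%.4f"` formatter can disagree, assuming CPython 2.7 or later (where `Decimal` accepts a float directly). `Decimal(x)` exposes the exact binary value that is stored for `4.24165`; the formatter rounds that stored value, not the short `4.24165` that `print` displays. The variable names are illustrative only.

```python
from decimal import Decimal

x = float("4.24165")

# Decimal(float) converts the stored binary value exactly, digit for digit.
exact = Decimal(x)
print(exact)

# The stored value is not exactly 4.24165, so it sits on one side of the tie.
stored_below_tie = exact < Decimal("4.24165")
print(stored_below_tie)

# "%.4f" rounds the exact stored value to four decimals.
formatted = "%.4f" % x
print(formatted)
```

If the stored value lies below the tie, as the `4.2416` output above suggests, the formatter is rounding correctly with respect to what is actually in memory.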

In case you need some numbers to help me out then I have made this modified .csv file:

Download it from here

(I'm aware that this file does not have the number of digits I explained earlier and that the average gives ..9988 at the end instead of ..9994 - it's modified)

I guess my question boils down to this: how do I get a string output like the one Excel gives me if I use =average(), and have it round off correctly if I choose to show only 4 digits?

I know that this might seem strange to some, but I have my reasons for wanting to reproduce the behavior of Excel.

Any help would be appreciated, thank you.

Norfeldt
  • You've already found Issue 5118 where it is explained why such behavior is by design. If you need accuracy down to the very last decimal, you could use the decimal module. – e1i45 Aug 02 '13 at 15:52

2 Answers


To get exact decimal numbers, you need to use decimal arithmetic instead of binary. Python provides the decimal module for this.
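
As a rough sketch of the all-decimal route, assuming the CSV values are kept as strings rather than parsed into floats first. The two sample values are hypothetical, chosen so that their mean is 4.25165.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical sample standing in for the CSV column, kept as strings
# so that no binary rounding happens on the way in.
raw = ["4.2730", "4.2303"]
values = [Decimal(s) for s in raw]

# Decimal arithmetic: the sum and the division are exact for these inputs.
mean = sum(values) / len(values)
print(mean)

# Round half up to four places, the way a spreadsheet display would.
rounded = mean.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)
print(rounded)  # 4.2517
```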

If you want to continue to use numpy for the calculations and simply round the result, you can still do this with decimal. You do it in two steps, rounding to a large number of digits to eliminate the accumulated error, then rounding to the desired precision. The quantize method is used for rounding.

from decimal import Decimal,ROUND_HALF_UP
ten_places = Decimal('0.0000000001')
four_places = Decimal('0.0001')
mean = 4.2516499999999994
print Decimal(mean).quantize(ten_places).quantize(four_places, rounding=ROUND_HALF_UP)
4.2517
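
The same two-step rounding can be wrapped in a small helper so it can be reused for other float results, such as the output of `np.mean` or `np.std`. This is only a sketch: `round_half_up`, `places` and `guard_places` are illustrative names, not part of `decimal` or numpy.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, places=4, guard_places=10):
    """Round a float half-up to `places` decimals.

    The value is first rounded to `guard_places` decimals to absorb
    accumulated binary error, then rounded half-up to `places`.
    """
    guard = Decimal(10) ** -guard_places   # e.g. Decimal('1E-10')
    target = Decimal(10) ** -places        # e.g. Decimal('0.0001')
    return Decimal(value).quantize(guard).quantize(target, rounding=ROUND_HALF_UP)


print(round_half_up(4.2516499999999994))  # 4.2517
```

The trade-off is that any value within about 1e-10 below a tie is treated as the tie itself, which is the intent here but may not suit every data set.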
Mark Ransom

The value returned by average is a double. When you print a double, by default all of its digits are printed. What you see here is the result of limited floating-point precision, which is not a numpy problem but a general computing problem. If you care about the presentation of your float value, use "%.4f" % avg_val. There is also a package for rational numbers, which avoids representing fractions as floating-point numbers, but I guess that's not what you're looking for.
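
A small sketch of that limited precision, plus the rational-number alternative, assuming the standard-library `fractions` module is the package meant. The sample values are hypothetical.

```python
import math
from fractions import Fraction

tenths = [0.1] * 10

# Plain left-to-right float additions accumulate a small rounding error.
total = 0.0
for t in tenths:
    total += t
print(total == 1.0)  # False

# math.fsum tracks the lost low-order bits and rounds only once at the end.
exact_sum = math.fsum(tenths)
print(exact_sum == 1.0)  # True

# Rational arithmetic on values parsed from their decimal strings is exact.
samples = [Fraction("4.2730"), Fraction("4.2303")]
mean = sum(samples) / len(samples)
print(mean == Fraction(425165, 100000))  # True
```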

As for your second snippet, summing all the values by hand and then dividing, I suppose you're using Python 2.7 and all your input values are integers. In that case you get an integer division, which truncates everything after the decimal point and yields another integer value.
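
For illustration, a short sketch of the two kinds of division. In Python 2, `/` between two integers behaves like the `//` shown here, while in Python 3 `/` is always true division. The integer total is a made-up value.

```python
total = 850331  # hypothetical integer sum

# Floor division drops the fractional part.
floored = total // 2
print(floored)  # 425165

# Dividing by a float keeps the fractional part in both Python 2 and 3.
true_quotient = total / 2.0
print(true_quotient)  # 425165.5
```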

Michael
  • my problem was actually that the `"%.4f" % avg_val` rounds down the last digit – Norfeldt Jul 30 '13 at 11:38
  • I use python 2.7 yes, but my values are floats – Norfeldt Jul 30 '13 at 11:39
  • (I use python 2.7 [32bit] but didn't expect to have any problems with 4 digit numbers) – Norfeldt Jul 31 '13 at 11:04
  • From what I see, numpy is doing everything right. What result do you get from Excel with higher precision? Are you sure its rounding is correct? Is it rounding .5 upwards or towards the nearest even digit? I also couldn't find `fmsum` in my numpy, so I'm not sure what implementation details are different there. – Michael Jul 31 '13 at 15:57
  • Also, if you really need these kind of precision, don't work with float variables at all and better use the `decimal` module, exactly as mentioned in the bug report. – Michael Jul 31 '13 at 16:02
  • Thank you for your time @Michael I really appreciate it. The fmsum is from the math module (I just forgot to mention it) – Norfeldt Jul 31 '13 at 20:43
  • BTW I made a typing error. Not `fmsum` but `fsum` from the math module. Sorry for that – Norfeldt Aug 03 '13 at 09:47