Causes of floating point non-determinism? Including NumPy?

Question

IEEE floating point operations are deterministic, but see How can floating point calculations be made deterministic? for one way that an overall floating point computation can be non-deterministic:

... parallel computations are non-deterministic in terms of the order in which floating-point computations are performed, which can result in non-bit-exact results across runs.

Two-part question:

How else can an overall floating point computation be non-deterministic, yielding results that are not exactly equal?
Consider a single-threaded Python program that calls NumPy, CVXOPT, and SciPy subroutines such as scipy.optimize.fsolve(), which in turn call native libraries like MINPACK and GLPK and optimized linear algebra subroutines like BLAS, ATLAS, and MKL. “If your numpy/scipy is compiled using one of these, then dot() will be computed in parallel (if this is faster) without you doing anything.”

Do these native libraries ever parallelize in a way that introduces non-deterministic results?

Assumptions:

The same software, with the same inputs, on the same hardware. The output of multiple runs should be equal.
- If that works, it's highly desirable to test that the output after doing a code refactoring is equal. (Yes, some changes in order of operations can make some of the output not-equal.)
All random numbers in the program are psuedo-random numbers used in a consistent way from the same seeds across all runs.
No uninitialized values. Python is generally safe in that way but numpy.empty() returns a new array without initializing entries. And it's not clear that it's much faster in practice. So beware!
- @PaulPanzer's test shows that numpy.empty() does return an uninitialized array and it can easily and quickly recycle a recent array:
  
  import numpy as np np.arange(100); np.empty(100, int); np.empty(100, int) np.arange(100, 200.0); np.empty(100, float); np.empty(100, float)
- It's tricky to get useful timing measurements for these routines! In a timeit loop, numpy.empty() can just keep reallocating the same one or two memory nodes. The time is independent of the array size. To prevent recycling:
  
  from timeit import timeit timeit('l.append(numpy.empty(100000))', 'import numpy; l = []') timeit('l.append(numpy.zeros(100000))', 'import numpy; l = []')
  
  but reducing that array size to numpy.zeros(10000) takes 15x as long; reducing it to numpy.zeros(1000) takes 1.3x as long (on my MBP). Puzzling.

See also: Hash values are salted in Python 3 and each dict preserves insertion order. That could vary the order of operations from run to run. [I'm wrangling with this problem in Python 2.7.15.]

You might be interested in https://github.com/numpy/numpy/issues/12394, and also https://github.com/numpy/numpy/issues/12465 — Warren Weckesser, Dec 05 '18 at 07:02
_"In my measurements, empty() takes about as long as zeros() [...]"_ That really depends. In a context of frequent dealloc/realloc (such as a simple `timeit`) `empty` can be orders of magnitude faster than `zeros`. And there are valid uses of `empty`, inversion of a permutation being just one example. — Paul Panzer, Dec 05 '18 at 07:50
Just a short anecdote. I ran into this exact problem recently with some code that was doing deterministic minimization in C. After what I thought should just be a cosmetic change I was getting slightly different results. It turned out the issue was that I had changed a line like `a *= b/c` to `a = a*b/c` which gave *very* slightly different results. Since this resulted in a slightly different result it caused the minimizer to try a different next point and the effect snowballed such that I was getting noticeably different results between the two commits. — user545424, Dec 05 '18 at 16:12
@PaulPanzer It turns out that `numpy.empty()` is zeroing out the arrays on my MBP. That explains the time similarity. Maybe that's a consequence of NumPy calling Apple's Accelerate framework, which is the default instead of `openblas`. I'll retest that with MKL. — Jerry101, Dec 05 '18 at 18:54
According to [this](https://stackoverflow.com/a/8029624/7207392) zeroing may happen at OS level. What do you get if in a plain (non ipython/jupyter) shell you type `np.arange(100)` and then twice `np.empty(100, int)`? — Paul Panzer, Dec 05 '18 at 19:29
Nice test, @PaulPanzer! The first `empty()` returned non-zero garbage values; the second result looks like the `arange()` memory recycled. Testing `arange(100, 200.0); empty(100, float); empty(100, float)`, both `empty()` calls returned those same arange values. [If the OS was zeroing the memory, `numpy.zeros()` would have to assume that or it'd be slower, yes?] — Jerry101, Dec 05 '18 at 19:50
My guess would be that `zeros` calls `calloc` and `calloc` being the one who decides whether to recycle or ask for new memory would indeed know whether it has to do the zeroing itself or can assume it's been done already. Mind you, I'm not a trained computer scientist. — Paul Panzer, Dec 05 '18 at 20:05

score 0 · Answer 1 · answered Dec 10 '18 at 00:17

I found that most (not all) of the non-determinism problems I'm experiencing seem to be fixed in the code for OpenBLAS 0.3.5.

A bunch of threading problems in earlier versions of OpenBLAS are fixed in release 0.3.4, but that release has a macOS compatibility bug that's fixed in the code for release 0.3.5. The bugs also occurs with Apple's Accelerate framework version 1.1 and Intel's MKL mkl==2019.0.

See how to install OpenBLAS and compile NumPy and SciPy on it.

Perhaps the remaining problems I'm experiencing are due to other libraries linked to Accelerate?

Note: I'm still open to more answers to this question.

Causes of floating point non-determinism? Including NumPy?

1 Answers1