I found a strange performance difference while evaluating an expression in Numpy.

I executed the following code:

import numpy as np
myarr = np.random.uniform(-1,1,[1100,1100])

and then

%timeit np.exp( - 0.5 * (myarr / 0.001)**2 )
>> 184 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

and

%timeit np.exp( - 0.5 * (myarr / 0.1)**2 )
>> 12.3 ms ± 34.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's an almost 15x speedup in the second case! Note that the only difference is the divisor being 0.1 or 0.001.

What's the reason for this behaviour? Can I change something to make the first calculation as fast as the second?

Ethunxxx
  • On OSX and NumPy 1.15.3, Python 3.7, I see 11ms vs 13ms. – Nils Werner Nov 21 '18 at 15:47
  • In Win7, NumPy 1.15.3, Python 3.6.2, I see 227 vs 22 ms – Brenlla Nov 21 '18 at 15:49
  • @jpp I'm using Ubuntu 18.04, Python 3.6.6, Numpy 1.15.4 – Ethunxxx Nov 21 '18 at 15:51
  • 1
    OK, on Windows, NumPy 1.14.3, Python 3.6.0, I see 97.7ms vs 47.7ms. – jpp Nov 21 '18 at 15:53
  • 3
    In my system, `exp` of large (negative) numbers are slower: `exp(-1)` is faster than `exp(-1000)`. So it probably comes down to some slower covergence of the `exp` algorithm with large numbers – Brenlla Nov 21 '18 at 15:55
  • @Brenlla yes, that's because division is faster than multiplication. – Matt Messersmith Nov 21 '18 at 16:15
  • @MattMessersmith I mean literally exp(-1) is faster than exp(-1000), without multiplication / division involved – Brenlla Nov 21 '18 at 16:17
  • @Brenlla exp(-1) *does* division under the hood. It's equivalent to 1/exp(1). Division time depends on the size of the operand as well. – Matt Messersmith Nov 21 '18 at 16:19
  • 1
    @MattMessersmith Reasonable explanation, but nope. `exp(1)` is still much faster than `exp(1000)` – Brenlla Nov 21 '18 at 16:23
  • 1
    @Brenlla Typo: i mean division *slower* than multiplication. But a caveat is that *both* operations depend somewhat on the magnitude of the result: division much more so – Matt Messersmith Nov 21 '18 at 16:27
  • @Brenlla Also `exp(1)` is about the same as `exp(1000)` on my system – Matt Messersmith Nov 21 '18 at 16:28
  • You'll probably have to find the source code of the C function `exp` in the libc library provided by your platform. – Warren Weckesser Nov 21 '18 at 18:36
  • Note that in the slow case, most of the calculations result in *underflow* and return 0. On my system, I can see the difference quite clearly by timing `exp(a)` with `a = np.full(10000, fill_value=-708.0)` and then with `a = np.full(10000, fill_value=-709.0)`. – Warren Weckesser Nov 21 '18 at 18:58
  • 2
    My first guess (based on the title) was that there are some denormalized numbers involved - see https://stackoverflow.com/questions/36781881/why-denormalized-floats-are-so-much-slower-than-other-floats-from-hardware-arch I didn't verify this in all depth for the specific numpy/python setup, but they can be **awfully** slow... – Marco13 Nov 21 '18 at 19:53
  • I'm not seeing any substantial difference between these on my computer, although the first run of the first option was much slower probably due to caching. – user2699 Nov 21 '18 at 20:13
  • 2
    @Marco13, yes, in fact, `exp(-708)` is a normal float, and `exp(-709)` is denormal, and that's where I see (on Mac OS X) a big jump in execution time. Underflow to zero doesn't occur until about `exp(-746)`. – Warren Weckesser Nov 21 '18 at 20:48
  • You could use Numba (including SVML) or, for such simple cases, preferably Numexpr. Case 1 and Case 2 are then equally fast, and both are faster than the NumPy version of Case 2. – max9111 Nov 22 '18 at 10:04
  • @max9111 Thanks for the suggestion. Actually, I first tried with Numexpr and got the same performance difference. I then checked the behaviour with pure numpy and then posted this question. – Ethunxxx Nov 22 '18 at 10:16
  • Oh, I made a mistake in the Numexpr timings, but nevertheless Numba works fine in both cases. This is really weird behaviour. Both around 3.6ms per call. – max9111 Nov 22 '18 at 10:54
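The underflow and denormal boundaries mentioned in the comments above can be checked directly. A minimal sketch (not part of the original thread) using `np.finfo` to get the smallest normal double:

```python
import numpy as np

tiny = np.finfo(np.float64).tiny  # smallest *normal* double, ~2.23e-308

a = np.exp(-708.0)  # still a normal float
b = np.exp(-709.0)  # subnormal (denormal): nonzero but below `tiny`
c = np.exp(-746.0)  # underflows all the way to 0.0

print(a > tiny, 0.0 < b < tiny, c == 0.0)  # prints: True True True
```

So the slow case spends most of its time computing `exp` of huge negative arguments whose results are subnormal or underflow entirely.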

2 Answers


Use Intel SVML

I don't have a working numexpr build with Intel SVML, but numexpr with working SVML should perform as well as Numba. The Numba benchmarks below show much the same behaviour without SVML, but perform much better with SVML.

Code

import numpy as np
import numba as nb
import numexpr as ne

myarr = np.random.uniform(-1,1,[1100,1100])

@nb.njit(error_model="numpy", parallel=True)
def func(arr, div):
    # use the function argument, not the global array
    return np.exp(-0.5 * (arr / div)**2)

Timings

#Core i7 4771
#Windows 7 x64
#Anaconda Python 3.5.5
#Numba 0.41 (compilation overhead excluded)
func(myarr,0.1)                      -> 3.6ms
func(myarr,0.001)                    -> 3.8ms

#Numba (set NUMBA_DISABLE_INTEL_SVML=1), parallel=True
func(myarr,0.1)                      -> 5.19ms
func(myarr,0.001)                    -> 12.0ms

#Numba (set NUMBA_DISABLE_INTEL_SVML=1), parallel=False
func(myarr,0.1)                      -> 16.7ms
func(myarr,0.001)                    -> 63.2ms

#Numpy (1.13.3), set OMP_NUM_THREADS=4
np.exp( - 0.5 * (myarr / 0.001)**2 ) -> 70.82ms
np.exp( - 0.5 * (myarr / 0.1)**2 )   -> 12.58ms

#Numpy (1.13.3), set OMP_NUM_THREADS=1
np.exp( - 0.5 * (myarr / 0.001)**2 ) -> 189.4ms
np.exp( - 0.5 * (myarr / 0.1)**2 )   -> 17.4ms

#Numexpr (2.6.8), no SVML, parallel
ne.evaluate("exp( - 0.5 * (myarr / 0.001)**2 )") -> 17.2ms
ne.evaluate("exp( - 0.5 * (myarr / 0.1)**2 )")   -> 4.38ms

#Numexpr (2.6.8), no SVML, single threaded
ne.evaluate("exp( - 0.5 * (myarr / 0.001)**2 )") -> 50.85ms
ne.evaluate("exp( - 0.5 * (myarr / 0.1)**2 )")   -> 13.9ms
max9111

The slow expression likely produces denormalised (subnormal) numbers, which slow down computations.

You may like to disable denormalized numbers using the `daz` library:

import daz
daz.set_daz()

More info: x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ):

To avoid serialization and performance issues due to denormals and underflow numbers, use the SSE and SSE2 instructions to set Flush-to-Zero and Denormals-Are-Zero modes within the hardware to enable highest performance for floating-point applications.

Note that in 64-bit mode floating point computations use SSE instructions, not x87.
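If installing `daz` isn't an option, a plain-NumPy workaround (my sketch, not part of this answer) is to flush the offending exponents yourself: `exp(z)` goes subnormal once `z` drops below roughly `log(tiny)` ≈ -708.4, so mapping those entries to `-inf` makes `exp` return an exact 0.0 instead of a denormal:

```python
import numpy as np

myarr = np.random.uniform(-1, 1, [1100, 1100])

z = -0.5 * (myarr / 0.001) ** 2
# entries with z < -708 would yield subnormal (or underflowed) exp(z);
# replacing them with -inf gives an exact 0.0 and sidesteps denormals
fast = np.exp(np.where(z < -708.0, -np.inf, z))
```

Every value in `fast` is then either exactly zero or a normal float, at the cost of an extra comparison pass over the array.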

Maxim Egorushkin