
Why is there such a large speed difference between the following L2 norm calculations:

a = np.arange(1200.0).reshape((-1,3))

%timeit [np.sqrt((a*a).sum(axis=1))]
100000 loops, best of 3: 12 µs per loop

%timeit [np.sqrt(np.dot(x,x)) for x in a]
1000 loops, best of 3: 814 µs per loop

%timeit [np.linalg.norm(x) for x in a]
100 loops, best of 3: 2 ms per loop

All three produce identical results as far as I can see.
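Here's a quick sanity check (the v1/v2/v3 names are just for illustration) confirming that the three agree to within floating-point tolerance:

import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# Row-wise L2 norms computed the three ways from the question
v1 = np.sqrt((a * a).sum(axis=1))
v2 = np.array([np.sqrt(np.dot(x, x)) for x in a])
v3 = np.array([np.linalg.norm(x) for x in a])

print(np.allclose(v1, v2) and np.allclose(v2, v3))  # True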

Here's the relevant part of the source code for the numpy.linalg.norm function:

x = asarray(x)

# Check the default case first and handle it immediately.
if ord is None and axis is None:
    x = x.ravel(order='K')
    if isComplexType(x.dtype.type):
        sqnorm = dot(x.real, x.real) + dot(x.imag, x.imag)
    else:
        sqnorm = dot(x, x)
    return sqrt(sqnorm)
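Note that this default path ravels the input, so calling np.linalg.norm on the 2-D array directly (without axis=) gives a single scalar rather than per-row norms. For example:

import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# Without axis=, the default path above flattens the array and returns
# one scalar: the Frobenius norm of the whole matrix
print(np.linalg.norm(a))

# With axis=1 you get one norm per row, shape (400,)
print(np.linalg.norm(a, axis=1))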

EDIT: Someone suggested that one version could be parallelized, but I checked and that's not the case. All three versions consume 12.5% of CPU, i.e. one of the 8 virtual cores (as is usually the case with Python code on my 4 physical / 8 virtual core Xeon CPU).
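One way to double-check the threading behaviour is the third-party threadpoolctl package, which reports the thread pools of whatever BLAS library NumPy is linked against. A minimal sketch, assuming the package is installed:

# Assumes the third-party threadpoolctl package is installed
# (pip install threadpoolctl)
from threadpoolctl import threadpool_info

# Each entry describes one native thread pool (e.g. OpenBLAS or MKL)
for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"])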

MichaelSB
    A couple more to time: `[math.sqrt(np.dot(x,x)) for x in a]`, `np.sqrt(np.einsum('ij,ij->i',a,a))` – hpaulj Aug 18 '15 at 23:35
  • The main difference is in what is being done in interpreted Python code, and what is done in compiled C code. – hpaulj Aug 18 '15 at 23:38
  • One thing I noticed is that the first method gives results with much smaller precision than the others. For example the final number produced by the first method is 2074.9973494 while that of the last two is 2074.9973493958973. –  Aug 18 '15 at 23:38
  • Tris Nefzger, I checked the dtype of the results; in all three cases it's float64. – MichaelSB Aug 18 '15 at 23:52
  • Lists and arrays have different rules for displaying the significant figures of floats, so the display doesn't tell you much about the float type (illustrated below). – hpaulj Aug 19 '15 at 01:43
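To illustrate hpaulj's point about display rules: the same float64 value prints differently inside a Python list than inside a NumPy array. A minimal sketch using the value from the comments:

import numpy as np

x = 2074.9973493958973

# The list repr shows the full round-trip precision of the float64
print([x])            # [2074.9973493958973]

# The array repr rounds to the default print precision for display only;
# the stored value is identical
print(np.array([x]))  # [ 2074.9973494]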

1 Answer


np.dot will usually call a BLAS library function - its speed will thus depend on which BLAS library your version of numpy is linked against. In general I would expect it to have a greater constant overhead, but to scale much better as the array size increases. However, the fact that you are calling it from within a list comprehension (effectively a normal Python for loop) will likely negate any performance benefits of using BLAS.
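You can check which BLAS your NumPy build is linked against via the build configuration:

import numpy as np

# Prints the BLAS/LAPACK libraries this NumPy build was compiled against
# (e.g. OpenBLAS, MKL, ATLAS, or the unoptimized reference implementation)
np.show_config()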

If you get rid of the list comprehension and use the axis= kwarg, np.linalg.norm is comparable to your first example, but np.einsum is much faster than both:

In [1]: %timeit np.sqrt((a*a).sum(axis=1))
The slowest run took 10.12 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 11.1 µs per loop

In [2]: %timeit np.linalg.norm(a, axis=1)
The slowest run took 14.63 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 13.5 µs per loop

# this is what np.linalg.norm does internally
In [3]: %timeit np.sqrt(np.add.reduce(a * a, axis=1))
The slowest run took 34.05 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 10.7 µs per loop

In [4]: %timeit np.sqrt(np.einsum('ij,ij->i',a,a))
The slowest run took 5.55 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 5.42 µs per loop
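For clarity: the 'ij,ij->i' subscripts tell einsum to multiply a elementwise with itself and sum over the second axis, i.e. the row-wise sum of squares, in a single pass without materializing the intermediate a * a array. A quick equivalence check:

import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# Row-wise sum of squares in one pass, then the square root
norms = np.sqrt(np.einsum('ij,ij->i', a, a))

print(np.allclose(norms, np.linalg.norm(a, axis=1)))  # True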
ali_m
  • I'm curious, what changes under the hood when we replace the Python list comprehension with the axis=1 argument? Or, why is a Python "for loop" slower than a C "for loop"? – MichaelSB Aug 19 '15 at 22:22
  • To answer your first question, when you use vectorization (i.e. `axis=1` rather than the list comp) numpy essentially does the loop over array elements at the level of C code rather than in Python (see the sketch after these comments). As for why looping in C is faster than Python... that's quite a big question. The simple answer is that slower runtime performance is the price you pay in exchange for Python's nice high-level language features like type checking, automatic garbage collection etc. (see [this Q/A](http://stackoverflow.com/q/3033329/1461210) for a bit more detail). – ali_m Aug 19 '15 at 22:36
  • By vectorization do you mean Intel AVX? Regarding Python vs C, it's not very clear to me: is Python bytecode still doing all that checking/garbage collection/etc, or is it only done once during the compilation to the bytecode? If the bytecode does not have that overhead, what makes it slower than the native machine code? – MichaelSB Aug 20 '15 at 00:53
  • In numpy-land we tend to use "vectorization" loosely to refer to [array programming](https://en.wikipedia.org/wiki/Array_programming). The bytecode produced by the Python compiler is not the same sort of thing as the native instructions that C etc. compile down to; rather, it's a set of instructions for a virtual machine which does the type checking, garbage collection etc. at runtime. There are other Python implementations besides the standard CPython that attempt various runtime optimizations (for example you may have heard of PyPy, which uses an optimizing JIT compiler). – ali_m Aug 20 '15 at 20:22
  • This all goes *way* beyond the scope of what I can reasonably answer within the comments on an SE question. You can find some more details about the CPython internals [here](http://akaptur.com/blog/categories/python-internals/) and [here](http://pgbovine.net/cpython-internals.htm). – ali_m Aug 20 '15 at 20:25
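To make the interpreted-versus-compiled distinction above concrete, here's a minimal sketch (a timing harness only, making no claims about exact numbers) contrasting a per-row Python loop with a single vectorized call on the array from the question:

import numpy as np
from timeit import timeit

a = np.arange(1200.0).reshape((-1, 3))

# ~400 iterations of interpreted bytecode, each dispatching its own np.dot
def python_loop():
    return [np.sqrt(np.dot(x, x)) for x in a]

# One call; the per-row loop runs inside compiled C code
def vectorized():
    return np.sqrt(np.einsum('ij,ij->i', a, a))

print(timeit(python_loop, number=1000))
print(timeit(vectorized, number=1000))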