performance of NumPy with different BLAS implementations

Question

I'm running an algorithm that is implemented in Python and uses NumPy. The most computationally expensive part of the algorithm involves solving a set of linear systems (i.e. a call to numpy.linalg.solve()). I came up with this small benchmark:

import numpy as np
import time

# Create two large random matrices
a = np.random.randn(5000, 5000)
b = np.random.randn(5000, 5000)

t1 = time.time()
# That's the expensive call:
np.linalg.solve(a, b)
print(time.time() - t1)

I've been running this on:

My laptop, a late 2013 MacBook Pro 15" with 4 cores at 2GHz (sysctl -n machdep.cpu.brand_string gives me Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz)
An Amazon EC2 c3.xlarge instance, with 4 vCPUs. Amazon advertises them as "High Frequency Intel Xeon E5-2680 v2 (Ivy Bridge) Processors"

Bottom line:

On the Mac it runs in ~4.5 seconds
On the EC2 instance it runs in ~19.5 seconds

I have tried it also on other OpenBLAS / Intel MKL based setups, and the runtime is always comparable to what I get on the EC2 instance (modulo the hardware config.)

Can anyone explain why the performance on Mac (with the Accelerate Framework) is > 4x better? More details about the NumPy / BLAS setup in each are provided below.

Laptop setup

numpy.show_config() gives me:

atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

EC2 instance setup

On Ubuntu 14.04, I installed OpenBLAS with

sudo apt-get install libopenblas-base libopenblas-dev

When installing NumPy, I created a site.cfg with the following contents:

[default]
library_dirs= /usr/lib/openblas-base

[atlas]
atlas_libs = openblas

numpy.show_config() gives me:

atlas_threads_info:
    libraries = ['lapack', 'openblas']
    library_dirs = ['/usr/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = f77
    include_dirs = ['/usr/include/atlas']
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'openblas']
    library_dirs = ['/usr/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = f77
    include_dirs = ['/usr/include/atlas']
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

Haswell has 2x the raw compute of Ivybridge per cycle per core (due to inclusion of FMA). I wonder if your openblas was built without AVX support enabled? That would give another 2x. — Stephen Canon, Oct 23 '14 at 13:06
Sounds like it might be related to [this](http://stackoverflow.com/q/25346036/1461210). Can you check whether your EC2 instance is actually multithreading BLAS operations? — ali_m, Dec 19 '14 at 20:56

score 3 · Answer 1 · answered Jan 07 '15 at 01:14

The reason for this behavior could be that Accelerate uses multithreading, while the others don't.

Most BLAS implementations read the environment variable OMP_NUM_THREADS to determine how many threads to use. I believe they use only 1 thread unless explicitly told otherwise. Accelerate's man page, however, makes it sound like threading is turned on by default; it can be turned off by setting the environment variable VECLIB_MAXIMUM_THREADS.

To determine if this is really what's happening, try

export VECLIB_MAXIMUM_THREADS=1

before calling the Accelerate version, and

export OMP_NUM_THREADS=4

for the other versions.

Independent of whether this is really the reason, it's a good idea to always set these variables when you use BLAS to be sure you control what is going on.
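The same check can also be driven from Python. What follows is only a rough sketch, not a definitive recipe: it assumes the BLAS library reads these variables when it is loaded, so each setting is tried in a fresh interpreter instead of changing os.environ after NumPy has been imported. The names thread_env, run_benchmark and BENCHMARK are made up for this illustration, and the matrix size is reduced from the 5000 used in the question so that a run finishes quickly.

```python
import os
import subprocess
import sys

# The benchmark from the question, scaled down. n = 2000 is an arbitrary
# choice for this sketch; the question used 5000.
BENCHMARK = """
import time
import numpy as np
n = 2000
a = np.random.randn(n, n)
b = np.random.randn(n, n)
t1 = time.time()
np.linalg.solve(a, b)
print(time.time() - t1)
"""


def thread_env(num_threads, base_env=None):
    """Return a copy of the environment with both thread variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Read by OpenMP-based BLAS builds (assumed to apply to the OpenBLAS setup).
    env["OMP_NUM_THREADS"] = str(num_threads)
    # Read by Accelerate / vecLib on the Mac.
    env["VECLIB_MAXIMUM_THREADS"] = str(num_threads)
    return env


def run_benchmark(num_threads):
    """Run the benchmark in a fresh interpreter and return elapsed seconds."""
    out = subprocess.check_output(
        [sys.executable, "-c", BENCHMARK], env=thread_env(num_threads)
    )
    return float(out.decode().strip())
```

Calling run_benchmark(1) and run_benchmark(4) on each machine and comparing the two timings should show whether the thread count is what separates the setups. If the timings barely change on one machine, the BLAS there is probably not honoring the variable.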

answered Jan 07 '15 at 01:14

Elmar Peise


Linked to Accelerate, `VECLIB_MAXIMUM_THREADS` does affect `numpy.linalg.norm`'s performance. `scipy.linalg.norm` on the other hand is consistently slower and not affected by the variable which leads me to believe that it's not linked to Accelerate but instead uses reference LAPACK. – Elmar Peise Jan 03 '17 at 09:36
Thanks Elmar. Fwiw, [scipy.linalg.norm](https://github.com/scipy/scipy/blob/master/scipy/linalg/misc.py) does `if ord in (None, 2) and (a.ndim == 1): nrm2 = get_blas_funcs('nrm2')`; `norm` in [numpy.linalg.linalg](https://github.com/numpy/numpy/blob/master/numpy/linalg/linalg.py) says "# Immediately handle some default, simple, fast, and common cases". Altogether, too complex. – denis Jan 08 '17 at 11:19
