
I have successfully used Armadillo coupled with OpenBLAS for my master's thesis on Ubuntu 14.04 64-bit (both with Armadillo installed system-wide and used without installation). The performance was very impressive - my code consisted mainly of basic matrix operations, and all of them were carried out using all available threads.

Now I am trying to use Armadillo with OpenBLAS on a Windows 7 64-bit machine in Visual Studio 2013. I found some help online and successfully added the pthread library. The code itself works, but the performance is poor. I tested three basic operations on 1000x1000 matrices - addition, multiplication and element-wise multiplication. Out of these three, only classical matrix multiplication uses all the CPU power. The other two use 25% of the CPU, which indicates they run on a single thread.

I did not encounter this behavior on Ubuntu. Does anyone have a suggestion? I haven't found any post where someone had a similar issue.
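For reference, here is a minimal sketch of the kind of benchmark I mean (the matrix size, the randu initialisation and the wall_clock timing are illustrative, not my exact test code):

```cpp
// Minimal timing sketch for the three operations discussed above.
#include <armadillo>
#include <iostream>

int main()
{
    arma::mat A = arma::randu<arma::mat>(1000, 1000);
    arma::mat B = arma::randu<arma::mat>(1000, 1000);
    arma::mat C;
    arma::wall_clock timer;

    timer.tic();
    C = A + B;   // element-wise addition, O(n^2) work
    std::cout << "addition:        " << timer.toc() << " s\n";

    timer.tic();
    C = A * B;   // matrix multiplication, dispatched to BLAS (GEMM), O(n^3) work
    std::cout << "multiplication:  " << timer.toc() << " s\n";

    timer.tic();
    C = A % B;   // element-wise (Schur) multiplication, O(n^2) work
    std::cout << "element-wise:    " << timer.toc() << " s\n";

    return 0;
}
```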

Jan91
  • As janneb points out, linear-time algorithms like matrix addition and point-wise multiplication are typically bandwidth-limited, because the number of arithmetic operations is on the same order as the number of I/O operations (loads and stores). What you call classical multiplication (matrix multiply, or GEMM in BLAS-speak) is an O(n^3) operation, so there's plenty of meat there to get multiple processors working on the problem without having the I/O dominate the time. – P. Hinker Oct 18 '15 at 04:07
  • Thank you for your additional input, you made things a little clearer for me. – Jan91 Oct 19 '15 at 08:20

1 Answer


Are you sure that OpenBLAS is using multiple threads on Ubuntu for addition and element-wise multiplication? Intuitively I'd expect those operations to be BW-limited rather than FPU-limited, so I'd guess multithreading wouldn't help that much?
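As a rough back-of-the-envelope check (my own ballpark figures, assuming 8-byte doubles and ignoring cache reuse), the arithmetic intensity of the two kinds of operation differs by three orders of magnitude, which is why addition gains little from extra threads:

```cpp
// Back-of-the-envelope arithmetic intensity for 1000x1000 double matrices.
#include <cstdio>

int main()
{
    const double n     = 1000.0;
    const double bytes = 3.0 * n * n * 8.0;       // read A, read B, write C: ~24 MB moved

    const double add_flops  = n * n;              // one add per element
    const double gemm_flops = 2.0 * n * n * n;    // ~2*n^3 for matrix multiply

    std::printf("addition: %.3f flops/byte\n", add_flops  / bytes);  // ~0.04 -> memory-bound
    std::printf("GEMM:     %.1f flops/byte\n", gemm_flops / bytes);  // ~83   -> compute-bound
    return 0;
}
```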

janneb
  • I have to say I had the impression that these operations ran on all threads, since my master's thesis program used 100% CPU all the time and it consisted mainly of matrix additions and element-wise multiplications. Testing these operations separately, I found out that they indeed use only one thread. I will try to make them parallel via OpenMP (see the sketch after these comments) and see if I can get faster code, just out of curiosity. – Jan91 Oct 17 '15 at 13:08
  • You were right, multithreading doesn't help. Since I am quite new to these libraries (I used to work mostly in Matlab and did not care about these things that much), I still have a lot to learn. Thank you for your help. – Jan91 Oct 19 '15 at 08:22
  • @janneb Can you tell me what BW-limited and FPU-limited mean? – CKM Dec 27 '15 at 08:31
  • BW-limited - limited by memory bandwidth; FPU-limited - limited by the floating point unit of the CPU, i.e. by the speed of floating point operations. – PKua May 20 '20 at 16:31
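The OpenMP experiment mentioned in the comments above was roughly along these lines (a hand-rolled sketch, not the exact code; the function name and loop form are only illustrative). Because element-wise work is bandwidth-limited, adding threads here brings little or no speed-up:

```cpp
// Sketch: hand-parallelised element-wise addition of two Armadillo matrices.
#include <armadillo>
#include <omp.h>

arma::mat parallel_add(const arma::mat& A, const arma::mat& B)
{
    arma::mat C(A.n_rows, A.n_cols);
    const long long n = static_cast<long long>(A.n_elem);

    // Split the flat element range across threads; memory bandwidth,
    // not the thread count, ends up being the bottleneck.
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i)
        C(i) = A(i) + B(i);

    return C;
}
```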