
I have successfully used Armadillo coupled with OpenBLAS for my master's thesis on Ubuntu 14.04 64-bit (both with Armadillo installed system-wide and used without installation). The performance was very impressive - my code consisted mainly of basic matrix operations, and all of them were carried out using all available threads.

Now I am trying to use Armadillo with OpenBLAS on a Windows 7 64-bit machine in Visual Studio 2013. I found some help online and successfully added the pthread library. The code itself works, but the performance is poor. I tested three basic operations on 1000x1000 matrices - addition, multiplication and element-wise multiplication. Out of these three, only classical matrix multiplication uses all the CPU power. The other two use 25% of the CPU, which indicates they run on a single thread.

I did not encounter this behavior on Ubuntu. Does anyone have a suggestion? I haven't found any post where someone had a similar issue.
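For reference, here is a minimal sketch of the kind of benchmark I mean (the matrix size, the randu initialisation and the wall_clock timing are illustrative, not my exact test code):

```cpp
// Minimal timing sketch for the three operations discussed above.
#include <armadillo>
#include <iostream>

int main()
{
    arma::mat A = arma::randu<arma::mat>(1000, 1000);
    arma::mat B = arma::randu<arma::mat>(1000, 1000);
    arma::mat C;
    arma::wall_clock timer;

    timer.tic();
    C = A + B;   // element-wise addition, O(n^2) work
    std::cout << "addition:        " << timer.toc() << " s\n";

    timer.tic();
    C = A * B;   // matrix multiplication, dispatched to BLAS (GEMM), O(n^3) work
    std::cout << "multiplication:  " << timer.toc() << " s\n";

    timer.tic();
    C = A % B;   // element-wise (Schur) multiplication, O(n^2) work
    std::cout << "element-wise:    " << timer.toc() << " s\n";

    return 0;
}
```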

Jan91
  • As janneb points out, linear-time algorithms like matrix addition and point-wise multiplication are typically bandwidth-limited, because the number of arithmetic operations is on the same order as the number of I/O operations (loads and stores). What you call classical multiplication (matrix multiply, or GEMM in BLAS-speak) is an O(n^3) operation, so there's plenty of meat there to get multiple processors working on the problem without having the I/O dominate the time. – P. Hinker Oct 18 '15 at 04:07
  • Thank you for your additional input, you made things a little clearer for me. – Jan91 Oct 19 '15 at 08:20

1 Answer


Are you sure that OpenBLAS is using multiple threads on Ubuntu for addition and element-wise multiplication? Intuitively I'd expect those operations to be BW-limited rather than FPU-limited, so I'd guess multithreading wouldn't help that much?
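As a rough back-of-the-envelope check (my own ballpark figures, assuming 8-byte doubles and ignoring cache reuse), the arithmetic intensity of the two kinds of operation differs by three orders of magnitude, which is why addition gains little from extra threads:

```cpp
// Back-of-the-envelope arithmetic intensity for 1000x1000 double matrices.
#include <cstdio>

int main()
{
    const double n     = 1000.0;
    const double bytes = 3.0 * n * n * 8.0;       // read A, read B, write C: ~24 MB moved

    const double add_flops  = n * n;              // one add per element
    const double gemm_flops = 2.0 * n * n * n;    // ~2*n^3 for matrix multiply

    std::printf("addition: %.3f flops/byte\n", add_flops  / bytes);  // ~0.04 -> memory-bound
    std::printf("GEMM:     %.1f flops/byte\n", gemm_flops / bytes);  // ~83   -> compute-bound
    return 0;
}
```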

janneb
  • I have to say I had the impression that these operations ran on all threads, since my master's thesis program used 100% CPU all the time and it consisted mainly of matrix additions and element-wise multiplications. Testing these operations separately, I found out that they indeed use only one thread. I will try to make them parallel via OpenMP (see the sketch after these comments) and see if I can get faster code, just out of curiosity. – Jan91 Oct 17 '15 at 13:08
  • You were right, multithreading doesn't help. Since I am quite new to these libraries (I used to work mostly in Matlab and did not care about these things that much), I still have a lot to learn. Thank you for your help. – Jan91 Oct 19 '15 at 08:22
  • @janneb Can you tell me what BW-limited and FPU-limited mean? – CKM Dec 27 '15 at 08:31
  • BW-limited - limited by memory bandwidth; FPU-limited - limited by the floating point unit of the CPU, i.e. by the speed of floating point operations. – PKua May 20 '20 at 16:31
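The OpenMP experiment mentioned in the comments above was roughly along these lines (a hand-rolled sketch, not the exact code; the function name and loop form are only illustrative). Because element-wise work is bandwidth-limited, adding threads here brings little or no speed-up:

```cpp
// Sketch: hand-parallelised element-wise addition of two Armadillo matrices.
#include <armadillo>
#include <omp.h>

arma::mat parallel_add(const arma::mat& A, const arma::mat& B)
{
    arma::mat C(A.n_rows, A.n_cols);
    const long long n = static_cast<long long>(A.n_elem);

    // Split the flat element range across threads; memory bandwidth,
    // not the thread count, ends up being the bottleneck.
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i)
        C(i) = A(i) + B(i);

    return C;
}
```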