
I am trying to speed up my code using Python's multiprocessing.Pool class, but I am not getting the speed-up that I would logically expect.

The main computation involves calculating matrix-vector products for a large number of vectors with a single fixed, large sparse matrix. Below is a toy example that does what I need, but with random data.

import time
import numpy as np
import scipy.sparse as sp

def calculate(vector, matrix=None):
    # repeat the matrix-vector product 50 times to mimic the real workload
    for i in range(50):
        v = matrix.dot(vector)
    return v

if __name__ == '__main__':
    N = int(1e6)
    matrix = sp.rand(N, N, density=1e-5, format='csr')
    t = time.time()
    res = []
    for i in range(10):
        res.append(calculate(np.random.rand(N), matrix=matrix))
    print(time.time() - t)

This script finishes in about 30 seconds.

Now, since the calculation of each result does not depend on any of the other results, it is natural to think that parallel computation will speed things up. The idea is to create 4 processes, and if each does a share of the calculations, the time it takes for all of them to complete should decrease by a factor of roughly 4. To do this, I wrote the following code:

import time
import numpy as np
import scipy.sparse as sp
from multiprocessing import Pool
from functools import partial

def calculate(vector, matrix=None):
    # repeat the matrix-vector product 50 times to mimic the real workload
    for i in range(50):
        v = matrix.dot(vector)
    return v

if __name__ == '__main__':
    N = int(1e6)
    matrix = sp.rand(N, N, density=1e-5, format='csr')

    t = time.time()
    input = []
    for i in range(10):
        input.append(np.random.rand(N))
    # bind the fixed matrix so the pool only has to map over the vectors
    mp = partial(calculate, matrix=matrix)
    p = Pool(4)
    res = p.map(mp, input)
    print(time.time() - t)

My problem is that this code takes slightly over 20 seconds to run, so I did not even get a factor-of-2 improvement! Even worse, the performance does not improve further if the pool contains 8 processes. Any idea why the speed-up is not happening?


Note: my actual method takes much longer, and the input vectors are stored in a file. If I split the file into 4 pieces and then manually run my script in a separate process for each piece, each process finishes about four times as quickly as it would for the whole file (as expected). I am confused why this speed-up, which is obviously possible, is not happening with multiprocessing.Pool.
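For what it's worth, that manual four-way split amounts roughly to the sketch below, done in memory with multiprocessing.Process instead of separate script invocations (the chunking here is only for illustration; my real script reads the vectors from a file, and no results are collected in this sketch):

import numpy as np
import scipy.sparse as sp
from multiprocessing import Process

def calculate(vector, matrix=None):
    for i in range(50):
        v = matrix.dot(vector)
    return v

def worker(vectors, matrix):
    # each process handles its own share of the vectors, independently of the others
    return [calculate(vector, matrix=matrix) for vector in vectors]

if __name__ == '__main__':
    N = int(1e6)
    matrix = sp.rand(N, N, density=1e-5, format='csr')
    vectors = [np.random.rand(N) for i in range(10)]

    # give each of 4 processes a slice of the vectors, mimicking the manual 4-way file split
    chunks = [vectors[i::4] for i in range(4)]
    processes = [Process(target=worker, args=(chunk, matrix)) for chunk in chunks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()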


Edit: I have just found the question Multiprocessing.Pool makes Numpy matrix multiplication slower, which may be related. I still have to check, though.

  • Question: How many physical (not hyperthreaded) CPU cores does the system you are running this on have? – Klaus D. Oct 16 '14 at 08:16
  • @KlausD. Physically, I have `4` cores. That is why I manually split the file into `4`, not `8` pieces. – 5xum Oct 16 '14 at 08:21
  • If you put some `time.time()` benchmarks inside your `calculate` method, you'll see the 50 `dot` calls take nearly 4 times longer than they do in the non-parallel case. It's not clear to me why, because tools like `top` make it look like the non-parallel case is only using one CPU fully, whereas the parallel case looks like 4 CPUs are being fully used. – Amit Kumar Gupta Oct 16 '14 at 09:17
  • @AmitKumarGupta Indeed, there seems to be something strange going on with `numpy` when I use multiple processes. – 5xum Oct 16 '14 at 09:19
  • We can take `numpy` out of the equation and benchmark. Here's a pastebin with parallel and serial implementations that just do a bunch of arithmetic: http://pastebin.com/B3M6GZb8. To make them more similar, the serial version uses the built-in `map` to contrast with the `p.map` in the parallel case. On my machine (with 4 CPUs), the serial case takes about `1.8s` per calculate for a total of about `22.8s`. The parallel cases for 1-4 workers take on average `1.75s`, `2.6s`, `2.6s` and `3.3s` per calculate, respectively, for totals of about `21s`, `16s`, `11.0s`, and `11.3s`. – Amit Kumar Gupta Oct 16 '14 at 09:41
  • What platform are you on? I know that multiprocessing under Windows is a bit of a hack because there isn't a proper way to fork a process. My colleagues and I ended up ditching Python's multiprocessing library and rolled our own using ZMQ sockets to communicate. We aren't passing large numpy arrays around, though, so your mileage may vary. In general, though, launching explicit subprocesses rather than "forking" with the `multiprocessing` library has worked better for us. – three_pineapples Oct 16 '14 at 10:20
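A minimal sketch of the instrumentation suggested in the comments above, i.e. timing the 50 `dot` calls inside `calculate` (the timing lines are diagnostic only and not part of the original code):

import time

def calculate(vector, matrix=None):
    start = time.time()
    for i in range(50):
        v = matrix.dot(vector)
    # compare this per-call time between the serial run and the Pool run
    print('calculate took %.2f s' % (time.time() - start))
    return v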

1 Answer


Try:

p = Pool(4)
for i in range(10):
    input = np.random.rand(N)
    # schedule calculate(input, matrix) to run in one of the pool's worker processes
    p.apply_async(calculate, args=(input, matrix))

p.close()
p.join()  # wait for all submitted tasks to complete

I suspect that the "partial" object and map are resulting in blocking behavior (though I have never used partial, so I'm not familiar with it).

"apply_async" (or "map_async") are multiprocessing methods that specifically do not block - (see: Python multiprocessing.Pool: when to use apply, apply_async or map?)

Generally, for "embarrassingly parallel problems" like this, apply_async works for me.

EDIT:

I tend to write results to a MySQL database when I'm done, so the implementation I provided doesn't collect the results and won't work if that's not your approach. "map" is probably the right answer if you want to use the order of the list as your way of tracking which entry is which, but I remain suspicious of the "partial" objects.
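If you do want the results back in the parent process, and in submission order, one possible sketch (again assuming the `calculate`, `matrix` and `N` from your question are already defined) is to keep the AsyncResult objects that apply_async returns and call get() on them:

p = Pool(4)
handles = []
for i in range(10):
    input = np.random.rand(N)
    # apply_async returns an AsyncResult immediately; keep it so the value can be fetched later
    handles.append(p.apply_async(calculate, args=(input, matrix)))

p.close()
p.join()

# get() blocks until each result is ready; iterating in submission order preserves the order
res = [h.get() for h in handles]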
