I have a huge list of elements that needs to be processed. I know this can be done with Process from multiprocessing:

pr1 = Process(target=calculation_function, args=(args,))
pr1.start()
pr1.join()

so I can create, let's say, 10 processes and give each one a tenth of the arguments. Then the job is done.

But I don't want to create the processes and split the work manually. Instead I want to use ProcessPoolExecutor, and I am doing it like this:

executor = ProcessPoolExecutor(max_workers=10)
executor.map(calculation, (list_to_process,))

calculation is my function that does the job.

def calculation(list_to_process):
    for element in list_to_process:
        ...  # doing the job

list_to_process is my list to be processed.

But when I run this code, the loop over the list runs just once. I thought that

executor = ProcessPoolExecutor(max_workers=10)
executor.map(calculation, (list_to_process,))

is the same as doing this 10 times:

pr1 = Process(target=calculation, args=(list_to_process,))
pr1.start()
pr1.join()

But that seems to be wrong.

How can I achieve real multiprocessing with ProcessPoolExecutor?

John

1 Answer

Remove the for loop from your calculation function. Now that you're using ProcessPoolExecutor.map, that map() call is your loop, the difference being that each element in the list is handed to one of the pool's worker processes. For example:

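import os
import time
from concurrent.futures import ProcessPoolExecutor

def calculation(item):
    # Runs in a worker process and receives one element of the list per call.
    print('[pid:%s] performing calculation on %s' % (os.getpid(), item))
    time.sleep(5)
    print('[pid:%s] done!' % os.getpid())
    return item ** 2

# When run as a script on Windows or macOS, put the lines below
# under `if __name__ == '__main__':`.
executor = ProcessPoolExecutor(max_workers=5)
list_to_process = range(10)
# map() submits one calculation(item) call per element.
result = executor.map(calculation, list_to_process)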

You'll see something in the terminal like:

[pid:23988] performing calculation on 0
[pid:10360] performing calculation on 1
[pid:13348] performing calculation on 2
[pid:24032] performing calculation on 3
[pid:18028] performing calculation on 4
[pid:23988] done!
[pid:23988] performing calculation on 5
[pid:10360] done!
[pid:13348] done!
[pid:10360] performing calculation on 6
[pid:13348] performing calculation on 7
[pid:18028] done!
[pid:24032] done!
[pid:18028] performing calculation on 8
[pid:24032] performing calculation on 9
[pid:23988] done!
[pid:10360] done!
[pid:13348] done!
[pid:18028] done!
[pid:24032] done!

The exact order of these events will be effectively random. The return value of map() is a lazy iterator rather than a list (in my Python version it is actually an itertools.chain object, but that's an implementation detail). You can get the results as a list, in the same order as the inputs, like this:

>>> list(result)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In your example code you instead passed a single-element tuple, (list_to_process,), so map() makes just one call and hands your full list to a single process.
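To make the difference concrete, here is a minimal sketch that shows what each form passes to the function. The helper name `describe` is made up for illustration, and the sketch uses the built-in `map`, which pairs arguments with calls the same way `executor.map` does, so it runs without starting any processes.

```python
def describe(arg):
    # Hypothetical helper: report the type of what a single call receives.
    return type(arg).__name__

list_to_process = [1, 2, 3, 4]

# A one-element tuple: map() iterates over the tuple, so there is exactly
# one call, and that call receives the whole list.
one_call = list(map(describe, (list_to_process,)))

# The list itself: map() iterates over the list, one call per element.
many_calls = list(map(describe, list_to_process))

print(one_call)    # ['list']
print(many_calls)  # ['int', 'int', 'int', 'int']
```

With a process pool the first form therefore gives one worker all the work, while the second spreads the elements across the workers.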

Iguananaut
  • Thank you for your reply! I do not fully understand. Where should the iteration over list_to_process be? So must I use one element from my list in a for loop? – John Oct 21 '17 at 13:58
  • @John nowhere, `executor.map` already iterates over each element in the list and passes it as the argument to the calculation function – Yaroslav Surzhikov Oct 21 '17 at 13:59
  • As I explained, the iteration is being performed by `ProcessPoolExecutor.map()`. This is basically equivalent to: `for item in list_to_process: calculation(item)`, except that `calculation` may be called in a different process for each item. – Iguananaut Oct 21 '17 at 14:00
  • Play around with the [`map`](https://docs.python.org/3/library/functions.html#map) built-in function and make sure you understand how that works. `ProcessPoolExecutor.map` is doing the same thing, but with each calculation being farmed out to a different process, and then the results gathered up in the correct order. – Iguananaut Oct 21 '17 at 14:01
  • What do you mean by "no iteration appears"? If your processes are all running then they're all producing results. If you want the final result you need to assign the return value of `executor.map` to a variable. I think the return value itself is an iterable type so you may have to wrap it in `list()` to get an actual `list` object. – Iguananaut Oct 21 '17 at 14:27
  • Thanks a ton for this one! Helped me big time. Also, a note for anyone in a similar situation: when `map()` is given multiple iterables, iteration stops when the shortest iterable is exhausted. So, if you have an argument that is going to be constant for all calls, you'll need to refer to this one: https://stackoverflow.com/a/10834984/2408212 – Xonshiz Dec 07 '19 at 10:01
  • Something I also might have noted is that it's OK in some cases to have an inner loop in your mapped function as well. With multiprocessing one also has to consider the overhead involved in interprocess communication, of sending inputs to processes and retrieving results. Sometimes it can be more efficient to send arguments in batches (where your main process handles the batching). There's no hard and fast rule to this though and it may require experimentation. – Iguananaut Dec 08 '19 at 11:07
  • Typically you will find that as the number of processes scales up you'll get a non-linear efficiency due in part to this overhead, but you can improve the efficiency at higher numbers of processes with batching. If you're going to be doing a very large scale multiprocessing calculation it's good to experiment and try to optimize such hyperparameters as batch size. – Iguananaut Dec 08 '19 at 11:11
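
The last few comments raise two practical points: passing an argument that stays constant for every call, and sending work in batches prepared by the main process. A minimal sketch of both follows. The names `scale`, `scale_batch`, `make_batches` and `run` are hypothetical, and `functools.partial` is only one way to bind a constant argument, not necessarily the approach taken in the linked answer.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def scale(item, factor):
    # Hypothetical per-item work; `factor` is the same for every call.
    return item * factor


def scale_batch(batch, factor):
    # An inner loop is fine here: one task handles a whole batch.
    return [scale(item, factor) for item in batch]


def make_batches(items, batch_size):
    # Split a list into consecutive slices of at most batch_size items.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def run(items, factor, batch_size, max_workers=4):
    # partial() binds the constant argument, so map() only has to
    # iterate over the batches.
    work = partial(scale_batch, factor=factor)
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # map() yields the batch results in the order the batches were given.
        for batch_result in executor.map(work, make_batches(items, batch_size)):
            results.extend(batch_result)
    return results


if __name__ == '__main__':
    print(run(list(range(10)), factor=3, batch_size=4))
```

`ProcessPoolExecutor.map` also accepts a `chunksize` argument that groups items into batches for you. Which batch size works best depends on the workload, so it is worth measuring a few values.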