
How do I find the optimal chunk size for multiprocessing.Pool instances?

I used this before to create a generator of n sudoku objects:

import multiprocessing

processes = multiprocessing.cpu_count()
worker_pool = multiprocessing.Pool(processes)
# the third positional argument of imap_unordered is the chunksize
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)

To measure the time, I call time.time() before the snippet above, then I initialize the pool as described, then I convert the generator into a list (list(sudokus)) to force generating all the items (only for the time measurement; I know this is pointless in the final program), and then I take the time using time.time() again and output the difference.
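For concreteness, a minimal self-contained version of this measurement looks roughly like the following sketch (create_sudoku is just a trivial placeholder here, and n is an arbitrary value):

import multiprocessing
import time

def create_sudoku(i):
    return i  # trivial placeholder for the real sudoku generator

if __name__ == "__main__":
    n = 10000
    processes = multiprocessing.cpu_count()
    worker_pool = multiprocessing.Pool(processes)
    start = time.time()
    # list() forces the generator to produce all items, only for the measurement
    list(worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1))
    elapsed = time.time() - start
    worker_pool.close()
    worker_pool.join()
    print("%.3f ms per object" % (elapsed / n * 1000))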

I observed that the chunk size of n // processes + 1 results in times of around 0.425 ms per object. But I also observed that the CPU is only fully loaded during the first half of the run; towards the end, the usage drops to about 25% (on an i3 with 2 cores and hyper-threading).

If I use a smaller chunk size of int(n // (processes**2) + 1) instead, I get times of around 0.355 ms, and the CPU load is distributed much better. It just has some small dips to about 75%, but stays high for a much longer part of the run before it goes down to 25%.

Is there an even better formula to calculate the chunk size, or an otherwise better method to use the CPU most effectively? Please help me improve this multiprocessing pool's effectiveness.

Byte Commander
  • Without the context to make good decisions, our only suggestions can be to time your options and do something sensible. – Veedrac Jan 25 '16 at 10:17

2 Answers


This answer provides a high-level overview.

Going into details, each worker is sent a chunk of chunksize tasks at a time for processing. Every time a worker completes that chunk, it needs to ask for more input via some type of inter-process communication (IPC), such as queue.Queue. Each IPC request requires a system call; due to the context switch it costs anywhere in the range of 1-10 μs, let's say 10 μs. Due to shared caching, a context switch may hurt (to a limited extent) all cores. So, extremely pessimistically, let's estimate the maximum possible cost of an IPC request at 100 μs.

You want the IPC overhead to be immaterial, let's say <1%. You can ensure that by making the chunk processing time >10 ms, if my numbers are right. So if each task takes, say, 1 μs to process, you'd want a chunksize of at least 10000.
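As an illustration of that back-of-the-envelope calculation (the helper name and the 10 ms target are my own assumptions, nothing built into multiprocessing), one could time a few sample tasks and derive a lower bound for the chunksize from them:

import time

def estimate_chunksize(task, sample_inputs, target_chunk_seconds=0.01):
    # Time a handful of representative tasks, then pick a chunksize large
    # enough that each chunk takes roughly target_chunk_seconds (~10 ms),
    # keeping the per-chunk IPC cost well under 1%.
    start = time.perf_counter()
    for x in sample_inputs:
        task(x)
    per_task = (time.perf_counter() - start) / len(sample_inputs)
    return max(1, int(target_chunk_seconds / per_task))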

The main reason not to make chunksize arbitrarily large is that at the very end of the execution, one of the workers might still be running while everyone else has finished -- obviously unnecessarily increasing the time to completion. I suppose in most cases a delay of 10 ms is not a big deal, so my recommendation of targeting 10 ms chunk processing time seems safe.
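If you want to express both constraints in code, a sketch could look like this (the choice of 4 chunks per worker is arbitrary, though similar in spirit to the default chunking heuristic used by Pool.map):

def bounded_chunksize(n_tasks, n_workers, min_chunksize, chunks_per_worker=4):
    # Upper bound: keep at least a few chunks per worker so that the last
    # running chunk cannot delay completion by much.
    cap = max(1, n_tasks // (n_workers * chunks_per_worker))
    # If the bounds conflict, favor a balanced tail over minimal IPC overhead.
    return max(1, min(min_chunksize, cap))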

Another reason a large chunksize might cause problems is that preparing the input may take time, wasting workers' capacity in the meantime. Presumably input preparation is faster than processing (otherwise it should be parallelized as well, using something like RxPY). So again, targeting a processing time of ~10 ms per chunk seems safe (assuming you don't mind a startup delay of under 10 ms).

Note: the context switches happen every ~1-20 ms or so for non-real-time processes on modern Linux/Windows - unless of course the process makes a system call earlier. So the overhead of context switches is no more than ~1% without system calls. Whatever overhead you're creating due to IPC is in addition to that.

max
  • *"So if each task takes say 1 μs to process, you'd want chunksize of at least 10000."* But what if I have 8 processors and 10000 tasks to complete? Specifying a chunksize of 10000 would result in 7 processors being unused. Could that be ideal? – Booboo Oct 19 '20 at 20:43
  • In this case, the entire job will be finished in 10 ms, even on a single processor. If you want to optimize it, it's probably because you'll have lots of such jobs coming in. If so, just let each of those jobs hit a different processor. That would result in a far easier and far more efficient parallelization than trying to split the 10 ms job between 8 processors. If my assumption is incorrect, and you do have a very occasional 10 ms job that you want to parallelize, python isn't the right tool: various python overheads on such a short job will be punishing. – max Oct 24 '20 at 02:19
  • I would then say that in my case a pool size of 1 (or doing the work in the main process) would be the way to go, for it would be pointless to undergo the overhead of creating processes that are never used. For what it's worth, on my Windows desktop with 8 cores, a *really* trivial worker function that takes 0.1 μs and a pool size of 8 running 10000 tasks, the time difference between chunksizes of 10000 and 313 was nil, probably due to the time to create the processes dominating. With a pool size of 1, performance was greatly improved for both chunksizes, but still the time difference was negligible. – Booboo Oct 24 '20 at 12:02

Nothing will replace actual time measurements. I wouldn't bother with a formula; try constants such as 1, 10, 100, 1000, 10000 instead and see what works best in your case.
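For example, a throwaway benchmark along these lines (the work function and the item count are just placeholders):

import multiprocessing
import time

def work(x):
    return x * x  # placeholder for the real task

if __name__ == "__main__":
    n_items = 100000
    with multiprocessing.Pool() as pool:
        for chunksize in (1, 10, 100, 1000, 10000):
            start = time.perf_counter()
            list(pool.imap_unordered(work, range(n_items), chunksize))
            print(chunksize, "->", round(time.perf_counter() - start, 3), "s")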

jfs