344

This is probably a trivial question, but how do I parallelize the following loop in python?

# setup output lists
output1 = list()
output2 = list()
output3 = list()

for j in range(0, 10):
    # calc individual parameter value
    parameter = j * offset
    # call the calculation
    out1, out2, out3 = calc_stuff(parameter = parameter)

    # put results into correct output list
    output1.append(out1)
    output2.append(out2)
    output3.append(out3)

I know how to start single threads in Python but I don't know how to "collect" the results.

Multiple processes would be fine too - whatever is easiest for this case. I'm currently using Linux, but the code should run on Windows and Mac as well.

What's the easiest way to parallelize this code?

Aaron Hall
memyself
  • One very easy solution to parallelize a `for` loop is not yet mentioned as an answer - this would be by simply decorating two functions by using the [`deco`](https://github.com/alex-sherman/deco) package – Tom Roden Oct 28 '20 at 08:42

15 Answers

230

Using multiple threads on CPython won't give you better performance for pure-Python code due to the global interpreter lock (GIL). I suggest using the multiprocessing module instead:

import multiprocessing

pool = multiprocessing.Pool(4)
out1, out2, out3 = zip(*pool.map(calc_stuff, range(0, 10 * offset, offset)))

Note that this won't work in the interactive interpreter.

To avoid the usual FUD around the GIL: There wouldn't be any advantage to using threads for this example anyway. You want to use processes here, not threads, because they avoid a whole bunch of problems.
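
Here is a more complete sketch of the same idea (calc_stuff is a hypothetical stand-in for your real function, and the offset value is just assumed for illustration), with the pool created under the __main__ guard so it also works with the spawn start method used on Windows and macOS:

import multiprocessing

def calc_stuff(parameter):
    # hypothetical stand-in for the real calculation
    return parameter, parameter ** 2, parameter ** 3

if __name__ == '__main__':
    offset = 2  # assumed value, just for illustration
    with multiprocessing.Pool(4) as pool:
        results = pool.map(calc_stuff, range(0, 10 * offset, offset))
    # transpose the list of 3-tuples back into three output lists
    output1, output2, output3 = map(list, zip(*results))
    print(output1, output2, output3)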

user124384
Sven Marnach
  • 75
    Since this is the chosen answer, is it possible to have a more comprehensive example? What are the arguments of `calc_stuff`? – Eduardo Pignatelli Apr 11 '18 at 15:28
  • 7
    @EduardoPignatelli Please just read the documentation of the `multiprocessing` module for more comprehensive examples. `Pool.map()` basically works like `map()`, but in parallel. – Sven Marnach Apr 11 '18 at 16:30
  • 4
    Is there a way to simply add in a tqdm loading bar to this structure of code? I've used tqdm(pool.imap(calc_stuff, range(0, 10 * offset, offset))) but I don't get a full loading bar graphic. – user8188120 Jul 05 '18 at 13:35
  • @user8188120 I've never heard of tqdm before, so sorry, I can't help with that. – Sven Marnach Jul 06 '18 at 14:15
  • For a tqdm loading bar see this question: https://stackoverflow.com/questions/41920124/multiprocessing-use-tqdm-to-display-a-progress-bar – Johannes Jun 03 '19 at 08:42
  • To avoid anyone else falling into the trap I just did - instantiation of the pool and calling of `pool.map` needs to be inside a function: https://stackoverflow.com/questions/32995897/python-multiprocessing-pool-hangs-on-map-call – kabdulla Jan 07 '21 at 19:57
83

To parallelize a simple for loop, joblib brings a lot of value over raw use of multiprocessing: not only the short syntax, but also things like transparent batching of iterations when they are very fast (to remove the overhead) and capturing the traceback of the child process, for better error reporting.

Disclaimer: I am the original author of joblib.
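
A minimal sketch of what that looks like for the loop in the question (process is a hypothetical stand-in for calc_stuff):

from joblib import Parallel, delayed

def process(j):
    # hypothetical stand-in for calc_stuff(parameter=j * offset)
    return j, j * 2, j * 3

# run the 10 iterations on all available cores and split the 3-tuples
# back into three output lists
output1, output2, output3 = map(list, zip(*Parallel(n_jobs=-1)(
    delayed(process)(j) for j in range(10))))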

Gael Varoquaux
  • 2
    I tried joblib with jupyter, it is not working. After the Parallel-delayed call, the page stopped working. – Jie May 23 '18 at 18:13
  • 1
    Hi, I have a problem using joblib (https://stackoverflow.com/questions/52166572/python-parallel-no-space-cant-pickle), do you have any clue what may be the cause? Thanks very much. – Ting Sun Sep 05 '18 at 03:08
  • Seems like something I want to give a shot! Is it possible to use it with a double loop e.g for i in range(10): for j in range(20) – CutePoison Apr 27 '20 at 09:07
82
from joblib import Parallel, delayed
def process(i):
    return i * i
    
results = Parallel(n_jobs=2)(delayed(process)(i) for i in range(10))
print(results)

The above works beautifully on my machine (Ubuntu, package joblib was pre-installed, but can be installed via pip install joblib).

Taken from https://blog.dominodatalab.com/simple-parallelization/


Edit on Mar 31, 2021: On joblib, multiprocessing, threading and asyncio

  • joblib in the above code uses import multiprocessing under the hood (and thus multiple processes, which is typically the best way to run CPU work across cores - because of the GIL)
  • You can let joblib use multiple threads instead of multiple processes, but this (or using import threading directly) is only beneficial if the threads spend considerable time on I/O (e.g. read/write to disk, send an HTTP request). For I/O work, the GIL does not block the execution of another thread (see the thread-based sketch after the timing example below)
  • Since Python 3.7, as an alternative to threading, you can parallelise work with asyncio, but the same advice as for import threading applies (though in contrast to the latter, only one thread will be used)
  • Using multiple processes incurs overhead. You need to check yourself if the above code snippet improves your wall time. Here is another one, for which I confirmed that joblib produces better results:
import time
from joblib import Parallel, delayed

def countdown(n):
    while n>0:
        n -= 1
    return n


t = time.time()
for _ in range(20):
    print(countdown(10**7), end=" ")
print(time.time() - t)  
# takes ~10.5 seconds on medium sized Macbook Pro


t = time.time()
results = Parallel(n_jobs=2)(delayed(countdown)(10**7) for _ in range(20))
print(results)
print(time.time() - t)
# takes ~6.3 seconds on medium sized Macbook Pro
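
For the I/O-bound case from the bullet list above, here is a minimal thread-based sketch (download_url is a hypothetical helper, and prefer="threads" assumes joblib >= 0.12):

from joblib import Parallel, delayed
from urllib.request import urlopen

def download_url(url):
    # I/O-bound work: the GIL is released while waiting on the network
    with urlopen(url) as response:
        return len(response.read())

urls = ["https://www.python.org", "https://www.example.com"]
sizes = Parallel(n_jobs=4, prefer="threads")(delayed(download_url)(u) for u in urls)
print(sizes)
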
tyrex
  • 5
    I tried your code but on my system the sequential version of this code takes about half a minute and the above parallel version takes 4 minutes. Why so? – shaifali Gupta Mar 13 '19 at 06:01
  • 4
    Thanks for your answer! I think this is the most elegant way to do this in 2019. – Heikki Pulkkinen Apr 26 '19 at 07:39
  • 2
    @tyrex thanks for sharing! this joblib package is great and the example works for me. Though, in a more complex context I had a bug unfortunately. https://github.com/joblib/joblib/issues/949 – Open Food Broker Oct 18 '19 at 10:43
  • 1
    Works on Windows Python 3.6, Doesn't work on Ubuntu Python 3.6 (runs but sequentially). – Hamza Dec 18 '19 at 05:33
  • 2
    @shaifaliGupta I think it really depends on how long your function processInput takes for each sample. If the time is short for each i, you will not see any improvement. I actually tried the code to find out: if the function processInput takes little time, plain for-loops actually perform better. However, if your function processInput takes a long time to run, using this parallel method is far superior. – aysljc Jan 04 '20 at 19:49
  • 2
    this works, but for anyone trying to use this with windows and have output display through a jupyter notebook, you will run into the issues here https://stackoverflow.com/questions/55955330/printed-output-not-displayed-when-using-joblib-in-jupyter-notebook – spizwhiz Jul 23 '20 at 18:37
66

What's the easiest way to parallelize this code?

Use a PoolExecutor from concurrent.futures. Compare the original code with this, side by side. First, the most concise way to approach this is with executor.map:

...
with ProcessPoolExecutor() as executor:
    for out1, out2, out3 in executor.map(calc_stuff, parameters):
        ...

or broken down by submitting each call individually:

...
with ThreadPoolExecutor() as executor:
    futures = []
    for parameter in parameters:
        futures.append(executor.submit(calc_stuff, parameter))

    for future in futures:
        out1, out2, out3 = future.result() # this will block
        ...

Leaving the context manager signals the executor to free up resources.

You can use threads or processes and use the exact same interface.
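
To connect this directly to the original loop, here is a self-contained sketch of the executor.map version (calc_stuff is a hypothetical stand-in, and offset is just an assumed value); swapping ProcessPoolExecutor for ThreadPoolExecutor requires no other changes:

from concurrent.futures import ProcessPoolExecutor

def calc_stuff(parameter):
    # hypothetical stand-in for the real calculation
    return parameter, parameter * 2, parameter * 3

if __name__ == '__main__':
    offset = 1  # assumed value, just for illustration
    parameters = [j * offset for j in range(10)]
    output1, output2, output3 = [], [], []
    with ProcessPoolExecutor() as executor:
        for out1, out2, out3 in executor.map(calc_stuff, parameters):
            output1.append(out1)
            output2.append(out2)
            output3.append(out3)
    print(output1, output2, output3)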

A working example

Here is working example code that will demonstrate the value of parallelization:

Put this in a file - futuretest.py:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from time import time
from http.client import HTTPSConnection

def processor_intensive(arg):
    def fib(n): # recursive, processor intensive calculation (avoid n > 36)
        return fib(n-1) + fib(n-2) if n > 1 else n
    start = time()
    result = fib(arg)
    return time() - start, result

def io_bound(arg):
    start = time()
    con = HTTPSConnection(arg)
    con.request('GET', '/')
    result = con.getresponse().getcode()
    return time() - start, result

def manager(PoolExecutor, calc_stuff):
    if calc_stuff is io_bound:
        inputs = ('python.org', 'stackoverflow.com', 'stackexchange.com',
                  'noaa.gov', 'parler.com', 'aaronhall.dev')
    else:
        inputs = range(25, 32)
    timings, results = list(), list()
    start = time()
    with PoolExecutor() as executor:
        for timing, result in executor.map(calc_stuff, inputs):
            # put results into correct output list:
            timings.append(timing), results.append(result)
    finish = time()
    print(f'{calc_stuff.__name__}, {PoolExecutor.__name__}')
    print(f'wall time to execute: {finish-start}')
    print(f'total of timings for each call: {sum(timings)}')
    print(f'time saved by parallelizing: {sum(timings) - (finish-start)}')
    print(dict(zip(inputs, results)), end = '\n\n')

def main():
    for computation in (processor_intensive, io_bound):
        for pool_executor in (ProcessPoolExecutor, ThreadPoolExecutor):
            manager(pool_executor, calc_stuff=computation)

if __name__ == '__main__':
    main()

And here's the output for one run of python -m futuretest:

processor_intensive, ProcessPoolExecutor
wall time to execute: 0.7326343059539795
total of timings for each call: 1.8033506870269775
time saved by parallelizing: 1.070716381072998
{25: 75025, 26: 121393, 27: 196418, 28: 317811, 29: 514229, 30: 832040, 31: 1346269}

processor_intensive, ThreadPoolExecutor
wall time to execute: 1.190223217010498
total of timings for each call: 3.3561410903930664
time saved by parallelizing: 2.1659178733825684
{25: 75025, 26: 121393, 27: 196418, 28: 317811, 29: 514229, 30: 832040, 31: 1346269}

io_bound, ProcessPoolExecutor
wall time to execute: 0.533886194229126
total of timings for each call: 1.2977914810180664
time saved by parallelizing: 0.7639052867889404
{'python.org': 301, 'stackoverflow.com': 200, 'stackexchange.com': 200, 'noaa.gov': 301, 'parler.com': 200, 'aaronhall.dev': 200}

io_bound, ThreadPoolExecutor
wall time to execute: 0.38941240310668945
total of timings for each call: 1.6049387454986572
time saved by parallelizing: 1.2155263423919678
{'python.org': 301, 'stackoverflow.com': 200, 'stackexchange.com': 200, 'noaa.gov': 301, 'parler.com': 200, 'aaronhall.dev': 200}

Processor-intensive analysis

When performing processor intensive calculations in Python, expect the ProcessPoolExecutor to be more performant than the ThreadPoolExecutor.

Due to the Global Interpreter Lock (a.k.a. the GIL), threads cannot use multiple processors, so expect the time for each calculation and the wall time (elapsed real time) to be greater.

IO-bound analysis

On the other hand, when performing IO bound operations, expect ThreadPoolExecutor to be more performant than ProcessPoolExecutor.

Python's threads are real OS threads. They can be put to sleep by the operating system and reawakened when their information arrives.

Final thoughts

I suspect that multiprocessing will be slower on Windows, since Windows doesn't support forking so each new process has to take time to launch.

You can nest multiple threads inside multiple processes, but it's recommended to not use multiple threads to spin off multiple processes.

If faced with a heavy processing problem in Python, you can trivially scale with additional processes - but not so much with threading.

Aaron Hall
  • does ThreadPoolExecutor bypass the limitations imposed by GIL? also wouldnt you need to join() in order to wait for the executors to finish or is this taken care of implicitly inside the context manager – PirateApp Apr 19 '18 at 05:05
  • 1
    No and no, yes to "handled implicitly" – Aaron Hall Apr 19 '18 at 14:59
  • For some reason, when scaling up the problem, multithreading is extremely fast, but multiprocessing spawns a bunch of stuck processes (in macOS). Any idea why that could be? The process contains just nested loops and math, nothing exotic. – komodovaran_ Jan 16 '19 at 14:39
  • 1
    @komodovaran_ A process is a full Python process, one per each, while a thread is just a thread of execution with its own stack that shares the process, its bytecode and everything else it has in memory with all the other threads - does that help? – Aaron Hall Jan 16 '19 at 15:01
  • thank you for actually providing a fully working example – thistleknot Jan 29 '21 at 13:33
34

This is the easiest way to do it!

You can use asyncio. (Documentation can be found here). It is used as a foundation for multiple Python asynchronous frameworks that provide high-performance network and web servers, database connection libraries, distributed task queues, etc. Plus it has both high-level and low-level APIs to accommodate any kind of problem.

import asyncio
from functools import partial

def background(f):
    def wrapped(*args, **kwargs):
        # run f in the default executor; partial forwards keyword arguments,
        # which run_in_executor does not accept directly
        return asyncio.get_event_loop().run_in_executor(None, partial(f, *args, **kwargs))

    return wrapped

@background
def your_function(argument):
    #code

Now this function will run in parallel whenever it is called, without putting the main program into a wait state. You can use it to parallelize a for loop as well. When called from a for loop, the loop itself remains sequential, but every iteration runs in parallel to the main program as soon as the interpreter gets there. For instance:

import time

@background
def your_function(argument):
    time.sleep(5)
    print('function finished for ' + str(argument))


for i in range(10):
    your_function(i)


print('loop finished')

This produces the following output:

loop finished
function finished for 4
function finished for 8
function finished for 0
function finished for 3
function finished for 6
function finished for 2
function finished for 5
function finished for 7
function finished for 9
function finished for 1
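
If you also need the return values, one way to collect them with this decorator (a sketch, relying on asyncio.get_event_loop() creating and reusing a loop in the main thread, which is the pre-3.10 behaviour) is to gather the futures and run the loop until they complete:

import asyncio

loop = asyncio.get_event_loop()

# each call returns an asyncio future immediately; the work starts right away
futures = [your_function(i) for i in range(10)]

# run the loop until every background call has finished and collect the results
# (for the print-only example above, each result is simply None)
results = loop.run_until_complete(asyncio.gather(*futures))
print(results)
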
Hamza
  • 1
    Thank you! I agree that this is the easiest way to do it – mikey Nov 03 '20 at 11:44
  • Imagine you have different prints in your_function(), is there a way to force it to execute all prints then pass to the next i in the for loop ? – BAKYAC Mar 11 '21 at 10:57
20

There are a number of advantages to using Ray:

  • You can parallelize over multiple machines in addition to multiple cores (with the same code).
  • Efficient handling of numerical data through shared memory (and zero-copy serialization).
  • High task throughput with distributed scheduling.
  • Fault tolerance.

In your case, you could start Ray and define a remote function

import ray

ray.init()

@ray.remote(num_return_vals=3)
def calc_stuff(parameter=None):
    # Do something.
    return 1, 2, 3

and then invoke it in parallel

output1, output2, output3 = [], [], []

# Launch the tasks.
for j in range(10):
    id1, id2, id3 = calc_stuff.remote(parameter=j)
    output1.append(id1)
    output2.append(id2)
    output3.append(id3)

# Block until the results have finished and get the results.
output1 = ray.get(output1)
output2 = ray.get(output2)
output3 = ray.get(output3)

To run the same example on a cluster, the only line that would change would be the call to ray.init(). The relevant documentation can be found here.
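
For example, connecting to an existing cluster might look like this (a sketch, assuming a Ray version that supports address="auto"):

# connect to a running Ray cluster instead of starting a local one
ray.init(address="auto")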

Note that I'm helping to develop Ray.

Robert Nishihara
  • 7
    For anyone considering ray, it may be relevant to know it does not natively support Windows. Some hacks to get it to work in Windows using WSL (Windows Subsystem for Linux) are possible, though it's hardly out-the-box if you want to use Windows. – OscarVanL Feb 13 '20 at 16:04
  • 1
    Sadly it doesn't support Python 3.9 yet. – adonig Apr 11 '21 at 22:16
7

I found joblib very useful. Please see the following example:

from joblib import Parallel, delayed

def yourfunction(k):
    s = 3.14 * k * k
    print("Area of a circle with a radius", k, "is:", s)

element_run = Parallel(n_jobs=-1)(delayed(yourfunction)(k) for k in range(1, 10))

n_jobs=-1: use all available cores

miuxu
  • 20
    You know, it is better to check already existing answers before posting your own. [This answer](https://stackoverflow.com/a/50926231/9609843) also proposes to use `joblib`. – sanyassh Mar 28 '19 at 14:24
5

Why don't you use threads, and one mutex to protect one global list?

from threading import Thread, Lock

class thread_it(Thread):
    def __init__(self, param):
        Thread.__init__(self)
        self.param = param
    def run(self):
        result = calc_stuff(self.param)  # do the work outside the lock
        mutex.acquire()
        output.append(result)            # the lock protects the shared list
        mutex.release()


threads = []
output = []
mutex = Lock()

for j in range(0, 10):
    current = thread_it(j * offset)
    threads.append(current)
    current.start()

for t in threads:
    t.join()

# here you have the output list filled with data

Keep in mind that you will only be as fast as your slowest thread.

jackdoe
  • 8
    I know this is a very old answer, so it's a bummer to get a random downvote out of nowhere. I only downvoted because threads won't parallelize anything. Threads in Python are bound to only one thread executing on the interpreter at a time because of the global interpreter lock, so they support [concurrent programming, but not parallel](http://stackoverflow.com/q/1897993/2615940) as OP is requesting. – skrrgwasme Mar 03 '17 at 07:12
  • 7
    @skrrgwasme I know you know this, but when you use the words "they won't parallelize anything", that might mislead readers. If the operations take a long time because they are IO bound, or sleeping while they wait for an event, then the interpreter is freed up to run the other threads, so this will result in the speed increase people are hoping for in those cases. Only CPU bound threads are really affected by what skrrgwasme says. – Jonathan Hartley Sep 11 '17 at 16:23
3

Let's say we have an async function:

async def work_async(self, student_name: str, code: str, loop):
    """
    Some async function
    """
    # Do some async processing

It needs to be run over a large array. Some arguments are passed in directly, and some are taken from a property of a dictionary element in the array.

async def process_students(self, student_name: str, loop):
    market = sys.argv[2]
    subjects = [...]  # Some large array
    batchsize = 5
    for i in range(0, len(subjects), batchsize):
        batch = subjects[i:i + batchsize]
        await asyncio.gather(*(self.work_async(student_name, sub['Code'], loop)
                               for sub in batch))
Amit Teli
3

thanks @iuryxavier

from multiprocessing import Pool
from multiprocessing import cpu_count


def add_1(x):
    return x + 1

if __name__ == "__main__":
    pool = Pool(cpu_count())                  # one worker process per CPU core
    results = pool.map(add_1, range(10**12))  # distribute the calls across the pool
    pool.close()  # no more tasks will be submitted to the pool
    pool.join()   # wait for the worker processes to finish
  • 4
    -1. This is a code-only answer. I'd suggest adding an explanation that tells readers what the code you've posted does, and perhaps where they can locate additional information. – starbeamrainbowlabs Dec 12 '19 at 15:45
2

This could be useful when implementing multiprocessing and parallel/distributed computing in Python.

YouTube tutorial on using techila package

Techila is a distributed computing middleware, which integrates directly with Python using the techila package. The peach function in the package can be useful in parallelizing loop structures. (The following code snippet is from the Techila Community Forums.)

techila.peach(funcname = 'theheavyalgorithm', # Function that will be called on the compute nodes/ Workers
    files = 'theheavyalgorithm.py', # Python-file that will be sourced on Workers
    jobs = jobcount # Number of Jobs in the Project
    )
TEe
  • 1
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – S.L. Barth Oct 22 '15 at 09:29
  • 3
    @S.L.Barth thank you for the feedback. I added a small sample code to the answer. – TEe Oct 22 '15 at 12:26
1

Dask futures; I'm surprised no one has mentioned it yet...

from dask.distributed import Client

client = Client(n_workers=8) # In this example I have 8 cores and processes (can also use threads if desired)

def my_function(i):
    output = <code to execute in the for loop here>
    return output

futures = []

for i in <whatever you want to loop across here>:
    future = client.submit(my_function, i)
    futures.append(future)

results = client.gather(futures)
client.close()
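
As a concrete, hedged illustration of the same pattern (squaring numbers stands in for the real loop body, and the worker count is just an assumption), client.map can replace the explicit submit loop:

from dask.distributed import Client

def my_function(i):
    # hypothetical stand-in for the real work
    return i * i

if __name__ == '__main__':
    client = Client(n_workers=4)                   # assumed worker count
    futures = client.map(my_function, range(10))   # one future per loop iteration
    results = client.gather(futures)
    client.close()
    print(results)
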
itwasthekix
1

Have a look at this;

http://docs.python.org/library/queue.html

This might not be the right way to do it, but I'd do something like this:

Actual code:

from multiprocessing import Process, JoinableQueue as Queue
from queue import Empty

class CustomWorker(Process):
    def __init__(self, workQueue, out1, out2, out3):
        Process.__init__(self)
        self.input = workQueue
        self.out1 = out1
        self.out2 = out2
        self.out3 = out3

    def run(self):
        while True:
            try:
                value = self.input.get()
                # value modifier
                temp1, temp2, temp3 = self.calc_stuff(value)
                self.out1.put(temp1)
                self.out2.put(temp2)
                self.out3.put(temp3)
                self.input.task_done()
            except Empty:
                return
                # Catch things better here

    def calc_stuff(self, param):
        out1 = param * 2
        out2 = param * 4
        out3 = param * 8
        return out1, out2, out3

def Main():
    inputQueue = Queue()
    for i in range(10):
        inputQueue.put(i)
    out1 = Queue()
    out2 = Queue()
    out3 = Queue()
    processes = []
    for x in range(2):
        p = CustomWorker(inputQueue, out1, out2, out3)
        p.daemon = True
        p.start()
        processes.append(p)
    inputQueue.join()
    while not out1.empty():
        print(out1.get())
        print(out2.get())
        print(out3.get())

if __name__ == '__main__':
    Main()

Hope that helps.

MerreM
0

The concurrent wrappers by the tqdm library are a nice way to parallelize longer-running code. tqdm provides feedback on the current progress and remaining time through a smart progress meter, which I find very useful for long computations.

Loops can be rewritten to run as concurrent threads through a simple call to thread_map, or as concurrent multi-processes through a simple call to process_map:

from tqdm.contrib.concurrent import thread_map, process_map


def calc_stuff(num, multiplier):
    import time

    time.sleep(1)

    return num, num * multiplier


if __name__ == "__main__":

    # let's parallelize this for loop:
    # results = [calc_stuff(i, 2) for i in range(64)]

    loop_idx = range(64)
    multiplier = [2] * len(loop_idx)

    # either with threading:
    results_threading = thread_map(calc_stuff, loop_idx, multiplier)

    # or with multi-processing:
    results_processes = process_map(calc_stuff, loop_idx, multiplier)
w-m
-2

A very simple example of parallel processing is:

from multiprocessing import Process

output1 = list()
output2 = list()
output3 = list()

def yourfunction():
    for j in range(0, 10):
        # calc individual parameter value
        parameter = j * offset
        # call the calculation
        out1, out2, out3 = calc_stuff(parameter=parameter)

        # put results into correct output list
        output1.append(out1)
        output2.append(out2)
        output3.append(out3)

if __name__ == '__main__':
    p = Process(target=yourfunction)
    p.start()
    p.join()
ncopiy
Adil Warsi
  • 6
    There's no parallelism in the for loop here, you are just spawning a process that runs the whole loop; this is NOT what the OP intended. – Pepe Mandioca Jan 02 '20 at 19:48