Parallelize for-loop in python

Question

I have a simple set of code that runs Clustal Omega (a protein multiple sequence alignment program) from Python:

from Bio.Align.Applications import ClustalOmegaCommandline

segments = range(1, 9)
segments.reverse()

for segment in segments:
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment

    cline = ClustalOmegaCommandline(infile=in_file, 
                                    outfile=out_file, 
                                    distmat_out=distmat, 
                                    distmat_full=True, 
                                    verbose=True,
                                    force=True)
    print cline
    cline()

I've done some informal timing tests of my multiple sequence alignments (MSAs). On average, each one takes 4 hours, so running all 8 one after another took me 32 hours in total. That was my original reason for running it as a for loop: I could let it run and not worry about it.

However, I did yet another informal test - I took the output from the printed cline, and copied-and-pasted it into 8 separate terminal windows spread across two computers, and ran the MSAs that way. On average, each one took about 8 hours or so... but because they were all running at the same time, it took me only 8 hours to get the results.

In some ways, I've discovered parallel processing! :D

But I'm now faced with the dilemma of how to get it running in Python. I've tried looking at the following SO posts, but I still cannot seem to wrap my head around how the multiprocessing module works.

List of posts:

Would anybody be kind enough to share how they would parallelize this loop? Many loops I do look similar to this loop, in which I perform some action on a file and write to another file, without ever needing to aggregate the results in memory. The specific difference I am facing is the need to do file I/O, rather than aggregate results from parallel runs of the loop.

Since the other posts already state multiple ways to achieve parallelization of a `for` loop, is there something *specific* that troubles you? — Michael Foukarakis, Feb 13 '14 at 10:42
None of the examples deal with file I/O, but instead deal with aggregating the results from parallel runs. At least, that's what I thought when I was reading through the posts. Please pardon my ignorance, I still consider myself a newbie to many computing concepts and to Python. — ericmjl, Feb 13 '14 at 10:49
If file I/O is what concerns you, please be *explicit* about that in the question. As it currently stands it is an exact duplicate of the questions you linked. However, AFAIK, file I/O isn't anything special. You'll just have to use some locks between processes to avoid meaningless output to the files and you are done. — Bakuriu, Feb 13 '14 at 11:04
@Bakuriu, I have done so in the post from the beginning. Please read the bottom paragraph. — ericmjl, Feb 13 '14 at 11:08
@ericmjl No it's not clear that the *sole* purpose of your question is how to handle parallel I/O. As it is written it looks like a casual remark on an irrelevant detail of your code. If your question is "How can I handle parallel output to a file?" then *write it*, don't just write "how can I parallelize this loop?" and, in a random place, put "the code happens to contain file I/O". — Bakuriu, Feb 13 '14 at 11:11
@Bakuriu, I took your recommendation and edited in that point as the last statement. On a different note, however, based solely on reading your comment, it took a while to digest what you were saying, as I felt you were coming off as arrogant and pushy, which defeats the need to be welcoming towards newbies and newcomers on SO. I hope we can exchange our ideas respectfully, and that you may read my comment that way too. — ericmjl, Feb 13 '14 at 12:46

score 3 · Accepted Answer · answered Feb 13 '14 at 10:59

Possibly the Joblib library is what you are looking for.

Let me give you an example of its use:

import time
from joblib import Parallel, delayed


def long_function():
    time.sleep(1)


REPETITIONS = 4
Parallel(n_jobs=REPETITIONS)(
    delayed(long_function)() for _ in range(REPETITIONS))

This code runs in about 1 second instead of 4, because the four calls run in parallel.

Adapting your code looks like this (sorry, I can't test if this is correct):

from joblib import Parallel, delayed

from Bio.Align.Applications import ClustalOmegaCommandline


def run(segment):
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment
    cline = ClustalOmegaCommandline(infile=in_file,
                                    outfile=out_file,
                                    distmat_out=distmat,
                                    distmat_full=True,
                                    verbose=True,
                                    force=True)
    print cline
    cline()


if __name__ == "__main__":
    segments = range(1, 9)
    segments.reverse()

    Parallel(n_jobs=len(segments))(
        delayed(run)(segment) for segment in segments)

thank you for this helpful post! May I ask, assuming this is run from the IPython notebook, would I take the section under `if __name__ == '__main__':`, put that in one cell, and run that cell? — ericmjl, Feb 13 '14 at 11:05
Yes, if the rest of the code is included in a previous cell that has been evaluated, then you can run it as you have described. — logc, Feb 13 '14 at 11:18
This does not work in my version of Python. It says there is no such module. — wolfsatthedoor, Aug 21 '14 at 19:24
@robbieboy74: have you installed the Joblib library? It is not a standard module — logc, Aug 21 '14 at 19:28

score 2 · Answer 2 · answered Feb 13 '14 at 10:47

Instead of `for segment in segments`, write `def f(segment)` and then use `multiprocessing.Pool().map(f, segments)`.

Figuring out how to put this in context is left as an exercise to the reader.
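
One way the exercise might be worked out is sketched below. It assumes Python 3, and the helper name `segment_paths` is made up for illustration. The actual call to Clustal Omega is left as a placeholder comment so that the sketch runs without Biopython installed. Each worker uses its own per-segment file names, so no locking between processes should be needed.

```python
import multiprocessing


def segment_paths(segment):
    """Return the (input, aligned output, distance matrix) paths for one segment.

    Hypothetical helper; the path templates are copied from the question.
    """
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment
    return in_file, out_file, distmat


def f(segment):
    """The body of the original for loop, moved into a function."""
    in_file, out_file, distmat = segment_paths(segment)
    # Placeholder: build the ClustalOmegaCommandline from these three paths
    # and call it here, exactly as in the original loop body.
    return out_file


if __name__ == '__main__':
    segments = list(range(8, 0, -1))  # 8, 7, ..., 1
    pool = multiprocessing.Pool()
    try:
        # map blocks until every segment has been processed and
        # returns the results in the same order as the input.
        results = pool.map(f, segments)
    finally:
        pool.close()
        pool.join()
    print(results)
```

With no argument, `Pool()` starts one worker per available CPU, so on a machine with fewer than 8 CPUs some segments wait until a worker becomes free.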

Brave Sir Robin

I think using `imap` instead of `map` might be appropriate here as the runtimes are so long. – msvalkon Feb 13 '14 at 10:49
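
A minimal sketch of the `imap` variant follows, assuming Python 3. The function `align` and its short sleep are stand-ins for a real multi-hour alignment, and `run_all` is a hypothetical helper name. Unlike `map`, which returns only once every job is done, `imap` yields results one at a time in input order, so progress can be reported while later jobs are still running.

```python
import multiprocessing
import time


def align(segment):
    """Stand-in for one alignment: sleep briefly, then report the output name."""
    time.sleep(0.05)
    return 'Segment %d Aligned.fasta' % segment


def run_all(segments, processes=4):
    """Run align on every segment and report each result as it arrives."""
    finished = []
    pool = multiprocessing.Pool(processes)
    try:
        # imap returns an iterator; each result becomes available as soon as
        # it and all results before it in the input order are ready.
        for name in pool.imap(align, segments):
            print('Finished: %s' % name)
            finished.append(name)
    finally:
        pool.close()
        pool.join()
    return finished


if __name__ == '__main__':
    run_all(list(range(8, 0, -1)))
```

If the order of the reports does not matter, `imap_unordered` has the same call shape and yields each result as soon as its worker finishes.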
