
I don't know how to parallelise code in Python that takes each sequence of a FASTA file and computes some statistics on it, such as GC content. Do you have any tips or libraries that would help me decrease the execution time?

I've tried to use os.fork(), but it gives me a longer execution time than the sequential code, probably because I don't know very well how to give each child a different sequence.

# Computing GC content
from Bio import SeqIO
with open('chr1.fa', 'r') as f:
    records = list(SeqIO.parse(f, 'fasta'))
    GC_for_sequence = []
    for i in records:
        GC = 0
        for j in i:
            if j in "GC":
                GC += 1
        GC_for_sequence.append(GC / len(i))
    print(GC_for_sequence)

The expected behaviour would be: each process takes one sequence, and they compute the statistics in parallel.


2 Answers


a few notes on your existing code to start with:

  1. I'd suggest not doing list(SeqIO.parse(…)), as that will pause execution until all sequences have been loaded into memory; you're much better off (for both memory and total execution time) leaving it as an iterator and handing elements off to workers as needed

  2. looping over each character is pretty slow; using str.count is going to be much faster (see the quick benchmark sketch below)
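For a rough sense of the difference, here's a quick timeit sketch; the sequence is synthetic, purely for illustration:

import timeit

# synthetic 1 Mbp sequence, just for benchmarking
seq = "ACGT" * 250_000

loop_time = timeit.timeit(
    "sum(1 for c in seq if c in 'GC')", globals=globals(), number=10)
count_time = timeit.timeit(
    "seq.count('G') + seq.count('C')", globals=globals(), number=10)

print(f"char loop:  {loop_time:.3f}s")
print(f"str.count:  {count_time:.3f}s")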

putting this together, you can do:

from Bio import SeqIO

with open('chr1.fa') as fd:
    gc_for_sequence = []
    for seq in SeqIO.parse(fd, 'fasta'):
        # count G and C bases, then divide by the record length
        gc = sum(seq.seq.count(base) for base in "GC")
        gc_for_sequence.append(gc / len(seq))

if this still isn't fast enough, then you can use the multiprocessing module like:

from Bio import SeqIO
from multiprocessing import Pool

def sequence_gc_prop(seq):
    # proportion of the sequence that is G or C
    return sum(seq.count(base) for base in "GC") / len(seq)

with open('chr1.fa') as fd, Pool() as pool:
    gc_for_sequence = pool.map(
        sequence_gc_prop,
        # pass only the raw sequences to the workers, not whole records
        (seq.seq for seq in SeqIO.parse(fd, 'fasta')),
        chunksize=1000,
    )

comments from Lukasz mostly apply. other non-obvious stuff:

  • the weird seq.seq for seq in… stuff is to make sure that we're not pickling unnecessary data: only the sequence itself is sent to each worker, not the whole SeqRecord
  • I'm setting chunksize to quite a large value because the function is quick, hence we want to give children a reasonable amount of work to do, so the parent process doesn't spend all its time orchestrating things (see the timing sketch below)
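If you want to pick a chunksize empirically, a rough timing harness like this sketch can help; it assumes the same chr1.fa input as above, and the chunk sizes tried are arbitrary:

import time
from multiprocessing import Pool
from Bio import SeqIO

def sequence_gc_prop(seq):
    # proportion of the sequence that is G or C
    return sum(seq.count(base) for base in "GC") / len(seq)

# try a few chunk sizes and report the wall-clock time for each
for chunk in (1, 100, 1000, 10000):
    with open('chr1.fa') as fd, Pool() as pool:
        start = time.perf_counter()
        pool.map(sequence_gc_prop,
                 (seq.seq for seq in SeqIO.parse(fd, 'fasta')),
                 chunksize=chunk)
        print(f"chunksize={chunk}: {time.perf_counter() - start:.2f}s")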
Sam Mason
  • Thank you! So, with the input data inside the pool.map call ((seq.seq for seq in SeqIO.parse(fd, 'fasta'))), the module takes every sequence of my file and calculates the GC content in parallel? – Sergi Aguiló Jan 14 '19 at 09:35
  • should do! you could run `top` at the same time to make sure it's actually running across multiple processors. at a guess: this sort of task isn't very well suited to running in parallel, as the amount of useful work that can be distributed to each processor is limited, meaning that the main process will spend most of its time reading data and coordinating. reframing the problem would help, e.g. processing multiple files at once – Sam Mason Jan 14 '19 at 11:34
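Following up on that last suggestion, here's a minimal sketch of distributing work per file instead of per sequence; the file names are hypothetical:

from multiprocessing import Pool
from Bio import SeqIO

def gc_props_for_file(path):
    # GC proportion for every sequence in a single file
    with open(path) as fd:
        return [sum(seq.seq.count(base) for base in "GC") / len(seq)
                for seq in SeqIO.parse(fd, 'fasta')]

if __name__ == '__main__':
    # hypothetical chromosome files; replace with your own
    files = ['chr1.fa', 'chr2.fa', 'chr3.fa']
    with Pool() as pool:
        results = pool.map(gc_props_for_file, files)

This way each worker does a whole file's worth of reading and counting, so the parent only has to hand out file names.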

Here's one idea with the standard multiprocessing module:

from multiprocessing import Pool
import numpy as np

no_cores_to_use = 4

# simulate per-sequence statistics with random data
GC_for_sequence = [np.random.rand(100) for x in range(10)]

with Pool(no_cores_to_use) as pool:
    # average each array in a separate worker process
    result = pool.map(np.average, GC_for_sequence)

print(result)

In the code I used the numpy module to simulate a list with some content. pool.map takes the function that you want to apply to your data as its first argument and the data list as its second. You can easily define the function yourself; by default, it should take a single argument. If you want to pass more, use functools.partial.
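For instance, a minimal sketch of functools.partial with pool.map; the base_fraction helper and its bases argument are made up for illustration:

from functools import partial
from multiprocessing import Pool

def base_fraction(bases, sequence):
    # fraction of characters in `sequence` that appear in `bases`
    return sum(sequence.count(b) for b in bases) / len(sequence)

records = ['ACTGTCGCAGC' for x in range(10)]

with Pool(4) as pool:
    # partial fixes bases='GC', so map only has to supply each sequence
    result = pool.map(partial(base_fraction, 'GC'), records)

print(result)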

[EDIT] Here's an example much closer to your problem:

from multiprocessing import Pool

records = ['ACTGTCGCAGC' for x in range(10)]
no_cores_to_use = 4

def count_gc(sequence):
    # count the G and C bases individually; counting the substring
    # 'GC' would miss bases that aren't adjacent to each other
    return sequence.count('G') + sequence.count('C')

with Pool(no_cores_to_use) as pool:
    result = pool.map(count_gc, records)

print(sum(result))
Lukasz Tracewski
  • OK! But if I want to process a large number of different sequences, do I need to put them in a list like the one you made in records? – Sergi Aguiló Jan 14 '19 at 09:31
  • From your code it looks like you start with a list ( `list (SeqIO.parse(f,'fasta'))` ), so there's no need to do anything extra. – Lukasz Tracewski Jan 14 '19 at 11:25
  • And would the module work if I fed the data in on stdin, by `cat`-ing the file in the shell? – Sergi Aguiló Jan 14 '19 at 11:40
  • You could make it work, but it would be very inefficient. Check this answer: https://stackoverflow.com/questions/7654971/parsing-a-fasta-file-using-a-generator-python In short, either use the explicit generator given there or use the iterator `SeqIO.parse("filename", "fasta")`. In your code you turn the generator into a list. Don't. Just pass the iterator to `map`. – Lukasz Tracewski Jan 14 '19 at 12:51
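For reference, here's a minimal sketch of the generator approach described in the linked answer; plain Python, no Biopython required:

def read_fasta(path):
    # yield (header, sequence) pairs one record at a time
    header, chunks = None, []
    with open(path) as fd:
        for line in fd:
            line = line.rstrip()
            if line.startswith('>'):
                if header is not None:
                    yield header, ''.join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

Because it yields records lazily, its output can be fed straight to pool.map without building a list first.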