parsing a fasta file using a generator ( python )

Question

I am trying to parse a large fasta file and I am encountering out of memory errors. Some suggestions to improve the data handling would be appreciated. Currently the program correctly prints out the names however partially through the file I get a MemoryError

Here is the generator

def readFastaEntry( fp ):
    name = ""
    seq = ""
    for line in fp:
        if line.startswith( ">" ):
            tmp = []
            tmp.append( name )
            tmp.append( seq )
            name = line
            seq = ""
            yield tmp
        else:
            seq = seq.join( line )

and here is the caller stub more will be added after this part works

fp = open( sys.argv[1], 'r' )

for seq in readFastaEntry( fp ) :
    print seq[0]

For those not fimilar with the fasta format here is an example

>1 (PB2)
AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC
TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC
TGCACTCAGGATGAAGTGGATGATG
>2 (PB1)
AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC
ACATTTCCCTATACTGGAGACCCTCC

each entry starts with a ">" stating the name etc then the next N lines are data. There is no defined ending of the data other than the next line having a ">" at the beginning.

The file size will greatly vary from a few hundred megs to GB sizes. The max sequence size should not be more that 1000-2000 bases ( characters, indeterminate number of lines ) — Lamar B, Oct 04 '11 at 23:25
Where are you getting the MemoryError? Please post the full traceback. — Michael Hoffman, Oct 04 '11 at 23:53

Hernan · Accepted Answer · 2011-10-05T08:37:09.027

15

Have you considered using BioPython. They have a sequence reader that can read fasta files. And if you are interested in coding one yourself, you can take a look at BioPython's code.

Edit: Code added

def read_fasta(fp):
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            if name: yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name: yield (name, ''.join(seq))

with open('f.fasta') as fp:
    for name, seq in read_fasta(fp):
        print(name, seq)

edited Oct 05 '11 at 08:37

answered Oct 04 '11 at 23:09

Hernan

5,199
5
46
79

Thanks for the link to the actual code of the file parser. I will be able to modify that to work just fine. Do you have an idea why the current code does not work? I would like to know from a "what not to do next time" POV. – Lamar B Oct 04 '11 at 23:28
1

It is difficult to say why your code is failing but I would rewrite it the lines proposed by hughdbrown. Accumulate each chunk in a list, and only at the end, join the list into a string and return a tuple with name and sequence. Also, I do not know which python version are you using but Python3 uses much more memory for a string that Python 2.x. They are solving this in Python 3.3 http://www.python.org/dev/peps/pep-0393/. I have edited my answer with a possible solution. – Hernan Oct 05 '11 at 08:36
i really just can't get this concept. why "if name:" ? i can do this without a generator but the generator is going over my head. somebody please explain how the "if name" and yield work. if name? the name is None . how does this work?! – O.rka Oct 12 '14 at 10:06
when is name being assigned data? It yields after asking if the bool(name) == True . I don't understand when it is getting data – O.rka Oct 12 '14 at 10:33
Adding the 'yield' makes the function into an iterator -- you can call .next() on it and the next yield in the execution sequence will return some values (name and seq) to the parent scope. The two `if name:` statements are handling initial and final conditions for the parsing operation, with a `None` acting as a Boolean false value. The initial condition is that no seq data has been read -- a `>` is encountered. Once the next `>` is encountered, all of the first sequence will have been read in. Similarly, at the end of the file the last sequence loaded should be passed back. – samhiggins2001 Mar 15 '15 at 07:42
How does 'for line in fp' work? Don't you have to open a file and use readlines() before iterating through? – jukhamil Sep 20 '15 at 15:14
1

@jukhamil The file is opened in the `with`. Additionally, a file is its own iterator. Take a look at: https://docs.python.org/2/library/stdtypes.html#file.next – Hernan Sep 22 '15 at 02:43

score 7 · Answer 2 · answered Oct 05 '11 at 10:11

A pyparsing parser for this format is only a few lines long. See the annotations in the following code:

data = """>1 (PB2) 
AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC 
TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC 
TGCACTCAGGATGAAGTGGATGATG 
>2 (PB1) 
AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC 
ACATTTCCCTATACTGGAGACCCTCC"""

from pyparsing import Word, nums, QuotedString, Combine, OneOrMore

# define some basic forms
integer = Word(nums)
key = QuotedString("(", endQuoteChar=")")

# sequences are "words" made up of the characters A, G, C, and T
# we want to match one or more of them, and have the parser combine
# them into a single string (Combine by default requires all of its
# elements to be adjacent within the input string, but we want to allow
# for the intervening end of lines, so we add adjacent=False)
sequence = Combine(OneOrMore(Word("AGCT")), adjacent=False)

# define the overall pattern to scan for - attach results names
# to each matched element
seqEntry = ">" + integer("index") + key("key") + sequence("sequence")

for seq,s,e in seqEntry.scanString(data):
    # just dump out the matched data
    print seq.dump()
    # could also access fields as seq.index, seq.key and seq.sequence

Prints:

['>', '1', 'PB2', 'AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATACTCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCTGCACTCAGGATGAAGTGGATGATG']
- index: 1
- key: PB2
- sequence: AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATACTCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCTGCACTCAGGATGAAGTGGATGATG
['>', '2', 'PB1', 'AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACCACATTTCCCTATACTGGAGACCCTCC']
- index: 2
- key: PB1
- sequence: AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACCACATTTCCCTATACTGGAGACCCTCC

score 0 · Answer 3 · answered Dec 23 '13 at 08:10

def read_fasta(filename):
    name = None
    with open(filename) as file:
        for line in file:
            if line[0] == ">":
                if name:
                    yield (name, seq)
                name = line[1:-1].split("|")[0]
                seq = ""
            else:
                seq += line[:-1]
        yield (name, seq)

score 0 · Answer 4 · answered Oct 05 '11 at 02:32

Without having a great understanding of what you are doing, I would have written the code like this:

def readFastaEntry( fp ):
    name = ""
    while True:
        line = name or f.readline()
        if not line:
            break
        seq = []
        while True:
            name = f.readline()
            if not name or name.startswith(">"):
                break
            else:
                seq.append(name)
        yield (line, "".join(seq))

This gathers up the data after a starting line up to the next starting line. Making seq an array means that you minimize the string joining until the last possible moment. Yielding a tuple makes more sense than a list.

parsing a fasta file using a generator ( python )

4 Answers4

Linked

Related