9

The description of the problem I am having is a bit complicated, and I will err on the side of providing more complete information. For the impatient, here is the briefest way I can summarize it:

What is the fastest (least execution time) way to split a text file into ALL (overlapping) substrings of size N (bounded N, e.g. 36) while throwing out newline characters?

I am writing a module which parses files in the FASTA ASCII-based genome format. These files make up what is known as the 'hg18' human reference genome, which you can download from the UCSC genome browser (go slugs!) if you like.

As you will notice, the genome comprises the files chr[1..22].fa and chr[XY].fa, as well as a set of other small files which are not used in this module.

Several modules already exist for parsing FASTA files, such as BioPython's SeqIO. (Sorry, I'd post a link, but I don't have the points to do so yet.) Unfortunately, none of the modules I've been able to find does the specific operation I am trying to do.

My module needs to split the genome data ('CAGTACGTCAGACTATACGGAGCTA' could be a line, for instance) into every single overlapping N-length substring. Let me give an example using a very small file (the actual chromosome files are between 355 and 20 million characters long) and N=8:

>>> import cStringIO
>>> example_file = cStringIO.StringIO("""\
>header
CAGTcag
TFgcACF
""")
>>> for read in parse(example_file):
...    print read
...
CAGTCAGT
AGTCAGTF
GTCAGTFG
TCAGTFGC
CAGTFGCA
AGTFGCAC
GTFGCACF

The function with the absolute best performance, of the methods I could think of, is this:


def parse(file):
  size = 8 # of course in my code this is a function argument
  file.readline() # skip past the header
  buffer = ''
  for line in file:
    buffer += line.rstrip().upper()
    while len(buffer) >= size:
      yield buffer[:size]
      buffer = buffer[1:]

This works, but unfortunately it still takes about 1.5 hours (see note below) to parse the human genome this way. Perhaps this is the very best I am going to see with this method (a complete code refactor might be in order, but I'd like to avoid it as this approach has some very specific advantages in other areas of the code), but I thought I would turn this over to the community.

Thanks!

  • Note: this time includes a lot of extra calculation, such as computing the opposing-strand read and doing lookups against a hash table approximately 5 GB in size.
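
(For context, the "opposing strand read" is the reverse complement of a read. A tiny, hypothetical sketch of that step, assuming plain A/C/G/T (plus N) bases and not taken from my actual code:)

import string

# Hypothetical helper, not part of the module above: complement table for
# upper- and lower-case bases; N and anything else map to themselves.
_COMPLEMENT = string.maketrans("ACGTacgt", "TGCAtgca")

def opposing_strand(read):
    # complement each base, then reverse to get the opposing-strand read
    return read.translate(_COMPLEMENT)[::-1]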

Post-answer conclusion: It turns out that using fileobj.read() and then manipulating the resulting string (string.replace(), etc.) took relatively little time and memory compared to the remainder of the program, and so I used that approach. Thanks everyone!
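
(For reference, a minimal sketch of what that final approach looks like; the function name is illustrative and the real code does more work per read:)

def parse_whole(file, size=8):
    file.readline()  # skip past the header
    # read the rest of the file in one go, drop newlines, normalize case
    whole = file.read().replace("\n", "").upper()
    for offset in xrange(len(whole) - size + 1):
        yield whole[offset:offset + size]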

eblume

4 Answers

4

Could you mmap the file and start pecking through it with a sliding window? I wrote a stupid little program that keeps a pretty small footprint:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
sarnold  20919  0.0  0.0  33036  4960 pts/2    R+   22:23   0:00 /usr/bin/python ./sliding_window.py

Working through a 636229-byte FASTA file (found via http://biostar.stackexchange.com/questions/1759) took 0.383 seconds.

#!/usr/bin/python

import mmap
import os

def parse(string, size):
    stride = 8
    start = string.find("\n")  # index of the newline that ends the header
    while start < size - stride:
        print string[start:start+stride]
        start += 1

fasta = open("small.fasta", 'r')
fasta_size = os.stat("small.fasta").st_size
fasta_map = mmap.mmap(fasta.fileno(), 0, mmap.MAP_PRIVATE, mmap.PROT_READ)
parse(fasta_map, fasta_size)
sarnold
  • Since the entire file is processed wouldn't you end up blowing out memory usage? – dietbuddha Jan 26 '11 at 07:32
  • @dietbuddha, that is a risk, on a 32-bit machine the largest file you could really process would be somewhere around two gigabytes, on a 64-bit machine you could of course go much larger. Hopefully the OS's page replacement algorithm can handle sequentially reading through an mmapped segment to help keep the resident set size small. – sarnold Jan 26 '11 at 07:40
  • @Fred Nurk, I forgot that I have a much larger fasta file sitting around: 242838416 bytes in 2m23.355s. 1.9 million strings of output every second. – sarnold Jan 26 '11 at 07:46
  • Much closer to a real test. Of course we can't compare wall clock times across different machines, but this data set does seem to be at the threshold of being large enough that it needs special consideration. Actually, I think this is the wrong way to look at this, and I've removed my earlier comment. – Fred Nurk Jan 26 '11 at 07:49
  • @Fred Nurk 30s for _his_ dataset sounds pretty impressive. :) I can't wait to find out what comes out best. :) – sarnold Jan 26 '11 at 07:52
  • I had misread the 30 s figure. Don't know what I was thinking. – Fred Nurk Jan 26 '11 at 07:53
  • Once you add logic to handle newlines and uppercasing as required, my tests (using chr21.fa from chromFa.zip – 48MB in size and just under 1M lines) show mmap as slightly slower than ordinary file handling. – Fred Nurk Jan 26 '11 at 11:04
  • @Fred Nurk: Ha! There's a lesson in there somewhere about trying to outsmart really smart people. :) Thanks! – sarnold Jan 26 '11 at 11:15
  • @sarnold: The code I used is in my answer, perhaps I've missed something. – Fred Nurk Jan 26 '11 at 11:44
  • Thanks for the suggestion, sarnold! I'm going to give this a shot but I don't have too high hopes for it as my understanding of mmap is that it really isn't much different at all than saying "".join(line for line in open(file)), except that the memory object gets stored in Shared Memory which has benefits for parallelization. I *am* parallelizing this problem, but across genome files to maximize IO rather than across a single genome file, so the benefit of mmap for IPC is lost. – eblume Jan 26 '11 at 17:23
3

I suspect the problem is that you have so much data stored in string format, which is really wasteful for your use case, that you're running out of real memory and thrashing swap. 128 GB should be enough to avoid this... :)

Since you've indicated in comments that you need to store additional information anyway, a separate class which references a parent string would be my choice. I ran a short test using chr21.fa from chromFa.zip from hg18; the file is about 48MB and just under 1M lines. I only have 1GB of memory here, so I simply discard the objects afterwards. This test thus won't show problems with fragmentation, cache, or related, but I think it should be a good starting point for measuring parsing throughput:

import mmap
import os
import time
import sys

class Subseq(object):
  __slots__ = ("parent", "offset", "length")

  def __init__(self, parent, offset, length):
    self.parent = parent
    self.offset = offset
    self.length = length

  # these are discussed in comments:
  def __str__(self):
    return self.parent[self.offset:self.offset + self.length]

  def __hash__(self):
    return hash(str(self))

  def __getitem__(self, index):
    # doesn't currently handle slicing
    assert 0 <= index < self.length
    return self.parent[self.offset + index]

  # other methods

def parse(file, size=8):
  file.readline()  # skip header
  whole = "".join(line.rstrip().upper() for line in file)
  for offset in xrange(0, len(whole) - size + 1):
    yield Subseq(whole, offset, size)

class Seq(object):
  __slots__ = ("value", "offset")
  def __init__(self, value, offset):
    self.value = value
    self.offset = offset

def parse_sep_str(file, size=8):
  file.readline()  # skip header
  whole = "".join(line.rstrip().upper() for line in file)
  for offset in xrange(0, len(whole) - size + 1):
    yield Seq(whole[offset:offset + size], offset)

def parse_plain_str(file, size=8):
  file.readline()  # skip header
  whole = "".join(line.rstrip().upper() for line in file)
  for offset in xrange(0, len(whole) - size + 1):
    yield whole[offset:offset+size]

def parse_tuple(file, size=8):
  file.readline()  # skip header
  whole = "".join(line.rstrip().upper() for line in file)
  for offset in xrange(0, len(whole) - size + 1):
    yield (whole, offset, size)

def parse_orig(file, size=8):
  file.readline() # skip header
  buffer = ''
  for line in file:
    buffer += line.rstrip().upper()
    while len(buffer) >= size:
      yield buffer[:size]
      buffer = buffer[1:]

def parse_os_read(file, size=8):
  file.readline()  # skip header
  file_size = os.fstat(file.fileno()).st_size
  whole = os.read(file.fileno(), file_size).replace("\n", "").upper()
  for offset in xrange(0, len(whole) - size + 1):
    yield whole[offset:offset+size]

def parse_mmap(file, size=8):
  file.readline()  # skip past the header
  buffer = ""
  for line in file:
    buffer += line
    if len(buffer) >= size:
      for start in xrange(0, len(buffer) - size + 1):
        yield buffer[start:start + size].upper()
      buffer = buffer[-(len(buffer) - size + 1):]
  for start in xrange(0, len(buffer) - size + 1):
    yield buffer[start:start + size]

def length(x):
  return sum(1 for _ in x)

def duration(secs):
  return "%dm %ds" % divmod(secs, 60)


def main(argv):
  tests = [parse, parse_sep_str, parse_tuple, parse_plain_str, parse_orig, parse_os_read]
  n = 0
  for fn in tests:
    n += 1
    with open(argv[1]) as f:
      start = time.time()
      length(fn(f))
      end = time.time()
      print "%d  %-20s  %s" % (n, fn.__name__, duration(end - start))

  fn = parse_mmap
  n += 1
  with open(argv[1]) as f:
    f = mmap.mmap(f.fileno(), 0, mmap.MAP_PRIVATE, mmap.PROT_READ)
    start = time.time()
    length(fn(f))
    end = time.time()
  print "%d  %-20s  %s" % (n, fn.__name__, duration(end - start))


if __name__ == "__main__":
  sys.exit(main(sys.argv))

1  parse                 1m 42s
2  parse_sep_str         1m 42s
3  parse_tuple           0m 29s
4  parse_plain_str       0m 36s
5  parse_orig            0m 45s
6  parse_os_read         0m 34s
7  parse_mmap            0m 37s

The first four are my code, while orig is yours and the last two are from other answers here.

User-defined objects are much more costly to create and collect than tuples or plain strings! This shouldn't be that surprising, but I had not realized it would make this much of a difference (compare #1 and #3, which really only differ in a user-defined class vs tuple). You said you want to store additional information, like offset, with the string anyway (as in the parse and parse_sep_str cases), so you might consider implementing that type in a C extension module. Look at Cython and related if you don't want to write C directly.

Case #1 and #2 being identical is expected: by pointing to a parent string, I was trying to save memory rather than processing time, but this test doesn't measure that.

Fred Nurk
  • Thank you for the answer, but I have to disagree on two points. The first is that this method uses very little memory at all. Assuming that the return result isn't kept for long (it is, but see point two), there's no objects that will stick around without being collected. It's close to fixed-memory. Point 2 is that I'm carefully monitoring RAM usage and I'm well under the wire. (The machine I'm running this on is a beast with 128 gigabytes of RAM, and the process only uses ~5G, and not from this code.) – eblume Jan 26 '11 at 04:13
  • @Fred: ...I'm confused about the purpose of instantiating a new object (Subseq) each and every time through this loop. Why not just `yield whole[offset: offset+size]` – Gerrat Jan 26 '11 at 04:20
  • @eblume: Fair enough; I was indeed thinking you kept the result, and that's where I meant to save memory rather than in the function itself. You still may want to consider this, as this uses less memory (so might fit better in cache) and will result in less heap fragmentation, etc. The genome is huge, but even so, I find 1.5 hrs long given your machine's specs; adapting something along the lines of what I have in the parse function may help with some of it. – Fred Nurk Jan 26 '11 at 04:25
  • @Fred: Now that I see your code example, I think I see a bit more what you are getting at. At some point though, I need to materialize the result to a string, as I am using those strings as a hash table lookup. Also, I added to my main post to explain that the 1.5 hour time includes other expensive operations. The function as listed purely above runs closer to ~30 minutes (I will find the exact time of that soon for benchmarking purposes.) – eblume Jan 26 '11 at 04:27
  • @Gerrat: When you get as many objects as will be here, a reference, an int, and a small int (size) will behave much better WRT memory use, fragmentation, and cache than a string of length size. – Fred Nurk Jan 26 '11 at 04:29
  • @Gerrat: That might work very well actually. I'll give that a shot shortly and if it works I'll gladly tip my hat to you and Fred. It's similar to another concept I was trying with mmap'ed files, but trades some upfront performance cost in stripping each line for not needing to weed out the newlines as you yield substrings. – eblume Jan 26 '11 at 04:30
  • @eblume: `def __str__(self): return self.parent[self.offset:self.offset + self.size]` takes care of the former problem (just use str(obj) as needed); you can also define a `__hash__` method: `def __hash__(self): return hash(str(self))` (using the `__str__` here; it may be worthwhile to cache this hash). – Fred Nurk Jan 26 '11 at 04:32
  • @Fred: (please don't think I'm trying to be argumentative here - I'm just trying to understand). At the end of the day, doesn't this __str__ method just basically do the exact same thing as my suggestion (except creating an extra object along the way)? – Gerrat Jan 26 '11 at 04:37
  • @Gerrat: I don't think you are being argumentative. :) If the string value is constantly needed around as a string object, yes, there isn't much point to this class. However, if it's only needed very temporarily (e.g. output or hashing – and the hashing could be much improved, hash(str(self)) is a shortcut), then that isn't so. I'll update to include a `__getitem__` that may help show that to you. I believe eblume's use falls into the second category. – Fred Nurk Jan 26 '11 at 04:41
  • It just so happens that the offset is itself a critical piece of information *and* I was already using a placeholder class instead of just returning the flat string to make manipulation (like computing the opposing strand read) easier. As such, this implementation was easy. Also, I didn't even know about hash() - I feel dumb. I'm running the test now. – eblume Jan 26 '11 at 04:57
  • @Fred: Thanks for the clarification...I'm still not "getting" it, but at this point it's academic (I think we'd just be playing Q&A ping pong). However eblume ends up implementing this; I think the real gain was your suggestion to read the file in one go. Regards. [EDIT: with eblume's latest comment about using a hash & instantiating a class anyway, it makes more sense now] – Gerrat Jan 26 '11 at 05:02
  • Unfortunately I'm not seeing an appreciable performance increase - I suspect that I am spending a proportionally greater amount of time computing the opposing strand. I'll spend time tomorrow profiling this new code to see where I can gain in performance. – eblume Jan 26 '11 at 05:31
  • @eblume: Yeah, I thought memory probably wasn't the issue once you said you have 128G... :) Given what I know of the problem, I still believe my answer would improve performance, but it's likely to be much less significant than I suspected. (BTW, unaccept this answer, unless you really do feel I've actually answered the question.) – Fred Nurk Jan 26 '11 at 05:40
  • Fred I can't believe the amount of work you put into this last night (well, night where I am.) Thanks so much! I will try and honor it by thoroughly testing and profiling each of the high-performing methods, though I do now (after a good sleep) predict that I was mistaken in trying to optimize my IO routine, and need to instead optimize the transformations I'm performing on the IO. – eblume Jan 26 '11 at 17:31
3

Some classic I/O-bound changes:

  • Use a lower-level read operation like os.read and read into a large fixed buffer (see the sketch below).
  • Use threading/multiprocessing where one reads and buffers and the other processes.
  • If you have multiple processors/machines, use multiprocessing/mq to divvy up processing across CPUs, a la map-reduce.

Using a lower-level read operation wouldn't be that much of a rewrite. The others would be pretty large rewrites.
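
For the first bullet, a rough sketch of what reading with os.read into a fixed-size buffer might look like (the function name and chunk size are illustrative; the newline stripping and uppercasing from the question are kept, and a small tail is carried between chunks so reads that straddle a chunk boundary aren't lost):

import os

def parse_chunked(path, size=8, chunk=1 << 20):
    fd = os.open(path, os.O_RDONLY)
    try:
        in_header = True
        tail = ""
        while True:
            data = os.read(fd, chunk)
            if not data:
                break
            if in_header:
                nl = data.find("\n")
                if nl < 0:
                    continue           # header spans more than one chunk
                data = data[nl + 1:]   # drop the header line
                in_header = False
            buf = tail + data.replace("\n", "").upper()
            for start in xrange(len(buf) - size + 1):
                yield buf[start:start + size]
            tail = buf[-(size - 1):]   # overlap carried to the next chunk
    finally:
        os.close(fd)

Parallelizing would then be a matter of handing each chromosome file (or each worker's share of the chunks) to a separate process.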

dietbuddha
  • os.read doesn't seem to make a significant difference (results in my answer). – Fred Nurk Jan 26 '11 at 11:43
  • Thanks for the answer. I am considering refactoring the problem for a map-reduce approach, but I must admit that my hesitance stems from not having seen the algorithm in action and thus being a bit naive about it. Could you recommend any particular literature? – eblume Jan 26 '11 at 17:33
  • I meant to add this as well - I actually already am parallelizing to a very high degree, which shows substantial IO benefits. After asking this question, I'm beginning to suspect that my problem is actually not IO at all, but the transformations I'm doing on that IO. – eblume Jan 26 '11 at 17:34
  • I'm marking this as accepted because after several weeks of testing (including testing other parts of the program I'm using this code for), the method we went with simply reads in the entire chromosome with fileobj.read(). os.read does outperform this, but only by a small margin. Others have offered the same answer to this question but I felt that this answer got to it in the most succinct way. Thanks! – eblume Mar 12 '11 at 00:02
1

I have a function that processes a text file, using buffered reads and writes and parallel computing with an asynchronous pool of worker processes. On an AMD machine with 2 cores and 8 GB of RAM running GNU/Linux, it can process 300,000 lines in less than 1 second, 1,000,000 lines in approximately 4 seconds, and approximately 4,500,000 lines (more than 220 MB) in approximately 20 seconds:

# -*- coding: utf-8 -*-
import sys
from multiprocessing import Pool

def process_file(f, fo="result.txt", fi=sys.argv[1]):
    fi = open(fi, "r", 4096)   # buffered input
    fo = open(fo, "w", 4096)   # buffered output
    b = []                     # current batch of lines
    x = 0                      # line counter
    result = None              # async result of the batch being processed
    pool = None
    for line in fi:
        b.append(line)
        x += 1
        if (x % 200000) == 0:  # dispatch a batch every 200,000 lines
            if pool is None:
                pool = Pool(processes=20)
            if result is None:
                result = pool.map_async(f, b)
            else:
                # wait for the previous batch, hand off the new one,
                # then write the previous batch's results
                presult = result.get()
                result = pool.map_async(f, b)
                for l in presult:
                    fo.write(l)
            b = []
    if result is not None:     # write the last dispatched batch
        for l in result.get():
            fo.write(l)
    if b:                      # process any leftover lines in this process
        for l in b:
            fo.write(f(l))
    fo.close()
    fi.close()

The first argument is a function that receives one line, processes it, and returns the result to be written to the output file; the next is the output file; and the last is the input file (you can omit the last argument if the input file is passed as the first command-line argument to your script, since it defaults to sys.argv[1]).
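
To illustrate that calling convention, a made-up per-line function (uppercase_line is just an example name) could be used like this:

def uppercase_line(line):
    # example per-line transformation; the returned string is written to the output file
    return line.upper()

if __name__ == "__main__":
    # read from sys.argv[1], write the processed lines to result.txt
    process_file(uppercase_line, "result.txt", sys.argv[1])

Note that the per-line function has to be defined at module level so the worker processes can pickle and import it.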