
Is it possible to recursively walk a directory using Scala continuations (introduced in 2.8)?

My directory contains millions of files, so I cannot use a Stream because I will get an out-of-memory error. I am trying to write an Actor dispatcher so that worker actors can process the files in parallel.

Does anyone have an example?

Ralph
  • Not an answer to your question, but I've happily used [Java 7's `java.nio.file.FileVisitor`](http://docs.oracle.com/javase/7/docs/api/java/nio/file/FileVisitor.html) in Scala to work with directories containing hundreds of thousands of files. It would probably fit reasonably well with `Actor`-based processing. – Travis Brown Mar 22 '12 at 17:35
  • Looks good, except I want to try to process at least a couple of files in parallel. – Ralph Mar 22 '12 at 17:36
  • I'm missing something re FileVisitor - how does it prevent parallel processing? Since each file requires a lot of processing, pass each file visited on to an Actor. – Ed Staub Mar 22 '12 at 17:45
  • @Ed Staub, right—that's what I was thinking. – Travis Brown Mar 22 '12 at 17:49
  • You're right. Didn't think of that. Too many Streams in my head right now :-). – Ralph Mar 22 '12 at 18:12
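
For reference, a minimal sketch of the `FileVisitor` approach suggested in the comments (the `walkAndDispatch` name and the plain callback standing in for an actor send are my own, not from the discussion):

```scala
import java.nio.file.{Files, FileVisitResult, Path, SimpleFileVisitor}
import java.nio.file.attribute.BasicFileAttributes

object WalkAndDispatch {
  // Walk the tree under `root` and hand each regular file to `dispatch`.
  // In the actor setup discussed above, `dispatch` would be something
  // like `worker ! ProcessFile(file)` (hypothetical message type).
  def walkAndDispatch(root: Path)(dispatch: Path => Unit): Unit = {
    Files.walkFileTree(root, new SimpleFileVisitor[Path] {
      override def visitFile(file: Path, attrs: BasicFileAttributes): FileVisitResult = {
        dispatch(file)
        FileVisitResult.CONTINUE
      }
    })
  }
}
```

Nothing is held in memory beyond the current path stack, so this stays constant-space even with millions of files.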

3 Answers


If you want to stick with Java 1.6 (as opposed to `FileVisitor` in 1.7), and you have subdirectories instead of all your millions of files in just one directory, you can roll your own iterator:

import java.io.File

class DirectoryIterator(f: File) extends Iterator[File] {
  // listFiles returns null for unreadable paths; treat that as empty
  private[this] val fs = Option(f.listFiles).getOrElse(Array[File]())
  private[this] var i = -1
  private[this] var recurse: DirectoryIterator = null
  def hasNext = {
    if (recurse != null && recurse.hasNext) true
    else i + 1 < fs.length
  }
  def next = {
    if (recurse != null && recurse.hasNext) recurse.next
    else if (i + 1 >= fs.length) {
      throw new java.util.NoSuchElementException("next on empty file iterator")
    }
    else {
      i += 1
      // yield the directory itself now; descend into it on subsequent calls
      if (fs(i).isDirectory) recurse = new DirectoryIterator(fs(i))
      fs(i)
    }
  }
}

This requires that your filesystem has no loops. If it does have loops, you need to keep track of the directories you hit in a set and avoid recursing them again. (If you don't even want to hit the files twice if they're linked from two different places, you then have to put everything into a set, and there's not much point using an iterator instead of just reading all the file info into memory.)
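
A minimal sketch of that loop guard (the recursive `walkNoLoops` helper and its name are my own, not part of the answer): remember each directory's canonical path in a set and skip any directory already visited. Files themselves are not deduplicated, matching the caveat above.

```scala
import java.io.File
import scala.collection.mutable

object WalkNoLoops {
  // Lazily walk `root`, yielding directories and files, but never
  // descending into the same directory twice (guards against link loops).
  def walkNoLoops(root: File): Iterator[File] = {
    val seen = mutable.Set.empty[String]
    def go(f: File): Iterator[File] =
      if (f.isDirectory) {
        // add returns false if the canonical path was already present
        if (!seen.add(f.getCanonicalPath)) Iterator.empty
        else Iterator(f) ++
          Option(f.listFiles).getOrElse(Array.empty[File]).iterator.flatMap(go)
      } else Iterator(f)
    go(root)
  }
}
```

The set grows with the number of directories, not files, so the memory cost is usually acceptable even for huge trees.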

Rex Kerr

This is more a questioning of the question than an answer.

If your process is I/O bound, parallel processing may not improve your throughput much. In many cases, it will make it worse, by causing disk head thrashing. Before you do much along this line, see how busy the disk is. If it's already busy most of the time with a single thread, at most one more thread will be useful - and even that may be counterproductive.

Ed Staub
  • Actually, I have to do a lot of processing on each file, so other threads should be able to do some useful work. – Ralph Mar 22 '12 at 17:34
  • I'd guess that anyone using the word "large" when talking about stuff stored on disk is probably talking about disk arrays with real controllers, so you're not just talking about the heads on one physical unit. But you're right about the basic point of looking at actual IO behavior before making software changes. – James Moore Nov 26 '12 at 18:25
  • @Ed Staub: And in this case, the files are on SSDs. Once each file is loaded, I have a lot of CPU intensive activity to perform on the contents. Parallel processing seems like exactly the right thing, considering I have many hyperthreaded cores. – Ralph Nov 27 '12 at 11:57
  • @Ralph - All I'm saying is that if and when your IO subsystem is saturated, adding more threads will hurt, not help. With hard disks, the effect is exacerbated by excessive seeking, but even without that, context-switching and lowered cache locality become a concern. _I didn't mean to say anything about the case where the IO subsystem is not saturated._ If you're compute-bound, by all means - go for it. – Ed Staub Nov 27 '12 at 14:54

What about using an Iterator?
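
One way to unpack this suggestion (my own expansion, not the answerer's code): the standard library's `Iterator` already composes lazily, so a recursive walker only a few lines long keeps at most one directory listing per tree level in memory.

```scala
import java.io.File

object LazyWalk {
  // Lazily yields `f` itself, then (if it is a directory) everything
  // beneath it. flatMap on Iterator is lazy, so nothing is listed
  // until the consumer actually pulls elements.
  def files(f: File): Iterator[File] =
    Iterator(f) ++
      Option(f.listFiles).getOrElse(Array.empty[File]).iterator.flatMap(files)
}
```

Unlike a `Stream`, an `Iterator` does not memoize what it has produced, so it avoids the out-of-memory problem described in the question.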

Daniel C. Sobral