3

I have written a short Scala program to read a large file, process it, and store the result in another file. The file contains about 60,000 lines of numbers, and I need to extract only the first number from every third line. Finally, I save those numbers to a different file. Although they are numbers, I treat them as strings throughout.

Here is the Scala code:

import scala.io.Source
import java.io.BufferedWriter
import java.io.FileWriter

object Analyze {
  def main(args: Array[String]) {
      val fname = "input.txt"

      val counters = Source.fromFile(fname).mkString.split("\\n").grouped(3)
        .map(_(2).split("\\s+")(0))

      val f = new BufferedWriter(new FileWriter("output1.txt"))
      f.write(counters.reduceLeft(_ + "\n" + _))
      f.close()
  }
}

I very much like Scala's capacity for powerful one-liners. The one-liner in the above code reads the entire text from the file, splits it into lines, groups the lines into groups of 3, and then takes the third line from each group, splits it, and takes the first number.

Here is the equivalent Python script:

fname = 'input.txt'

with file(fname) as f:
    lines = f.read().splitlines()
    linegroups = [lines[i:i+3] for i in range(0, len(lines), 3)]
    nums = [linegroup[2].split()[0] for linegroup in linegroups]

with file('output2.txt', 'w') as f:
    f.write('\n'.join(nums))    

Python is not capable of such one-liners. In the above script, the first line of code reads the file into a list of lines, the next one groups the lines into groups of 3, and the next one creates a list consisting of the first number of the last line of each group. It's very similar to the Scala code, except that it runs much, much faster.
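As a small worked example of the grouping step described above, here is a sketch that runs on an in-memory list rather than a file. The helper name `first_of_every_third` and the sample data are made up for illustration. It uses slice striding instead of building the groups explicitly, and it skips a trailing incomplete group, whereas the script above would raise an `IndexError` when the line count is not a multiple of 3.

```python
def first_of_every_third(lines):
    """Return the first whitespace-separated token of lines 3, 6, 9, ..."""
    # lines[2::3] picks indices 2, 5, 8, ... i.e. the last line of each
    # complete group of three; an incomplete final group is skipped.
    return [line.split()[0] for line in lines[2::3]]


sample = [
    "1 2 3",
    "4 5 6",
    "7 8 9",
    "10 11 12",
    "13 14 15",
    "16 17 18",
    "19 20 21",  # incomplete group, ignored
]
print("\n".join(first_of_every_third(sample)))
```

For the sample above this selects the tokens `7` and `16`.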

The Python script runs in a fraction of a second on my laptop, while the Scala program runs for 15 seconds! I commented out the code that saves the result to the file, and the duration fell to 5 seconds, which is still far too slow. I also don't understand why it takes so long to save the numbers to the file. When I dealt with larger files, the Python script ran for a few seconds, while the Scala program's running time was on the order of minutes, which made it unusable for analyzing my files.

I'd appreciate your advice on this issue. Thanks

Israel Unterman
  • 2
    I haven't tried it out, but it strikes me that `mkString.split("\\n")` cannot possibly be efficient. Try `Source.fromPath("myfile.txt").getLines()` or something equivalent instead. – notan3xit Aug 10 '11 at 21:13
  • 4
    Do you run this code as a script (with the `scala` command) or precompiled? I think you have this problem because Scala is not a sprinter, it's a long-distance runner (a lot of time is spent at startup: starting the JVM, checking the classpath and so on). – om-nom-nom Aug 10 '11 at 21:14
  • Hi, I tried replacing mkString.split("\\n") with getLines and it indeed improved the speed significantly - one second to collect the data. Still saving it takes about 5 seconds. I figured out that `reduceLeft` on the large amount of data ate the rest of the time, so I converted it to a simple `for (c – Israel Unterman Aug 10 '11 at 22:11

4 Answers

10

I took the liberty of cleaning up the code. This should run more efficiently by avoiding the initial `mkString`, not needing a regex to perform the whitespace split, and not pre-aggregating the results before writing them out. I also used methods that are more self-documenting:

val fname = "input.txt"
val lines = (Source fromFile fname).getLines
val counters =
  (lines grouped 3 withPartial false) map { _.last takeWhile (!_.isWhitespace) }

val f = new BufferedWriter(new FileWriter("output1.txt"))
f.write(counters mkString "\n")
f.close()

Warning, untested code

This is largely irrelevant though, depending on how you're profiling. If you're including the JVM startup time in your metrics, then all bets are off - no amount of code optimisation could help you there.

I'd normally also suggest pre-warming the JVM by running the routine a few hundred times before you time it, but this isn't so practical in the face of file I/O.

Kevin Wright
  • I think it would be better to use `counters foreach { f.println }` with a PrintWriter `f`. – incrop Aug 11 '11 at 07:46
  • @incrop - I did consider it, but that would leave an additional trailing `\n` as compared to the original... Probably the right thing to do, but I also wanted to avoid changing behaviour. – Kevin Wright Aug 11 '11 at 11:31
6

I timed the version provided by Kevin with minor edits (I removed `withPartial` since the Python version doesn't handle padding either):

import scala.io.Source
import java.io.BufferedWriter
import java.io.FileWriter

object A extends App {

  val fname = "input.txt"
  val lines = (Source fromFile fname).getLines
  val counters =
    (lines grouped 3) map { _.last takeWhile (!_.isWhitespace) }

  val f = new BufferedWriter(new FileWriter("output1.txt"))
  f.write(counters mkString "\n")
  f.close()
}

With 60,000 lines, here are the timings:

$ time scala -cp classes A

real    0m2.823s

$ time /usr/bin/python A.py

real    0m0.437s

With 900,000 lines:

$ time scala -cp classes A

real    0m5.226s

$ time /usr/bin/python A.py

real    0m3.319s

With 2,700,000 lines:

$ time scala -cp classes A

real    0m9.516s

$ time /usr/bin/python A.py

real    0m10.635s

The Scala version outperforms the Python version after that. So it seems that some of the long timing is due to JVM initialization and JIT compilation time.
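One way to separate startup cost from the work itself is to time inside the process, in addition to using `time` on the command line. The sketch below is a minimal illustration of that idea, not the measurement method used for the numbers above; the helper `timed`, the function `extract`, and the synthetic input are all assumptions made up for this example. In Scala, `System.nanoTime` could play the role that `time.perf_counter` plays here.

```python
import time


def timed(fn, *args):
    """Run fn(*args) once; return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def extract(lines):
    # Same extraction as in the question: first token of every third line.
    return [line.split()[0] for line in lines[2::3]]


# Synthetic input (made up for illustration): 60,000 lines of three numbers.
lines = ["%d %d %d" % (i, i + 1, i + 2) for i in range(60000)]
nums, seconds = timed(extract, lines)
print("extracted %d values in %.4f s" % (len(nums), seconds))
```

Comparing this in-process figure with the wall-clock time reported by `time` gives a rough idea of how much is spent on interpreter or VM startup rather than on the processing.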

huynhjl
  • 10
    Also fun fact: `awk 'NR % 3 == 0 { print $1}' input.txt > output.txt` outperforms Python and Scala on those same input files. – huynhjl Aug 11 '11 at 04:01
  • Oh yes... this is definitely a job for awk – Kevin Wright Aug 11 '11 at 11:36
  • Thank you very much for the time you spent performing the experiment. Indeed, `mkString` is far quicker than `reduce`. I tried to repeat the experiment, but unfortunately got different results. For 900,000 lines, Python finished in 4.2 seconds, while Scala finished in 54 seconds. Now, the code creating `counters` runs faster in Scala than in Python: 0.3 seconds vs 4 seconds. But the code that converts `counters` to lines of text took Python a fraction of a second (using the `join` function), while for Scala it took about 54 seconds. – Israel Unterman Aug 11 '11 at 20:14
  • I also compared Jython (python port to JVM) on the 60000 lines file, and it operated faster than Scala - 1.1 seconds vs 3.4 seconds. They both run on the JVM, so now I'm even more confused.. – Israel Unterman Aug 11 '11 at 20:15
  • @CodeChords man, Did you time your original version on 900,000 lines or Kevin's (or my variant)? What Scala and Java version are you using? It seems something is wrong with your setup. For comparison, I'm using 2.9.0.final on HotSpot(TM) Client VM, Java 1.6.0_25 on a Core 2 Duo 1.5GHz with 256MB of max heap size. – huynhjl Aug 12 '11 at 02:24
  • Hi, I use Scala 2.8.1 final with Java 1.6.0_24. I tested it again with your exact code. Again, 4 seconds vs 46 seconds. My laptop is a Core Duo 2.1 GHz (I don't know what the max heap size is). – Israel Unterman Aug 12 '11 at 11:15
  • @CodeChords man, 2.8.1 does not support the `App` trait. Did you replace `extends App` with nothing and wrap the code in a `def main(args: Array[String])`? If you replaced `App` with `Application`, the JVM won't JIT it. With that said, I can confirm that 2.8.1 is much slower than 2.9.0 on this test: 16s versus 5s on my 900k lines test. So something got optimized since 2.8.1... – huynhjl Aug 12 '11 at 15:00
  • Actually yes, I wrapped your code with `def main(...)`, and inside it I used your exact code. I updated to Scala 2.9.0.1, and now the situation is quite different. Both Scala and Python run in about the same time. For 900,000 lines, about 4.5 seconds. For 1,800,000 lines, about 9.5 seconds. Python is still a little bit ahead, but this can be considered negligible. I think the 2.8 version of Scala was slow in the iterators part. The time-consuming task was converting the iterator to a string by `mkString`, since all previous stages produced an iterator rather than a list. – Israel Unterman Aug 13 '11 at 22:19
  • Hi again, now I removed `def main` and used `object Analyze extends App` and the resulting time decreased a little below Python's - Python 9.25 sec, Scala - 9.13 sec. So the 2.9.0 version does it, and also it turns out that `extends App` somehow works better than `def main`, at least for short applications like this one. Probably relating to the JVM initialization internals. - Thanks for your help! – Israel Unterman Aug 13 '11 at 22:23
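The awk one-liner mentioned in the comments above streams the file instead of loading it whole. A rough Python counterpart, as a sketch only, could use `itertools.islice` over the open file. The function name `third_line_first_fields` and the sample data are hypothetical, and an in-memory `io.StringIO` stands in for the real input file.

```python
import io
from itertools import islice


def third_line_first_fields(stream):
    """Yield the first field of lines 3, 6, 9, ... without loading the file."""
    # islice(stream, 2, None, 3) yields items at indices 2, 5, 8, ...
    for line in islice(stream, 2, None, 3):
        fields = line.split()
        # awk prints an empty line when $1 is missing; mirror that here.
        yield fields[0] if fields else ""


# In-memory stand-in for an open file (sample data is made up).
demo = io.StringIO("a b\nc d\ne f\ng h\ni j\nk l\n")
print("\n".join(third_line_first_fields(demo)))
```

Because it is a generator, only one line needs to be held in memory at a time, which is the main design difference from the `f.read().splitlines()` approach in the question.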
0

Try this code to write to the file:

val f = new java.io.PrintWriter(new java.io.File("output1.txt"))
f.write(counters.reduce(_ + "\n" + _))
f.close()

Much faster.

0

In addition to @notan3xit's answer, you could also write counters to files without concatenating them first:

  val f = new BufferedWriter(new FileWriter("output1.txt"))
  // counters is an Iterator, which has no head/tail; consume it with next()
  if (counters.hasNext) f.write(counters.next())
  counters.foreach(c => f.write("\n" + c))
  f.close()

Though you could do the same in Python...
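A sketch of what "the same in Python" might look like: write the first item, then each remaining item preceded by a newline, so nothing is concatenated up front and there is no trailing newline. The helper name `write_joined` is made up for illustration, and `io.StringIO` stands in for the real output file.

```python
import io


def write_joined(out, items):
    """Write items separated by newlines, with no trailing newline."""
    it = iter(items)
    first = next(it, None)
    if first is None:
        return  # nothing to write
    out.write(first)
    for item in it:
        out.write("\n")
        out.write(item)


buf = io.StringIO()  # stands in for open('output2.txt', 'w')
write_joined(buf, ["7", "16", "25"])
print(repr(buf.getvalue()))
```

Since `items` is consumed through an iterator, this also works when the values come from a generator rather than a list.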

Alexey Romanov
  • Hi, thanks for the suggestion. I tried something similar to this, which I described in my comment to @notan3xit. Python's bytecode is much less efficient than Java's, because Python is a dynamic language with no compile-time information about types, and nevertheless the script ran faster. I also don't understand why `reduceLeft` takes so long. I thought maybe Python's `join` function is C-optimized, so I replaced it with Python's `reduce`, and it still took a fraction of a second. Perhaps I need to rewrite the application in Java to see whether this is a language or a JVM issue. – Israel Unterman Aug 10 '11 at 22:15
  • Yes, this seems quite weird. You haven't answered @om-nom-nom's comment: are you compiling Scala with `scalac` or running as a script with `scala`? – Alexey Romanov Aug 10 '11 at 23:25
  • Also, `reduceLeft` is concatenating strings in a loop, and creating a lot of garbage, while `join` doesn't need to; but of course the same should apply to `reduce`... – Alexey Romanov Aug 10 '11 at 23:31
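To make the difference discussed in these comments concrete, here is a sketch of the two ways of building the output string. The function names are hypothetical. It only checks that both approaches produce the same text; actual timings depend on the runtime and its optimisations, so no performance figures are claimed here.

```python
from functools import reduce


def concat_reduce(items):
    # Each step builds a new, longer string from the previous one, so the
    # total amount of copying can grow quadratically with the output size.
    return reduce(lambda acc, s: acc + "\n" + s, items)


def concat_join(items):
    # join can compute the final length up front and copy each piece once.
    return "\n".join(items)


items = [str(i) for i in range(1000)]
print(concat_reduce(items) == concat_join(items))
```

The pairwise version corresponds to `reduceLeft(_ + "\n" + _)` in the question, and the `join` version corresponds to `mkString "\n"` in the answers.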