I have written a short Scala program to read a large file, process it and store the result in another file. The file contains about 60000 lines of numbers, and I need to extract from each third line only the first number. Eventually I save those numbers to a different file. Although numbers, I treat them as strings all along the way.
Here is the Scala code:
import scala.io.Source
import java.io.BufferedWriter
import java.io.FileWriter
object Analyze {
def main(args: Array[String]) {
val fname = "input.txt"
val counters = Source.fromFile(fname).mkString.split("\\n").grouped(3)
.map(_(2).split("\\s+")(0))
val f = new BufferedWriter(new FileWriter("output1.txt"))
f.write(counters.reduceLeft(_ + "\n" + _))
f.close()
}
}
I like very much the Scala's capability of powerful one liners. The one-liner in the above code reads the entire text from the file, splits it into lines, groups the lines to groups of 3 lines, and then takes from each group the third line, splits it and takes the first number.
Here is the equivalient python script:
fname = 'input.txt'
with file(fname) as f:
lines = f.read().splitlines()
linegroups = [lines[i:i+3] for i in range(0, len(lines), 3)]
nums = [linegroup[2].split()[0] for linegroup in linegroups]
with file('output2.txt', 'w') as f:
f.write('\n'.join(nums))
Python is not capable of such one liners. In the above script the first line of code reads the file into a list of lines, the next one groups the lines into groups of 3, and the next one creates a list consisting of the first number of every last line of each group. It's very similar to the Scala code, only it runs much much faster.
The python script runs in a fraction of a second on my laptop, while the Scala program runs for 15 seconds! I commented out the code that saves the result to the file, and the duration fell to 5 seconds, which is still way too slow. Also I don't understand why it takes so long to save the numbers to the file. When I dealt with larger files, the python script ran for a few seconds, while the Scala program running time was in order of minutes, which I couldn't use to analyze my files.
I'll appreciate you advice for this issue. Thanks