In Spark, I have an RDD that contains millions of paths to local files (we have a shared file system, so they appear local). In Scala, how would I create an RDD that consists of all the lines in each of those files?
I tried doing something like this:
paths.flatMap(path => sc.textFile(path))
But that didn't work. I also tried something like this:
paths.flatMap(path =>
  scala.io.Source.fromInputStream(new java.io.FileInputStream(path)).getLines
)
That worked when running locally but failed when running on multiple machines, with this error:
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
Any pointers would be appreciated.
(Most solutions suggested so far involve passing the list of files to sc.textFile all at once, which is not possible since the list can be very large. A typical use case right now would yield 20M paths, which doesn't fit in a single Java String.)
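One thing that may be worth ruling out: a MalformedInputException is thrown by the charset decoder, and scala.io.Source uses the JVM's default charset unless a codec is given, so the cluster machines may simply be decoding with a different default than the local one. The following is only a minimal sketch under that assumption. The helper name readLines is hypothetical, and the choice of UTF-8 with replacement of malformed bytes is a guess about the data, not something established above.

```scala
import java.nio.charset.{CodingErrorAction, StandardCharsets}
import scala.io.{Codec, Source}

// Hypothetical helper: reads all lines of one file with an explicit codec,
// so the result does not depend on each machine's default charset.
def readLines(path: String): List[String] = {
  // Assumption: the files are mostly UTF-8. Malformed or unmappable bytes
  // are replaced with U+FFFD instead of raising MalformedInputException.
  val codec = Codec(StandardCharsets.UTF_8)
    .onMalformedInput(CodingErrorAction.REPLACE)
    .onUnmappableCharacter(CodingErrorAction.REPLACE)
  val source = Source.fromFile(path)(codec)
  try source.getLines().toList // materialize the lines before closing the file
  finally source.close()
}

// Intended use on the RDD of paths (assuming paths: RDD[String]):
//   val lines = paths.flatMap(readLines)
```

Because the file is opened inside the function passed to flatMap, each path is read on whichever executor processes it, and the full list of paths never has to be joined into one string. The sketch holds one file's lines in memory at a time, which assumes individual files are reasonably small.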