
I'm trying to output an RDD that has ~5,000,000 rows as a text file using PySpark. It's taking a long time, so what are some tips on how to make the .saveAsTextFile() faster?

The rows are 3 columns each, and I'm saving to HDFS.
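Simplified, the save looks something like this (the column values and the output path here are just placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="save-rdd")

    # Placeholder data: ~5,000,000 rows of 3 columns each.
    rdd = sc.parallelize(range(5000000)).map(lambda i: (i, i * 2, i * 3))

    # Format each row as a line of text and write to HDFS.
    lines = rdd.map(lambda row: ",".join(str(c) for c in row))
    lines.saveAsTextFile("hdfs:///user/me/output")  # this step is slow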

cshin9

1 Answer


Without knowing how long "a long time" is, how large each individual row is, or the dimensions of the cluster, I can only make a couple of guesses.

First, in general, Spark writes one output file per partition. If your RDD has only one (or a few) partitions, writing to HDFS or GCS will appear slow. Consider repartitioning before the output step (repartitioning also takes time; if you can fold it into the pipeline so that it does useful work, all the better). You can always call RDD#getNumPartitions to see how many partitions an RDD has and repartition intelligently if needed.
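For example, a minimal sketch (the input/output paths and the partition count of 64 are assumptions you would tune to your data and cluster):

    from pyspark import SparkContext

    sc = SparkContext(appName="repartition-before-save")
    rdd = sc.textFile("hdfs:///user/me/input").map(lambda line: line.split(","))

    # See how many output files a save would currently produce.
    print(rdd.getNumPartitions())

    # If there are only one or a few partitions, spread the work across
    # more tasks so the write happens in parallel; 64 is just an example.
    rdd.repartition(64) \
       .map(lambda cols: ",".join(cols)) \
       .saveAsTextFile("hdfs:///user/me/output")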

The second possibility I can think of is that your HDFS could be under-provisioned (e.g., out of space) or having issues that cause errors which are not being surfaced very well. I would expect any HDFS write errors to be visible to the driver, but they may be buried in the container logs.

Angus Davis