
I want to process several independent CSV files of similar size (~100 MB each) in parallel with PySpark. I'm running PySpark on a single machine with spark.driver.memory 2g, spark.executor.memory 2g, and master local[4].
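
For reference, a minimal sketch of how I build the session (illustrative only; the app name and exact builder calls are placeholders for my actual setup):

    from pyspark.sql import SparkSession

    # Driver memory has to be set before the JVM starts, so it is passed
    # here rather than changed on an already-running session.
    spark = (
        SparkSession.builder
        .master("local[4]")
        .config("spark.driver.memory", "2g")
        .config("spark.executor.memory", "2g")
        .appName("moving-avg-csv")  # placeholder name
        .getOrCreate()
    )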

File content: type (has the same value within each csv), timestamp, price

First I tested it on one csv:

    logData = spark.read.csv("TypeA.csv", header=False,schema=schema)
    // Compute moving avg.
    w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long")).rangeBetween(-24*7*3600 * i, 0))
    logData = logData.withColumn("moving_avg", f.avg("price").over(w))
    // Some other simple operations... No Agg, no sort
    logData.write.parquet("res.pr")

This works great. However, when I then tried to run it on 23 files:

    logData = spark.read.csv("TypeB*.csv", header=False, schema=schema)

the job fails with an OOM (Java heap space) error. Increasing the driver memory to 10g helps (5g and 7g still failed), but this doesn't scale, since I eventually need to process 600 files.

The question is: why doesn't PySpark spill to disk to prevent the OOM, and how can I hint Spark to do so?
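
To illustrate the kind of hint I have in mind (purely illustrative, I haven't verified that this is the right knob here), something like an explicit disk-backed storage level:

    from pyspark import StorageLevel

    # Allow partitions that don't fit in memory to spill to disk.
    logData = logData.persist(StorageLevel.MEMORY_AND_DISK)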

An alternative solution, I guess, would be to read the files sequentially and process them one by one with Spark, roughly as sketched below (but this is potentially slower?).
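
A rough sketch of that sequential variant (assuming the per-file logic from above; the glob pattern and output naming are just placeholders):

    import glob

    # Process each CSV on its own so only one file's data is held at a time.
    for path in glob.glob("TypeB*.csv"):
        df = spark.read.csv(path, header=False, schema=schema)
        df = df.withColumn("moving_avg", f.avg("price").over(w))
        # ... same simple per-file operations as above ...
        df.write.parquet(path.replace(".csv", ".pr"))  # placeholder output path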

Le_Coeur
  • How much RAM does the computer have in total? You probably need to lower the memory allocated to Spark. – Lars Skaug Aug 23 '20 at 19:47
  • 32g RAM in total – Le_Coeur Aug 23 '20 at 19:57
  • Have a look at this post. Make sure you set the memory before you start your application. spark.executor.memory does not apply. https://stackoverflow.com/questions/26562033/how-to-set-apache-spark-executor-memory – Lars Skaug Aug 23 '20 at 20:03
  • I don't have a problem with setting the driver memory - it works. My issue is that PySpark goes OOM and doesn't spill data to disk instead, even though it clearly doesn't need to hold everything in memory for processing, as the computations are independent. – Le_Coeur Aug 23 '20 at 22:52

0 Answers