I want to process several independent CSV files of similar size (~100 MB each) in parallel with PySpark. I'm running PySpark on a single machine with spark.driver.memory 2g, spark.executor.memory 2g, and master local[4].
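For reference, the session is created roughly like this (just a sketch of my setup with the settings above; configuring the memory options through the builder is an assumption about how they are applied):

from pyspark.sql import SparkSession

# Rough sketch of the local setup described above.
spark = (SparkSession.builder
         .master("local[4]")
         .config("spark.driver.memory", "2g")
         .config("spark.executor.memory", "2g")
         .getOrCreate())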
File content: type (which has the same value within each CSV), timestamp, price.
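The schema referenced in the snippets below looks roughly like this (a sketch; the exact column types are assumptions):

from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

# Assumed column names and types for the CSV files.
schema = StructType([
    StructField("type", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("price", DoubleType(), True),
])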
First I tested the job on a single CSV:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

logData = spark.read.csv("TypeA.csv", header=False, schema=schema)
# Compute the moving average over the preceding i weeks.
w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long"))
     .rangeBetween(-24 * 7 * 3600 * i, 0))
logData = logData.withColumn("moving_avg", f.avg("price").over(w))
# Some other simple operations... no aggregation, no sort.
logData.write.parquet("res.pr")
This works great. However, when I then ran it on 23 files:
logData = spark.read.csv('TypeB*.csv', header=False, schema=schema)
the job fails with a Java heap OOM error. Increasing the memory to 10g helps (5g and 7g both failed), but this doesn't scale, since I eventually need to process 600 files.
My question is: why doesn't PySpark spill to disk to prevent the OOM, and how can I hint Spark to do so?
An alternative solution, I guess, would be to read the files sequentially and process them one by one with Spark (but that is potentially slower?). Something like the sketch below is what I have in mind.
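Just a sketch of the sequential variant; the glob pattern, the fixed one-week window (instead of the i-week one above), and appending everything into a single Parquet directory are assumptions:

import glob
from pyspark.sql import functions as f
from pyspark.sql.window import Window

for path in sorted(glob.glob("TypeB*.csv")):
    df = spark.read.csv(path, header=False, schema=schema)
    w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long"))
         .rangeBetween(-24 * 7 * 3600, 0))  # one-week window, as an example
    df = df.withColumn("moving_avg", f.avg("price").over(w))
    df.write.mode("append").parquet("res.pr")  # one output dir; per-file dirs would also work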