I want to process several independent CSV files of similar size (~100 MB each) in parallel with PySpark. I'm running PySpark on a single machine with spark.driver.memory 2g, spark.executor.memory 2g, and master local[4].
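For reference, the session is created roughly like this (just a sketch of my setup with the settings above; configuring the memory options through the builder is an assumption about how they are applied):

from pyspark.sql import SparkSession

# Rough sketch of the local setup described above.
spark = (SparkSession.builder
         .master("local[4]")
         .config("spark.driver.memory", "2g")
         .config("spark.executor.memory", "2g")
         .getOrCreate())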
File content: type (which has the same value within each CSV), timestamp, price.
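The schema referenced in the snippets below looks roughly like this (a sketch; the exact column types are assumptions):

from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

# Assumed column names and types for the CSV files.
schema = StructType([
    StructField("type", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("price", DoubleType(), True),
])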
First I tested the job on a single CSV:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

logData = spark.read.csv("TypeA.csv", header=False, schema=schema)
# Compute the moving average over the preceding i weeks.
w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long"))
     .rangeBetween(-24 * 7 * 3600 * i, 0))
logData = logData.withColumn("moving_avg", f.avg("price").over(w))
# Some other simple operations... no aggregation, no sort.
logData.write.parquet("res.pr")
This works great. However, when I then ran it on 23 files:
logData = spark.read.csv('TypeB*.csv', header=False, schema=schema)
the job fails with a Java heap OOM error. Increasing the memory to 10g helps (5g and 7g both failed), but this doesn't scale, since I eventually need to process 600 files.
My question is: why doesn't PySpark spill to disk to prevent the OOM, and how can I hint Spark to do so?
An alternative solution, I guess, would be to read the files sequentially and process them one by one with Spark (but that is potentially slower?). Something like the sketch below is what I have in mind.
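Just a sketch of the sequential variant; the glob pattern, the fixed one-week window (instead of the i-week one above), and appending everything into a single Parquet directory are assumptions:

import glob
from pyspark.sql import functions as f
from pyspark.sql.window import Window

for path in sorted(glob.glob("TypeB*.csv")):
    df = spark.read.csv(path, header=False, schema=schema)
    w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long"))
         .rangeBetween(-24 * 7 * 3600, 0))  # one-week window, as an example
    df = df.withColumn("moving_avg", f.avg("price").over(w))
    df.write.mode("append").parquet("res.pr")  # one output dir; per-file dirs would also work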