We run a Spark Streaming job on AWS EMR. The job runs stably for anywhere between 10 and 14 hours, then crashes with no discernible errors in stderr, stdout, or CloudWatch logs. After this crash, any attempt to restart the job fails immediately with "'Cannot allocate memory' (errno=12)" (full message).
Investigation with both CloudWatch metrics and Ganglia shows that `driver.jvm.heap.used` grows steadily over time.
Both of these observations led me to believe that some long-running component of Spark (i.e. above Job-level) was failing to free memory correctly. This is supported by the fact that restarting the hadoop-yarn-resourcemanager (as per here) causes heap usage to drop to "fresh cluster" levels.
If that assumption is correct, what would cause Yarn to keep consuming more and more memory? (If not, how could I falsify it?)
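One way I could try to falsify the assumption is to watch the ResourceManager JVM's own heap directly, rather than the aggregate driver metric. The sketch below polls `jstat -gcutil` (a standard JDK tool) for old-generation occupancy; the pid-discovery command, sampling interval, and function names are my own assumptions, not anything from our actual setup:

```python
"""Sketch: sample old-gen heap occupancy of the YARN ResourceManager JVM
via `jstat -gcutil`, to see whether the RM process itself (rather than the
Spark driver) is the one whose heap grows over time."""
import subprocess
import time


def parse_gcutil(header_line, value_line):
    """Map `jstat -gcutil` header columns (S0, S1, E, O, ...) to floats."""
    headers = header_line.split()
    values = [float(v) for v in value_line.split()]
    return dict(zip(headers, values))


def old_gen_pct(pid):
    """Return the old-generation occupancy percentage for one JVM."""
    out = subprocess.check_output(["jstat", "-gcutil", str(pid)], text=True)
    header, values = out.strip().splitlines()[:2]
    return parse_gcutil(header, values)["O"]


def monitor(pid, samples=12, interval=300):
    """Print old-gen occupancy every `interval` seconds (assumed cadence)."""
    for _ in range(samples):
        print(time.ctime(), old_gen_pct(pid))
        time.sleep(interval)

# Example (pid discovery is an assumption; adjust for your cluster):
#   pid = int(subprocess.check_output(
#       ["pgrep", "-f", "org.apache.hadoop.yarn.server.resourcemanager"],
#       text=True).split()[0])
#   monitor(pid)
```

If the RM's old gen climbs in step with the observed driver heap growth, that would support the "long-running component above job level" theory; if it stays flat, the leak is more likely in the driver process itself.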
- I see from here that `spark.streaming.unpersist` defaults to `true` (although I've tried adding a manual `rdd.unpersist()` at the end of my job anyway, just to check whether that has any effect - it hasn't been running long enough to tell definitively yet).
- Here, the comment on `spark.yarn.am.extraJavaOptions` suggests that, when running in yarn-client mode (which we are), `spark.yarn.am.memory` sets the maximum Yarn Application Master heap memory usage. This value is not overridden in our job (so it should be at the default of 512m), yet both CloudWatch and Ganglia clearly show driver heap usage in the gigabytes.