
We run a Spark Streaming job on AWS EMR. The job runs stably for anywhere between 10 and 14 hours, then crashes with no discernible errors in stderr, stdout, or CloudWatch logs. After this crash, any attempt to restart the job immediately fails with "'Cannot allocate memory' (errno=12)" (full message).

Investigation with both CloudWatch metrics and Ganglia shows that driver.jvm.heap.used grows steadily over time.

Both of these observations led me to believe that some long-running component of Spark (i.e. something above the Job level) was failing to free memory correctly. This is supported by the fact that restarting hadoop-yarn-resourcemanager (as per here) drops heap usage back to "fresh cluster" levels.

If that assumption is indeed correct - what would cause YARN to keep consuming more and more memory? (If not - how could I falsify it?)

  • I see from here that spark.streaming.unpersist defaults to true (although I've tried adding a manual rdd.unpersist() at the end of my job anyway, just to check whether that has any effect - it hasn't been running long enough to tell definitively yet; see the sketch just after this list)
  • Here, the comment on spark.yarn.am.extraJavaOptions suggests that, when running in yarn-client mode (which we are), spark.yarn.am.memory sets the maximum YARN Application Master heap memory usage. This value is not overridden in our job (so it should be at the default of 512m), but both CloudWatch and Ganglia clearly show driver heap usage in the gigabytes.
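For reference, a minimal sketch of what that manual unpersist looks like in a Streaming job - the input source, batch interval, and per-batch logic here are placeholders, not taken from our actual job:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnpersistSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-unpersist-sketch")
    // Batch interval is illustrative only.
    val ssc = new StreamingContext(conf, Seconds(30))

    // Placeholder input source; the real job's source isn't shown in the question.
    val stream = ssc.socketTextStream("localhost", 9999)

    stream.foreachRDD { rdd =>
      rdd.cache()
      // Stand-in for the real per-batch processing.
      println(s"processed ${rdd.count()} records")
      // Explicit release at the end of each batch, on top of
      // spark.streaming.unpersist=true (the default).
      rdd.unpersist()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```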
    Are you doing stateful streaming, by chance? That could cause ever-increasing memory use if your state (which is kept in memory) isn't carefully managed. I can tell you that we are running streaming jobs on EMR as well and they run without memory problems, so I doubt it's a general problem you are seeing... – Glennie Helles Sindholt Oct 29 '16 at 06:53

1 Answer


It turns out that the default SparkUI values here were much larger than our system could handle. After setting them down to 1/20th of the default values, the system has been running stably for 24 hours with no increase in heap usage over that time.

For clarity, the values that were edited were as follows (one way to set them is sketched after the list):

* spark.ui.retainedJobs=50
* spark.ui.retainedStages=50
* spark.ui.retainedTasks=500
* spark.worker.ui.retainedExecutors=50
* spark.worker.ui.retainedDrivers=50
* spark.sql.ui.retainedExecutions=50
* spark.streaming.ui.retainedBatches=50
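For reference, one way to apply these is on the SparkConf at job startup (just a sketch; on EMR they could equally go into spark-defaults via a configuration classification, or be passed as --conf flags to spark-submit):

```scala
import org.apache.spark.SparkConf

// Sketch: the trimmed-down UI retention settings from the list above,
// set programmatically before the SparkContext/StreamingContext is created.
val conf = new SparkConf()
  .setAppName("streaming-job") // placeholder app name
  .set("spark.ui.retainedJobs", "50")
  .set("spark.ui.retainedStages", "50")
  .set("spark.ui.retainedTasks", "500")
  .set("spark.worker.ui.retainedExecutors", "50")
  .set("spark.worker.ui.retainedDrivers", "50")
  .set("spark.sql.ui.retainedExecutions", "50")
  .set("spark.streaming.ui.retainedBatches", "50")
```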
  • Could you add more details, please? –  Oct 31 '16 at 21:51
  • Certainly - what details would you like? – scubbo Nov 01 '16 at 19:04
  • I think a list of the configuration options you modified would be in order. It looks like a sneaky issue and others could benefit from a clean guide instead of trial and error :) Thanks! –  Nov 01 '16 at 19:06
  • This blog post develops a bit more about the same issue: http://www.xfittingthedata.com/index.php/2018/02/18/saving-memory-from-spark-ui/ – Davide Mandrini Dec 21 '18 at 08:47
  • @Glennie Helles Sindholt Hi. Question: does the answer relate to the question, in your view? spark.ui.retainedJobs=50 etc. are for the App itself. Maybe I am missing something. – thebluephantom Oct 21 '20 at 19:49
  • Per the docs, spark.ui.retainedJobs (default 1000): "How many jobs the Spark UI and status APIs remember before garbage collecting. This is a target maximum, and fewer elements may be retained in some circumstances." (since 1.2) What system do you mean, in fact? – thebluephantom Oct 21 '20 at 19:59