20

I am trying to load a database with 1 TB of data into Spark on AWS using the latest EMR release. The running time is so long that it doesn't finish even in 6 hours; after running for about 6 h 30 m, I get an error saying that the container was released on a lost node, and then the job fails. The logs look like this:

16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node

I am pretty sure that my network settings work, because I have run the same script in the same environment on a much smaller table.

Also, I am aware that somebody posted a question about the same issue 6 months ago (spark-job-error-yarnallocator-exit-status-100-diagnostics-container-released), but I still have to ask because nobody answered it.

John Zeng
  • I am hitting the same issue. No answers :( – clay Sep 05 '16 at 17:08
  • @clay Just my guess: a spot instance is taken back when the market price rises above your bid, and then the node is lost. So if you are running a long job, don't use spot instances. I found a way to split my data set into many small tasks, each of which runs for only about 5 minutes and saves a reduced result to S3; after all of them finish, I read the results back from S3 and do another reduce, so I can avoid a single long-running job. – John Zeng Sep 06 '16 at 01:07
  • I'm hitting this issue as well :/ – Prayag Dec 27 '16 at 19:03
  • Similar issue here (with a big self-join however). Been hitting it for a bit now. Logs on Resource Manager just say the container was lost. There is no indication as to why. Memory might be an issue. – Navneet Jan 03 '17 at 20:32
  • Can you share the logs of the node? – ssedano Jun 23 '17 at 09:38
  • @ssedano No... the instance was deleted a long time ago, and the logs are so large that you wouldn't want to read them. – John Zeng Jun 26 '17 at 03:37
  • Ran into the same issue. We also used spot instances, but I'm not sure that's the root cause, because we used a pretty high bid price and never lost any instances during the job execution. – seiya Dec 21 '17 at 20:24
  • Where did you save your 1 TB of data? Is it on S3 or HDFS? – SharpLu Jul 26 '19 at 08:44

5 Answers

11

It looks like other people are hitting the same issue, so I am posting an answer instead of writing a comment. I am not sure this will solve the issue, but it should give you an idea.

If you use spot instances, you should know that a spot instance will be shut down when the market price rises above your bid, and then you will hit this issue, even if the spot instance is only a slave (task) node. So my solution is to not use any spot instances for long-running jobs.

Another idea is to slice the job into many independent steps and save the result of each step as a file on S3. If an error happens, you can restart from the failed step using the cached files.
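As a rough sketch of that idea (not my exact code), assuming a PySpark job and a hypothetical bucket named my-bucket:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stepwise-load").getOrCreate()

    # Hypothetical paths; replace with your own bucket and prefixes.
    STEP1_OUT = "s3://my-bucket/intermediate/step1"

    # Step 1: partial aggregation, persisted to S3 so it survives lost nodes.
    raw = spark.read.parquet("s3://my-bucket/input/")
    raw.groupBy("key").count().write.mode("overwrite").parquet(STEP1_OUT)

    # Step 2: can be rerun on its own, starting from the cached step-1 result.
    step1 = spark.read.parquet(STEP1_OUT)
    step1.filter("count > 10").write.mode("overwrite").parquet("s3://my-bucket/output/")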

John Zeng
  • So based on your solutions, the first option is to get dedicated CORE nodes instead of SPOT task nodes, and the second option is to break the job up into multiple jobs and run them manually in a progressive manner? – thentangler Apr 14 '21 at 14:09
3

Is it dynamic allocation of memory? I had a similar issue and fixed it by going with static allocation: calculating executor memory, executor cores, and the number of executors myself. Try static allocation for huge workloads in Spark.
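For example, a sketch of what static allocation could look like when building a PySpark session; the sizing numbers below are placeholders and have to be calculated from your instance type (cores and memory per node, minus YARN overhead):

    from pyspark.sql import SparkSession

    # Example sizing only -- derive instances/cores/memory from your own nodes.
    spark = (
        SparkSession.builder
        .appName("static-allocation")
        .config("spark.dynamicAllocation.enabled", "false")
        .config("spark.executor.instances", "40")
        .config("spark.executor.cores", "5")
        .config("spark.executor.memory", "18g")
        .config("spark.yarn.executor.memoryOverhead", "2048")  # in MiB; Spark 1.x/2.x key
        .getOrCreate()
    )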

  • Does unpersisting the unused DataFrames help in this case? –  Oct 20 '18 at 09:46
  • You can try. Are you on EMR or a Cloudera stack? Also check whether the YARN scheduler is set to fair or capacity for resource management, and then try static memory allocation instead of dynamic by passing in the number of executors, etc. – sri hari kali charan Tummala Oct 22 '18 at 20:42
  • I am using EMR, and I don't see any difference in dynamic memory behaviour after using unpersist either. –  Oct 23 '18 at 06:43
  • What I am asking you to do is use static memory allocation by turning dynamic allocation off and passing in the number of executors, executor memory, and executor cores, calculated yourself rather than left to Spark's dynamic memory allocation. – sri hari kali charan Tummala Oct 23 '18 at 16:33
2

This means your YARN container is down. To debug what happened, you must read the YARN logs: use the official CLI, yarn logs -applicationId <application ID>, or feel free to use and contribute to my project https://github.com/ebuildy/yoga, a YARN viewer web app.

You should see a lot of worker errors there.

Thomas Decaux
2

I was hitting the same problem. I found some clues in this article on DZone:
https://dzone.com/articles/some-lessons-of-spark-and-memory-issues-on-emr

This one was solved by increasing the number of DataFrame partitions (in this case, from 1,024 to 2,048). That reduced the needed memory per partition.

So I increased the number of DataFrame partitions as well, which solved my issue.
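A minimal sketch of that change, assuming a PySpark DataFrame read from a hypothetical path; 2,048 matches the article and should be tuned to your own data volume and cluster size:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-example").getOrCreate()

    # More partitions means less data (and less memory pressure) per task.
    df = spark.read.parquet("s3://my-bucket/input/")  # hypothetical input path
    df = df.repartition(2048)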

0

AWS has published this as an FAQ:

For EMR: https://aws.amazon.com/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/

For Glue jobs: https://aws.amazon.com/premiumsupport/knowledge-center/container-released-lost-node-100-glue/

Jay Modi
Sach