
I see exit codes and exit statuses all the time when running Spark on YARN.

Here are a few:

  • CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

  • ...failed 2 times due to AM Container for application_1431523563856_0001_000002 exited with exitCode: 10...

  • ...Exit status: 143. Diagnostics: Container killed on request

  • ...Container exited with a non-zero exit code 52:...

  • ...Container killed on request. Exit code is 137...

I have never found any of these messages useful. Is there any way to interpret what actually goes wrong with these? I have searched high and low for a table explaining the errors, but found nothing.

The ONLY one I am able to decipher from those above is exit code 52, but that's because I looked at the source code here. It indicates an OOM.

Should I stop trying to interpret the rest of these exit codes and exit statuses? Or am I missing some obvious way that these numbers actually mean something?

Even if someone could tell me the difference between exit code, exit status, and SIGNAL, that would be useful. But I am just randomly guessing right now, and it seems as though everyone else around me who uses Spark is, too.

And, finally, why are some of the exit codes less than zero, and how should I interpret those?

E.g. Exit status: -100. Diagnostics: Container released on a *lost* node

makansij

1 Answer


Neither exit codes and statuses nor signals are Spark-specific; they are part of the way processes work on Unix-like systems.

Exit status and exit code

Exit status and exit codes are different names for the same thing. An exit status is a number between 0 and 255 which indicates the outcome of a process after it terminated. Exit status 0 usually indicates success. The meaning of the other codes is program dependent and should be described in the program's documentation. There are some established standard codes, though. See this answer for a comprehensive list.
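To make the parent/child relationship concrete, here is a minimal Scala sketch (my own example, not Spark code) that starts a child process and reads its exit status from the parent, which is roughly what YARN does for its containers. It assumes a Unix-like system where the `false` utility exists and exits with status 1.

```scala
// Minimal sketch (not Spark code): start a child process and read its exit
// status from the parent, roughly what YARN does for its containers.
// Assumes a Unix-like system where the `false` utility exists and exits with 1.
object ExitStatusDemo {
  def main(args: Array[String]): Unit = {
    val process = new ProcessBuilder("false").start()
    val status = process.waitFor() // 0 means success, 1-255 encode failures
    println(s"child exited with status $status") // prints 1 here
  }
}
```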

Exit codes used by Spark

In the Spark sources I found the following exit codes. Their descriptions are taken from log statements and comments in the code and from my understanding of the code where the exit status appeared.

Spark SQL CLI Driver in Hive Thrift Server

  • 3: if an UnsupportedEncodingException occurred when setting up stdout and stderr streams.

Spark/Yarn

  • 10: if an uncaught exception occurred
  • 11: if more than spark.yarn.scheduler.reporterThread.maxFailures executor failures occurred
  • 12: if the reporter thread failed with an exception
  • 13: if the program terminated before the user had initialized the Spark context, or if the Spark context did not initialize before a timeout.
  • 14: This is declared as EXIT_SECURITY but never used
  • 15: if a user class threw an exception
  • 16: if the shutdown hook was called before the final status was reported. A comment in the source code explains the expected behaviour of user applications:

    The default state of ApplicationMaster is failed if it is invoked by shut down hook. This behavior is different compared to 1.x version. If user application is exited ahead of time by calling System.exit(N), here mark this application as failed with EXIT_EARLY. For a good shutdown, user shouldn't call System.exit(0) to terminate the application.

Executors

  • 50: The default uncaught exception handler was reached
  • 51: The default uncaught exception handler was called and an exception was encountered while logging the exception
  • 52: The default uncaught exception handler was reached, and the uncaught exception was an OutOfMemoryError
  • 53: DiskStore failed to create local temporary directory after many attempts (bad spark.local.dir?)
  • 54: ExternalBlockStore failed to initialize after many attempts
  • 55: ExternalBlockStore failed to create a local temporary directory after many attempts
  • 56: Executor is unable to send heartbeats to the driver more than "spark.executor.heartbeat.maxFailures" times.

  • 101: Returned by spark-submit if the child main class was not found. In client mode (command line option --deploy-mode client) the child main class is the user submitted application class (--class CLASS). In cluster mode (--deploy-mode cluster) the child main class is the cluster manager specific submission/client class.
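If you need to triage these codes in logs, a small lookup helper can be handy. The following is a hypothetical Scala sketch of my own (the object and method names are made up, not Spark API) that maps the executor codes listed above to short descriptions:

```scala
// Hypothetical helper (names are mine, not Spark API) that maps the executor
// exit codes listed above to short descriptions, for quick log triage.
object ExecutorExitCodeGuide {
  private val descriptions = Map(
    50 -> "uncaught exception in the default handler",
    51 -> "uncaught exception, and logging it failed as well",
    52 -> "uncaught OutOfMemoryError",
    53 -> "DiskStore could not create a local temp dir (check spark.local.dir)",
    54 -> "ExternalBlockStore failed to initialize",
    55 -> "ExternalBlockStore could not create a local temp dir",
    56 -> "executor could not send heartbeats to the driver"
  )

  def describe(exitCode: Int): String =
    descriptions.getOrElse(
      exitCode,
      if (exitCode > 128) s"probably killed by signal ${exitCode - 128}"
      else "unknown or application-specific"
    )
}

// ExecutorExitCodeGuide.describe(52)  -> "uncaught OutOfMemoryError"
// ExecutorExitCodeGuide.describe(143) -> "probably killed by signal 15"
```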

Exit codes greater than 128

These exit codes most likely result from a program shutdown triggered by a Unix signal. The signal number can be calculated by subtracting 128 from the exit code. This is explained in more detail in this blog post (which was originally linked in this question). There is also a good answer explaining JVM-generated exit codes. Spark works with this assumption, as explained in a comment in ExecutorExitCode.scala.
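As a quick sanity check, the two codes from the question fit this convention: 143 - 128 = 15 (SIGTERM) and 137 - 128 = 9 (SIGKILL). Here is a tiny Scala sketch of my own (not Spark code) that does the arithmetic:

```scala
// Quick check (my own sketch) of the "exit code = 128 + signal number"
// convention for the two codes from the question.
object SignalFromExitCode {
  private val signalNames = Map(1 -> "SIGHUP", 2 -> "SIGINT", 9 -> "SIGKILL", 15 -> "SIGTERM")

  def main(args: Array[String]): Unit = {
    for (exitCode <- Seq(137, 143)) {
      val signal = exitCode - 128
      println(s"exit code $exitCode -> signal $signal (${signalNames.getOrElse(signal, "?")})")
    }
    // exit code 137 -> signal 9  (SIGKILL): hard kill, e.g. by the kernel OOM killer
    // exit code 143 -> signal 15 (SIGTERM): termination request, e.g. YARN asking a container to stop
  }
}
```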

Other exit codes

Apart from the exit codes listed above, there are a number of System.exit() calls in the Spark sources setting 1 or -1 as the exit code. As far as I can tell, -1 seems to be used to indicate missing or incorrect command line parameters, while 1 indicates all other errors.

Signals

Signals are a kind of event that allows system messages to be sent to a process. These messages are used, for instance, to ask a process to reload its configuration (SIGHUP) or to terminate itself (SIGKILL). A list of standard signals can be found in the signal(7) man page in the section Standard Signals.
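For illustration, here is a rough Scala sketch of how a JVM process can register handlers for TERM, HUP and INT, in the spirit of the "Registered signal handlers for [TERM, HUP, INT]" line that shows up in YARN logs. It uses the JVM-internal sun.misc API, so treat it as illustrative only, not as Spark's actual implementation.

```scala
// Rough sketch of registering handlers for TERM, HUP and INT in a JVM process,
// in the spirit of the "Registered signal handlers for [TERM, HUP, INT]" log
// line. Uses the JVM-internal sun.misc API; illustrative only, not Spark's
// actual implementation.
import sun.misc.{Signal, SignalHandler}

object SignalLoggerSketch {
  def main(args: Array[String]): Unit = {
    val handler = new SignalHandler {
      override def handle(sig: Signal): Unit =
        println(s"RECEIVED SIGNAL ${sig.getNumber}: SIG${sig.getName}")
    }
    Seq("TERM", "HUP", "INT").foreach(name => Signal.handle(new Signal(name), handler))

    println("Waiting for signals; try `kill -TERM <pid>` from another shell.")
    Thread.sleep(60000) // keep the JVM alive so signals can be delivered
  }
}
```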

As explained by Rick Moritz in the comments below (thank you!), the most likely sources of signals in a Spark setup are

  • the cluster resource manager, when the container size was exceeded, the job finished, a dynamic scale-down happened, or the job was aborted by the user
  • the operating system: as part of a controlled system shut down or if some resource limit was hit (out of memory, over hard quota, no space left on disk etc.)
  • a local user who killed a job

I hope this makes it a bit clearer what these messages from Spark might mean.

Christoph Böhme
  • I think the question was aiming for spark-specific reasons/causes for what actually triggers the signal. Also, since Java code or the JVM itself is generating those exit codes, I think we can get more specific than the standard codes. – Rick Moritz Aug 08 '17 at 14:01
  • Thank you for your input. I will try to expand my answer and add some more details on the actual exit codes used by the JVM and Spark later today. – Christoph Böhme Aug 08 '17 at 14:25
  • 1
    @RickMoritz: I added a section on the exit codes used by Spark. As I have not used Spark myself, the naming of the components might not be correct. I still think that the list is quite helpful. – Christoph Böhme Aug 08 '17 at 17:21
  • yes, it is true that I am more interested in the spark-specific reasons/codes. But this is still useful. – makansij Aug 08 '17 at 23:46
  • @ChristophBöhme I'm having trouble seeing which exit code would correspond to something like having too many or too few partitions. Am I missing which would tell me that it is a partitioning issue? OR is partitioning a symptom and not an illness? – makansij Aug 08 '17 at 23:47
  • As far as I know the number of partitions is mostly a tuning factor. If it is too small, parallelism is not fully exploited. If it is too large, communication overhead gets too high. However, the [Javadoc on the Partitioner class](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L51) suggests that OOMs are an indicator of bad partitioning. Are you seeing any other exit codes than the ones listed above? – Christoph Böhme Aug 09 '17 at 06:11
  • 1
    Very well done. I think one extra point to add would be the source of the signals: Usually this would be the cluster resource manager (container size exceeded, job finished, dynamic scale-down, job aborted by user), the operating system (controlled system shutdown, out of memory, over hard quota, no space left on disk, etc.), or a local user (kill). There should be no other entities sending signals to Spark processes. – Rick Moritz Aug 09 '17 at 08:10
  • That's a very good addition indeed. I incorporated it into the answer. I hope this and the way it is attributed is okay with you. – Christoph Böhme Aug 10 '17 at 17:18
  • okay. Am I the only one who sees `Exit code 143` more frequently than any other exit code? – makansij Aug 10 '17 at 23:37
  • @ChristophBöhme anyway congrats awesome answer – makansij Aug 12 '17 at 04:46
  • Would you mind elaborating a little bit on `exit code 15`? When you say "user code" I assume you mean like a custom UDF or something? In which case, why does SIGTERM 15 usually mean an OOM and not necessarily a custom UDF? – makansij Aug 12 '17 at 05:25
  • Also, the difference between 1) main class of the launch environment and 2) main user class. What is the diff between those? – makansij Aug 12 '17 at 05:32
  • Also, I'm guessing `SIGKILL` and `SIGTERM` are the same thing? – makansij Aug 12 '17 at 22:54
  • @guimption Not quite: SIGTERM asks a process to please terminate itself. A process can ignore this request. SIGKILL, on the other hand, simply kills a process without asking it. – Christoph Böhme Aug 13 '17 at 06:46
  • I see. I've noticed that `YY/mm/dd HH:MM:ss INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]` shows up in my yarn logs frequently and I'm wondering what that has to do with it. – makansij Aug 14 '17 at 22:36
  • This log message just says that the ApplicationMaster wants to receive the signals SIGTERM, SIGHUP and SIGINT so that it can react to them. Spark seems to install a signal handler for these three signals which simply logs a "SIGNAL RECEIVED" message. – Christoph Böhme Aug 15 '17 at 05:29
  • 1
    Great. I also am curious what it means when the exit status is less than 0? I editted this into my question. thanks. – makansij Aug 15 '17 at 15:18
  • 1
    @ChristophBöhme @RickMoritz @Sother I found this post because I got the exact same `Exit status: -100...*lost* node` as the OP. This is a great answer but I see no mention of that. Any ideas about these negative exit codes (other than the `-1` indicating a CLI error)? Thanks! – seth127 Mar 26 '19 at 13:44
  • @ChristophBöhme Is there any way to return a custom return code from a Spark job? We have written a validation job in Spark and we want to pass a couple of return codes, indicating the type of validation error, to the shell script which invokes spark-submit. We are using System.exit(1001) to return the exit code. – Albin Jun 07 '19 at 06:26
  • Can anyone help me understand exitCode: -1000? Caused by: org.apache.spark.SparkException: Application application_1612838707347_1556 failed 2 times due to AM Container for appattempt_1612838707347_1556_000002 exited with exitCode: -1000 – Deepak Feb 10 '21 at 19:27