20

I'm running a very simple Spark job on AWS EMR and can't seem to get any log output from my script.

I've tried printing to stderr:

from pyspark import SparkContext
import sys

if __name__ == '__main__':
    sc = SparkContext(appName="HelloWorld")
    print('Hello, world!', file=sys.stderr)
    sc.stop()

And using the Spark logger as shown here:

from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext(appName="HelloWorld")

    log4jLogger = sc._jvm.org.apache.log4j
    logger = log4jLogger.LogManager.getLogger(__name__)
    logger.error('Hello, world!')

    sc.stop()

EMR gives me two log files after the job runs: controller and stderr. Neither log contains the "Hello, world!" string. It's my understanding that stdout is redirected to stderr in Spark. The stderr log shows that the job is accepted, run, and completed successfully.

So my question is, where can I view my script's log output? Or what should I change in my script to log correctly?

Edit: I used this command to submit the step:

aws emr add-steps --region us-west-2 --cluster-id x-XXXXXXXXXXXXX --steps Type=spark,Name=HelloWorld,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,s3a://path/to/simplejob.py],ActionOnFailure=CONTINUE
jarbaugh
  • What parameters did you submit the job to EMR with? – Ian Leaman Mar 06 '17 at 01:13
  • 3
    I've found that logging for particular steps almost never winds up in the controller or stderr logs that the EMR console pulls alongside the step. Usually I find what I want in the job's container logs (and usually it's in stdout). They're typically at a path like `s3://mybucket/logs/emr/spark/j-XXXXXX/containers/application_XXXXXXXXX/container_XXXXXXX/...` – Greg Reda Mar 06 '17 at 01:32
  • 1
    @GregReda I found the logs there. Thank you! If you post it as the answer I'll accept it. – jarbaugh Mar 06 '17 at 01:41
  • Great! Glad all my time debugging EMR + PySpark hasn't been for nothing :) – Greg Reda Mar 06 '17 at 01:45

4 Answers

16

I've found that EMR's logging for particular steps almost never winds up in the controller or stderr logs that get pulled alongside the step in the AWS console.

Usually I find what I want in the job's container logs (and usually it's in stdout).

These are typically at a path like s3://mybucket/logs/emr/spark/j-XXXXXX/containers/application_XXXXXXXXX/container_XXXXXXX/.... You might need to poke around within the various application_... and container_... directories within containers.

That last container directory should have a stdout.log and stderr.log.
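
If you'd rather script that search than click through the S3 console, here's a minimal boto3 sketch that lists the log files under a containers/ prefix. The bucket name and prefix below are placeholders taken from the example path above, not anything EMR fixes for you, so adjust them to your cluster.

import boto3

# Placeholders -- substitute your own log bucket and cluster log prefix.
bucket = "mybucket"
prefix = "logs/emr/spark/j-XXXXXX/containers/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Walk everything under containers/ and print the keys that look like
# per-container stdout/stderr logs, so you can see which container to open.
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith(("stdout.gz", "stderr.gz", "stdout.log", "stderr.log")):
            print(key)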

Greg Reda
1

For what it's worth: let j-XXX be the ID of the EMR cluster and assume it is configured to persist its logs to the S3 bucket logs_bucket. If you want to find the logs emitted by your code, do the following (a scripted version of these steps appears after the list):

  1. In AWS console, find the step which you want to review
  2. Go to its stderr and search for application_. Take note of the full name you find; it should be something like application_15489xx175355_0yy5.
  3. Go to s3://logs_bucket/j-XXX/containers and find the folder application_15489xx175355_0yy5.
  4. In this folder, you will find at least one folder named application_15489xx175355_0yy5_ww_vvvv. In those folders you will find files named stderr.gz, which contain the logs emitted by your code.
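
If you prefer to script steps 3 and 4, here is a rough boto3 sketch of the same idea. The bucket name, cluster ID, and application ID below are the placeholders from the steps above; replace them with your own values.

import gzip

import boto3

# Placeholders -- use your own log bucket, cluster ID, and application ID.
logs_bucket = "logs_bucket"
prefix = "j-XXX/containers/application_15489xx175355_0yy5/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Fetch every stderr.gz under the application's container folders and print
# its decompressed contents.
for page in paginator.paginate(Bucket=logs_bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("stderr.gz"):
            body = s3.get_object(Bucket=logs_bucket, Key=obj["Key"])["Body"].read()
            print("=== " + obj["Key"] + " ===")
            print(gzip.decompress(body).decode("utf-8"))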
Dror
  • Sir, can you please tell me how to store custom logs to S3 from the EMR cluster? I'm now able to get all the system logs into the S3 bucket, but I have a log file that stores all my program logs which I'm not able to find in the S3 bucket; I can only see it on EMR at this path: /mnt/var/log/hadoop/steps/N. Please give me a suggestion on how to achieve this. Thanks in advance – snehil singh Aug 28 '19 at 06:51
  • 1
    I wish I knew the answer... Sorry. – Dror Aug 28 '19 at 07:24
0

To capture the output of the script, you can also try something like the following:

/usr/bin/spark-submit --master yarn --num-executors 300 myjob.py param1 > s3://databucket/log.out 2>&1 &

This will write the script's output to a log file at the S3 location.

braj
0

I'm using emr-5.30.1 running in YARN client mode and got this working using the Python logging library.

I didn't like the solutions that use Spark's private JVM methods. Apart from relying on a private method, those caused my application logs to appear in the Spark logs (which are already quite verbose) and also forced me to use Spark's logging format.

Sample code using logging:

import logging

logging.basicConfig(
    format="""%(asctime)s,%(msecs)d %(levelname)-8s[%(filename)s:%(funcName)s:%(lineno)d] %(message)s""",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)

if __name__ == '__main__':
    logging.info('test')
    ...

When the cluster is created, I specify LogUri='s3://mybucket/emr/' via the console / CLI / boto.

Log output appears in stdout.gz of the relevant step, which can be found using either of the options below.

  1. In the EMR Console, choose your cluster. On the "Summary" tab, click the tiny folder icon next to "Log URI". Within the popup, navigate to steps, choose your step ID, and open stdout.gz.

  2. In S3 navigate to the logs directly. They are located at emr/j-<cluster-id>/steps/s-<step-id>/stdout.gz in mybucket.
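
For completeness, this is roughly how the LogUri gets set when the cluster is created with boto3. The cluster name, bucket, instance types, and roles below are illustrative placeholders, not values this answer prescribes.

import boto3

emr = boto3.client("emr")

# Illustrative placeholders -- adjust the name, log bucket, instances, and roles.
response = emr.run_job_flow(
    Name="my-cluster",
    ReleaseLabel="emr-5.30.1",
    LogUri="s3://mybucket/emr/",  # step logs land under this prefix
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])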

Bjorn