
I am running a PySpark job on Cloud Dataproc and want to log information using Python's logging module. The goal is to then push these logs to Cloud Logging.

From this question, I learned that I can achieve this by adding a logfile to the fluentd configuration, which is located at /etc/google-fluentd/google-fluentd.conf.

However, when I look at the log files in /var/log, I cannot find the files that contain my logs. I've tried both the default Python (root) logger and the 'py4j' logger:

import logging

logger = logging.getLogger()
logger = logging.getLogger('py4j')

Can anyone shed some light on which logger I should use, and which file should be added to the fluentd configuration?

Thanks

1 Answer


tl;dr

This is not natively supported today, but it will be in a future version of Cloud Dataproc. In the interim, there is a manual workaround.

Workaround

First, make sure you are sending the Python logs to the correct log4j logger from the Spark context. To do this, declare your logger as:

import pyspark

sc = pyspark.SparkContext()
# Grab the JVM-side log4j logger through the SparkContext's py4j gateway
logger = sc._jvm.org.apache.log4j.Logger.getLogger(__name__)
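
For example, assuming the sc and logger from the snippet above, you can then emit messages at the usual levels; the calls below use the standard log4j Logger methods:

# These messages end up in Spark's log4j output rather than in Python's logging module
logger.info("Starting the PySpark job")
logger.warn("Something looks off, but the job continues")
logger.error("Something went wrong")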

The second part involves a workaround that isn't natively supported yet. If you look at the log4j properties file for Spark under

/etc/spark/conf/log4j.properties

on the master node of your cluster, you can see how log4j is configured for Spark. Currently it looks like the following:

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

# Settings to quiet third party logs that are too verbose
...

Note that this means log4j logs are sent only to the console. The Dataproc agent will pick up this output and return it as the job driver output. However, in order for fluentd to pick up the output and send it to Google Cloud Logging, log4j needs to write to a local file as well. Therefore you will need to modify the log4j properties as follows:

# Set everything to be logged to the console and a file
log4j.rootCategory=INFO, console, file
# Set up console appender.
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

# Set up file appender.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/spark/spark-log4j.log
log4j.appender.file.MaxFileSize=512KB
log4j.appender.file.MaxBackupIndex=3
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

# Settings to quiet third party logs that are too verbose
...

If you set the file to /var/log/spark/spark-log4j.log as shown above, the default fluentd configuration on your Dataproc cluster should pick it up. If you want to set the file to something else you can follow the instructions in this question to get fluentd to pick up that file.
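
For reference, getting fluentd to tail a custom file generally means adding a tail source to the google-fluentd configuration. The snippet below is only a rough sketch using the standard fluentd in_tail plugin; the path, pos_file location, and tag are hypothetical placeholders, not the Dataproc defaults:

<source>
  type tail
  format none
  # Hypothetical log file written by your custom log4j file appender
  path /var/log/spark/my-custom-log4j.log
  pos_file /var/tmp/fluentd-spark-custom.pos
  read_from_head true
  tag spark-custom-log4j
</source>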

Lauren
  • After adding the file appender settings to the configuration file, and subsequently resetting my Dataproc cluster and running a job, I cannot find my logs in Cloud Logging. However, I've noticed that the `spark-log4j.log` file is not present in the `/var/log/spark/` directory. Do I need to take any other steps to make log4j log to that file? Furthermore, I don't see how this approach will work on the executor nodes, as they don't have access to the Spark context in order to connect to the JVM. – Gilles Jacobs Dec 17 '15 at 09:05