tl;dr
This is not natively supported now but will be natively supported in a future version of Cloud Dataproc. That said, there is a manual workaround in the interim.
Workaround
First, make sure you are sending your Python logs to the correct log4j logger from the Spark context. To do this, declare your logger as:
import pyspark
sc = pyspark.SparkContext()
logger = sc._jvm.org.apache.log4j.Logger.getLogger(__name__)
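If you would rather keep using Python's standard logging module, a thin handler can forward its records to that log4j logger. The following is a minimal sketch, not part of Dataproc or PySpark: the names Log4jHandler and get_spark_logger are hypothetical, and it assumes the log4j 1.x Logger methods debug, info, warn and error are reachable through py4j, as they are for the logger declared above.

```python
import logging


class Log4jHandler(logging.Handler):
    """Forward Python logging records to a log4j 1.x Logger (hypothetical helper)."""

    def __init__(self, log4j_logger):
        super().__init__()
        self._log4j_logger = log4j_logger

    def emit(self, record):
        # Format with the handler's formatter (plain message by default).
        message = self.format(record)
        # Map Python levels onto the closest log4j method.
        if record.levelno >= logging.ERROR:
            self._log4j_logger.error(message)
        elif record.levelno >= logging.WARNING:
            self._log4j_logger.warn(message)
        elif record.levelno >= logging.INFO:
            self._log4j_logger.info(message)
        else:
            self._log4j_logger.debug(message)


def get_spark_logger(sc, name):
    """Return a Python logger whose records go to the log4j logger `name`.

    `sc` is assumed to be an existing pyspark.SparkContext.
    """
    log4j_logger = sc._jvm.org.apache.log4j.Logger.getLogger(name)
    py_logger = logging.getLogger(name)
    py_logger.setLevel(logging.DEBUG)  # let log4j decide what to drop
    py_logger.addHandler(Log4jHandler(log4j_logger))
    py_logger.propagate = False  # avoid duplicate output via the root logger
    return py_logger
```

With a SparkContext sc in hand, get_spark_logger(sc, __name__).info("starting job") would then be routed through log4j like any other driver log line. Whether a given level is actually emitted still depends on the log4j configuration discussed next.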
The second part is the manual workaround itself, which Dataproc does not yet do for you. If you look at the Spark log4j properties file at
/etc/spark/conf/log4j.properties
on the master of your cluster, you can see how log4j is configured for Spark. Currently it looks like the following:
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
...
Note that this means log4j logs are sent only to the console. The Dataproc agent picks up this output and returns it as the job driver output. However, for fluentd to pick up the output and send it to Google Cloud Logging, log4j needs to write to a local file. Therefore, you will need to modify the log4j properties as follows:
# Set everything to be logged to the console and a file
log4j.rootCategory=INFO, console, file
# Set up console appender.
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Set up file appender.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/spark/spark-log4j.log
log4j.appender.file.MaxFileSize=512KB
log4j.appender.file.MaxBackupIndex=3
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Settings to quiet third party logs that are too verbose
...
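If you would prefer to script this edit rather than change the file by hand, something along these lines may work. It is only a sketch: add_file_appender is a hypothetical helper, it assumes the file has a single log4j.rootCategory line like the one shown above, and it only transforms text, so reading and writing /etc/spark/conf/log4j.properties (which likely needs root) is left to you.

```python
# Appender settings copied from the configuration above; {path} is filled in below.
FILE_APPENDER_TEMPLATE = """\
# Set up file appender.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File={path}
log4j.appender.file.MaxFileSize=512KB
log4j.appender.file.MaxBackupIndex=3
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{{yy/MM/dd HH:mm:ss}} %p %c: %m%n
"""


def add_file_appender(properties_text, path="/var/log/spark/spark-log4j.log"):
    """Return properties_text with a 'file' appender added (hypothetical helper)."""
    out = []
    for line in properties_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "log4j.rootCategory":
            # Add 'file' to the root appender list unless it is already there.
            appenders = [part.strip() for part in value.split(",")]
            if "file" not in appenders:
                appenders.append("file")
            line = "log4j.rootCategory=" + ", ".join(appenders)
        out.append(line)
    result = "\n".join(out) + "\n"
    # Append the appender definition only once, so reruns are harmless.
    if "log4j.appender.file=" not in properties_text:
        result += "\n" + FILE_APPENDER_TEMPLATE.format(path=path)
    return result
```

The function leaves every other line untouched, including the third-party quieting settings, and running it twice on the same text yields the same result as running it once.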
If you set the file to /var/log/spark/spark-log4j.log as shown above, the default fluentd configuration on your Dataproc cluster should pick it up. If you want to write to a different file, you can follow the instructions in this question to get fluentd to pick up that file.