
Hello people of the Earth! I'm using Airflow to schedule and run Spark tasks. All I have found so far are Python DAGs that Airflow can manage.
DAG example:

spark_count_lines.py
import logging

from airflow import DAG
from airflow.operators import PythonOperator

from datetime import datetime

args = {
  'owner': 'airflow'
  , 'start_date': datetime(2016, 4, 17)
  , 'provide_context': True
}

dag = DAG(
  'spark_count_lines'
  , start_date = datetime(2016, 4, 17)
  , schedule_interval = '@hourly'
  , default_args = args
)

def run_spark(**kwargs):
  import pyspark
  sc = pyspark.SparkContext()
  df = sc.textFile('file:///opt/spark/current/examples/src/main/resources/people.txt')
  logging.info('Number of lines in people.txt = {0}'.format(df.count()))
  sc.stop()

t_main = PythonOperator(
  task_id = 'call_spark'
  , dag = dag
  , python_callable = run_spark
)

The problem is that I'm not good with Python and some of my tasks are written in Java. My question is: how do I run a Spark Java jar from a Python DAG? Or maybe there is another way to do it? I found spark-submit: http://spark.apache.org/docs/latest/submitting-applications.html
But I don't know how to connect everything together. Maybe someone has used it before and has a working example. Thank you for your time!

Ruslan Lomov

3 Answers


You should be able to use BashOperator. Keeping the rest of your code as is, import the required class and system packages:

from airflow.operators.bash_operator import BashOperator

import os
import sys

set the required paths:

os.environ['SPARK_HOME'] = '/path/to/spark/root'
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'bin'))

and add operator:

spark_task = BashOperator(
    task_id='spark_java',
    bash_command='spark-submit --class {{ params.class }} {{ params.jar }}',
    params={'class': 'MainClassName', 'jar': '/path/to/your.jar'},
    dag=dag
)

You can easily extend this to provide additional arguments using Jinja templates.
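For example, a minimal sketch (the main class name, jar path, and master URL are placeholders) that adds a --master option and passes the templated execution date {{ ds }} through to the application:

spark_task = BashOperator(
    task_id='spark_java',
    bash_command=(
        'spark-submit --class {{ params.class }} '
        '--master {{ params.master }} '
        '{{ params.jar }} {{ ds }}'  # {{ ds }} is the templated execution date
    ),
    params={
        'class': 'MainClassName',    # placeholder main class
        'jar': '/path/to/your.jar',  # placeholder path to your jar
        'master': 'local[*]',        # placeholder; point this at your cluster
    },
    dag=dag
)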

You can of course adjust this for a non-Spark scenario by replacing bash_command with a template suitable for your case, for example:

bash_command = 'java -jar {{ params.jar }}'

and adjusting params.
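For instance, a minimal sketch of the plain-Java variant (the jar path and input argument are placeholders):

java_task = BashOperator(
    task_id='plain_java',
    bash_command='java -jar {{ params.jar }} {{ params.input }}',
    params={'jar': '/path/to/your.jar', 'input': '/path/to/input.txt'},
    dag=dag
)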

zero323
  • If I am not mistaken, this means Spark is being run on the same machine running Airflow? What about running on a separate Spark cluster? – cryanbhu Jul 25 '18 at 07:52
  • @cryanbhu If you mean the driver, then the answer is positive (as long as Spark runs in client mode). You might want to take a look at [this question](https://stackoverflow.com/q/51177802), although it doesn't resolve the problem. – zero323 Jul 26 '18 at 12:46

Airflow, as of version 1.8 (released today), has:

SparkSQLHook code - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_sql_hook.py

SparkSubmitHook code - https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_submit_hook.py

Notice that these two new Spark hooks/operators are in the "contrib" branch as of version 1.8, so they are not (well) documented.

So you can use SparkSubmitOperator to submit your Java code for Spark execution.
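For example, a minimal sketch (the main class and jar path are placeholders; it assumes Airflow >= 1.8 with the contrib operator available and a configured 'spark_default' connection pointing at your cluster):

from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

spark_java_task = SparkSubmitOperator(
    task_id='spark_submit_java',
    conn_id='spark_default',                  # connection to your cluster
    java_class='com.example.MainClassName',   # hypothetical main class
    application='/path/to/your.jar',          # path to your Spark Java jar
    dag=dag
)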

Tagar
  • The SparkSQLOperator looks like it's just the thing I need; however, I can't get it to work because I don't know what the connection string should look like. Is there any documentation anywhere that can help me with this? – s d Dec 12 '17 at 19:44
  • If you don't set it, the connection will default to yarn execution mode; see https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_submit_hook.py#L33 – Tagar Dec 12 '17 at 21:12
  • Can we run spark2-submit using Airflow? – Deepesh Rehi Jun 19 '18 at 12:23
  • @DeepeshRehi yes, that's what the `spark_binary` argument is for. See: https://github.com/apache/airflow/blob/d2f224992eb2f1db3e6520c45b65340d244925bd/airflow/contrib/hooks/spark_submit_hook.py#L90 – Oliver W. Jul 11 '19 at 19:19

Here is an example of SparkSubmitOperator usage for Spark 2.3.1 on Kubernetes (a minikube instance):

"""
Code that goes along with the Airflow located at:
http://airflow.readthedocs.org/en/latest/tutorial.html
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.models import Variable
from datetime import datetime, timedelta

default_args = {
    'owner': 'user@mail.com',
    'depends_on_past': False,
    'start_date': datetime(2018, 7, 27),
    'email': ['user@mail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    'end_date': datetime(2018, 7, 29),
}

dag = DAG(
    'tutorial_spark_operator', default_args=default_args, schedule_interval=timedelta(1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

print_path_env_task = BashOperator(
    task_id='print_path_env',
    bash_command='echo $PATH',
    dag=dag)

spark_submit_task = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',
    java_class='com.ibm.cdopoc.DataLoaderDB2COS',
    application='local:///opt/spark/examples/jars/cppmpoc-dl-0.1.jar',
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='2g',
    num_executors='2',
    name='airflowspark-DataLoaderDB2COS',
    verbose=True,
    driver_memory='1g',
    conf={
        'spark.DB_URL': 'jdbc:db2://dashdb-dal13.services.dal.bluemix.net:50001/BLUDB:sslConnection=true;',
        'spark.DB_USER': Variable.get("CEDP_DB2_WoC_User"),
        'spark.DB_PASSWORD': Variable.get("CEDP_DB2_WoC_Password"),
        'spark.DB_DRIVER': 'com.ibm.db2.jcc.DB2Driver',
        'spark.DB_TABLE': 'MKT_ATBTN.MERGE_STREAM_2000_REST_API',
        'spark.COS_API_KEY': Variable.get("COS_API_KEY"),
        'spark.COS_SERVICE_ID': Variable.get("COS_SERVICE_ID"),
        'spark.COS_ENDPOINT': 's3-api.us-geo.objectstorage.softlayer.net',
        'spark.COS_BUCKET': 'data-ingestion-poc',
        'spark.COS_OUTPUT_FILENAME': 'cedp-dummy-table-cos2',
        'spark.kubernetes.container.image': 'ctipka/spark:spark-docker',
        'spark.kubernetes.authenticate.driver.serviceAccountName': 'spark'
        },
    dag=dag,
)

t1.set_upstream(print_path_env_task)
spark_submit_task.set_upstream(t1)

The code uses values stored in Airflow Variables (set under Admin -> Variables in the UI).

Also, you need to create a new Spark connection, or edit the existing 'spark_default' connection, with the extra dictionary {"queue":"root.default", "deploy-mode":"cluster", "spark-home":"", "spark-binary":"spark-submit", "namespace":"default"}.
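If you prefer not to click through the UI, here is a sketch of creating that connection programmatically (it relies on Airflow 1.x internals, airflow.settings.Session and airflow.models.Connection; the master URL is a placeholder for your Kubernetes API server):

import json

from airflow import settings
from airflow.models import Connection

session = settings.Session()
spark_conn = Connection(
    conn_id='spark_default',  # delete or edit the existing 'spark_default' first to avoid duplicates
    conn_type='spark',
    host='k8s://https://kubernetes.default.svc:443',  # placeholder master URL
    extra=json.dumps({
        'queue': 'root.default',
        'deploy-mode': 'cluster',
        'spark-home': '',
        'spark-binary': 'spark-submit',
        'namespace': 'default',
    }),
)
session.add(spark_conn)
session.commit()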

CTiPKA
  • A little confused by the conf properties option in Airflow. From the above code it looks like custom key=value pairs are getting passed to conf. How is that possible? Maybe I'm not understanding this option, but I thought it was meant only for Spark configuration properties that are typically passed with the `--conf` flag in spark-submit. – horatio1701d Aug 29 '18 at 10:52
  • @horatio1701d The `conf` keys are just the set of `--conf` keys that we pass to spark-submit. They can be k8s, Spark, or just our custom keys. – CTiPKA Aug 30 '18 at 15:41
  • Strange that there is no SparkSubmitHook example anywhere, as this is now deprecated. – rjurney Jun 25 '20 at 02:48