
I'm learning Airflow and have a simple question. Below is my DAG called dog_retriever:

import airflow
from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator
from airflow.operators.sensors import HttpSensor
from datetime import datetime, timedelta
import json



default_args = {
    'owner': 'Loftium',
    'depends_on_past': False,
    'start_date': datetime(2017, 10, 9),
    'email': 'rachel@loftium.com',
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=3),
}

dag = DAG('dog_retriever',
    schedule_interval='@once',
    default_args=default_args)

t1 = SimpleHttpOperator(
    task_id='get_labrador',
    method='GET',
    http_conn_id='http_default',
    endpoint='api/breed/labrador/images',
    headers={"Content-Type": "application/json"},
    dag=dag)

t2 = SimpleHttpOperator(
    task_id='get_breeds',
    method='GET',
    http_conn_id='http_default',
    endpoint='api/breeds/list',
    headers={"Content-Type": "application/json"},
    dag=dag)
    
t2.set_upstream(t1)

As a means to test out Airflow, I'm simply making two GET requests to endpoints in the very simple http://dog.ceo API. The goal is to learn how to work with data retrieved via Airflow.

The execution is working: my code successfully calls the endpoints in tasks t1 and t2, and I can see them being logged in the Airflow UI in the correct order, based on the set_upstream rule I wrote.

What I cannot figure out is how to ACCESS the JSON response of these two tasks. It seems so simple, but I cannot figure it out. In the SimpleHttpOperator I see a param for response_check, but nothing to simply print, store, or view the JSON response.

Thanks.

Rachel Lanman
  • Hi, there! Did you figure out how you can access data in task `t2` from the response of task `t1`? It would be great if you could share this information. Chengzhi's answer explains how to push, and I get that, but how do you pull it in task `t2`? – Jacobian Feb 28 '20 at 11:16

2 Answers


Since this is a SimpleHttpOperator, the actual JSON is pushed to XCom, and you can get it from there. Here is the line of code for that action: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/http_operator.py#L87

What you need to do is set xcom_push=True, so your first task t1 will look like the following:

t1 = SimpleHttpOperator(
    task_id='get_labrador',
    method='GET',
    http_conn_id='http_default',
    endpoint='api/breed/labrador/images',
    headers={"Content-Type": "application/json"},
    xcom_push=True,
    dag=dag)

You should then be able to find the full JSON under the return value in XCom; more detail on XCom can be found at: https://airflow.incubator.apache.org/concepts.html#xcoms
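
To address the follow-up question in the comments, here is a minimal sketch (not part of the original answer) of how a downstream task could pull what t1 pushed to XCom. It assumes the same DAG and the classic Airflow 1.x API used in the question; the task and function name print_labrador_images are made up for illustration:

from airflow.operators.python_operator import PythonOperator
import json

def print_labrador_images(**context):
    # xcom_pull retrieves the return value that get_labrador pushed
    response_text = context['ti'].xcom_pull(task_ids='get_labrador')
    data = json.loads(response_text)
    print(data)

t3 = PythonOperator(
    task_id='print_labrador_images',
    python_callable=print_labrador_images,
    provide_context=True,  # needed in Airflow 1.x so the callable receives the context
    dag=dag)

t3.set_upstream(t1)

Alternatively, a second SimpleHttpOperator can consume the pushed value through its templated data field, e.g. data="{{ ti.xcom_pull(task_ids='get_labrador') }}".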

Chengzhi
  • thank you @Chengzhi, this works. Though I think I'm going to simply use the PythonOperator from now on. – Rachel Lanman Oct 12 '17 at 20:43
  • @Chengzhi Hi! Can you please share how the second SimpleHttpOperator task `t2` might look, which may use data from the first task? The problem is, I see myriads of examples which say "just use XCom and push data", but they do not show the receiver part, i.e. the other task which uses the data pushed by the previous one. – Jacobian Feb 28 '20 at 11:13

I'm adding this answer primarily for anyone who is trying to (or who wants to) call an Airflow workflow DAG from a process and receive any data that results from the DAG's activity.

It is important to understand that an HTTP POST is required to run a DAG and that the response to this POST is hardcoded in Airflow, i.e. without changes to the Airflow code itself, Airflow will never return anything but a status code and message to the requesting process.

Airflow seems to be used primarily to create data pipelines for ETL (extract, transform, load) workflows. The existing Airflow operators, e.g. SimpleHttpOperator, can get data from RESTful web services, process it, and write it to databases using other operators, but they do not return it in the response to the HTTP POST that runs the workflow DAG.

Even if the operators did return this data in the response, looking at the Airflow source code confirms that the trigger_dag() method doesn’t check for or return it:

apache_airflow_airflow_www_api_experimental_endpoints.py

apache_airflow_airflow_api_client_json_client.py

All it does return is this confirmation message:

Airflow DagRun Message Received in Orchestration Service
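
To make this concrete, here is a minimal sketch (not from the original post) of triggering the dog_retriever DAG through the Airflow 1.x experimental REST API; the host and port are placeholders for your own webserver:

import requests

# Trigger the DAG; only a status code and a hard-coded confirmation
# message come back, never the data the DAG's tasks produced.
resp = requests.post(
    'http://localhost:8080/api/experimental/dags/dog_retriever/dag_runs',
    json={'conf': {}})

print(resp.status_code)  # 200, even though a resource was created
print(resp.json())       # only a confirmation message, e.g. {'message': 'Created <DagRun ...>'}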

Since Airflow is open source, I suppose we could modify the trigger_dag() method to return the data, but then we’d be stuck maintaining the forked codebase, and we wouldn’t be able to use cloud-hosted, Airflow-based services like Cloud Composer on Google Cloud Platform, because it wouldn’t include our modification.

Worse, Apache Airflow isn’t even returning its hard-coded status message correctly.

When we POST successfully to the Airflow /dags/{DAG-ID}/dag_runs endpoint, we receive a “200 OK” response, not a “201 Created” response as we should. And Airflow hard-codes the content body of the response with its “Created … ” status message. The standard, however, is to return the URI of the newly created resource in the response header, not in the body, which would leave the body free to return any data produced or aggregated during (or resulting from) this creation.

I attribute this flaw to the “blind” (or what I call “naive”) Agile/MVP-driven approach, which only adds features that are asked for rather than remaining aware of and leaving room for more general utility. Since Airflow is overwhelmingly used to create data pipelines for (and by) data scientists (not software engineers), the Airflow operators can share data with each other using its proprietary, internal XCom feature, as @Chengzhi's helpful answer points out (thank you!), but they cannot under any circumstances return data to the requester that kicked off the DAG.

That is, a SimpleHttpOperator can retrieve data from a third-party RESTful service and can share that data with a PythonOperator (via XCom) that enriches, aggregates, and/or transforms it. The PythonOperator can then share its data with a PostgresOperator that stores the result directly in a database. But the result cannot ever be returned to the process that requested that work be done, i.e. our Orchestration service, making Airflow useless for any use case but the one being driven by its current users.

The takeaways here (for me at least) are 1) never to attribute too much expertise to anyone or to any organization. Apache is an important organization with deep and vital roots in software development … but they’re not perfect. And 2) always beware of internal, proprietary solutions. Open, standards-based solutions have been examined and vetted from many different perspectives, not just one.

I lost nearly a week chasing down different ways to do what seemed a very simple and reasonable thing. I hope that this answer will save someone else some time.

Doug Wilson