
I'm in the process of building a few pipelines in Airflow after having spent the last few years using AWS DataPipeline. I have a couple of questions I'm foggy on and am hoping for some clarification. For context, I'm using Google Cloud Composer.

In DataPipeline, I would often create DAGs with a few tasks that would go something like this:

  1. Fetch data
  2. Transform data
  3. Write data somewhere

At each step along the way I could define an inputNode and/or an outputNode. These outputNodes would be mounted locally on the task runner, and any files written locally would be uploaded to the S3 bucket defined as the outputNode once the task finished.

Now, in Airflow, I don't think there's this same concept, right?

Q: Where do files go if I write them locally in an Airflow task? I assume they just reside on the task runner, assuming it doesn't destroy itself after the task is finished?

It seems that whereas in AWS DP I could mount an outputNode and do something like:

# append to a file in the locally mounted output directory
with open("hello.txt", "a") as f:
    f.write("world")

and when the task finished, the file hello.txt would be uploaded to the S3 bucket. But in Airflow, if I did the same thing, the file would just sit on the runner that ran the task?

Q: Should I be thinking about writing tasks differently? It seems like if my file needs to go somewhere, I have to explicitly do it within the task. Follow-up: if that's the case, should I be deleting locally created files after I upload them to storage and/or monitoring the amount of space these files are taking up on my runner?
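For my own clarity, here's a minimal sketch of what I mean by doing the upload explicitly within the task; the bucket name, object path, and local file name are made up, and it assumes the google-cloud-storage client library is available on the worker:

import os

from google.cloud import storage


def transform_and_upload():
    # write the output locally first, like the DataPipeline example above
    local_path = "/tmp/hello.txt"
    with open(local_path, "a") as f:
        f.write("world")

    # explicitly push the file to GCS; Airflow won't do this on its own
    client = storage.Client()
    client.bucket("my-output-bucket").blob("outputs/hello.txt").upload_from_filename(local_path)

    # delete the local copy so it doesn't pile up on the worker
    os.remove(local_path)

The callable would then be wired into the DAG with a PythonOperator like any other task.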

Any recommended reading for someone who is migrating from AWS DP to Airflow would be greatly appreciated, especially material you've found helpful.

Thanks!

EDIT

As I continued researching, based on this documentation it seems like GCS and Composer do something similar: the /data directory in your Composer environment is mounted on all the nodes in the cluster at /home/airflow/gcs/data.

Through testing I was able to confirm that this is the case.
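For anyone else who lands here, this is roughly how I'm using that mount; the file name is just an example:

import os

# /home/airflow/gcs/data is backed by the environment's GCS bucket,
# so a file written here by one task is visible to tasks on other workers
DATA_DIR = "/home/airflow/gcs/data"


def fetch_data():
    with open(os.path.join(DATA_DIR, "latest.csv"), "w") as f:
        f.write("col_a,col_b\n1,2\n")


def transform_data():
    with open(os.path.join(DATA_DIR, "latest.csv")) as f:
        rows = f.read().splitlines()
    # ... transform rows and write the result back out under DATA_DIR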


1 Answer


Consider writing the data between tasks to a data lake (GCS) so that these tasks can be re-run at some future time. Imagine, for example, that you wanted to change an algorithm and re-run the last step on a year's worth of historical data.
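Something along these lines, as a rough sketch (the bucket, object paths, and DAG id are placeholders; Airflow 2-style imports). Keying the GCS objects off the execution date means you can later clear just the last task for a range of past dates and re-run it with the new algorithm:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def transform(ds):
    # read gs://my-data-lake/raw/{ds}.csv, apply the current algorithm,
    # and write gs://my-data-lake/transformed/{ds}.csv; because the input
    # lives in GCS, clearing this task for past dates re-runs the transform
    # over that history
    ...


with DAG(
    dag_id="lake_backed_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    PythonOperator(
        task_id="transform",
        python_callable=transform,
        op_kwargs={"ds": "{{ ds }}"},  # key the GCS objects off the execution date
    )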
