
I have been learning Airflow and writing DAGs for an ETL pipeline in an AWS environment (S3, Redshift). The pipeline copies data from one bucket to another after storing it in Redshift. I am storing bucket names and prefixes as Variables in Airflow, which currently requires opening the GUI and adding them manually (a minimal sketch of how the DAG reads them follows the options below).

Which is the safest and most widely used practice in the industry out of the following options?

  • Use airflow.cfg to store the variables (bucket names) and access them in the DAGs
  • Use a custom configuration file and parse its contents with configparser
  • Use the GUI to add Variables
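
For reference, the DAGs currently read these Variables roughly like this (a minimal sketch; the key names are placeholders):

from airflow.models import Variable

# Bucket names and prefixes stored as Airflow Variables
source_bucket = Variable.get("source_bucket")
source_prefix = Variable.get("source_prefix")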
  • I think it depends on your use case. Do you expect the bucket names and prefixes to change often, and do you want the flexibility to update Variables from the GUI without a PR (the risk being that people accidentally update or remove them from the UI)? Or do you want to track the changes in git, accepting that every change then requires a PR? – Chengzhi Aug 12 '19 at 21:18
  • The variables might not change that often. I don't want to open the GUI and add variables. Is there a way to store the variable data somewhere and have the GUI update automatically? – Command Aug 12 '19 at 21:31
  • You can potentially keep a list of your variables as JSON and, during the Airflow CI/CD process, use the Airflow CLI to perform an update; check `airflow variables -i` (https://airflow.apache.org/cli.html#variables). – Chengzhi Aug 12 '19 at 21:35
  • I think that solves my problem, and I should probably reframe my question. I'll do that. Can you explain in an answer what the keys and values of the JSON file will be and where to store it? I'll select that as the right answer. @Chengzhi – Command Aug 12 '19 at 22:01
  • Also see [With code, how do you update an airflow variable?](https://stackoverflow.com/q/54045048/3679900) – y2k-shubham Sep 18 '20 at 17:04

2 Answers


To summarize: you can use the Airflow CLI to import variables from a JSON file with the command `airflow variables -i` [1], run either manually or as part of an Airflow CI/CD pipeline. That handles the insert/update case. For deletion, call `airflow variables -x` explicitly for each key; I don't think Airflow currently supports a batch delete.

The JSON file is a flat mapping of variable keys to values, in the following format:

{
    "foo1": "bar1",
    "foo2": "bar2"
}

One thing to note here: Variables are a key-value store, so make sure you don't import duplicate keys (otherwise you may overwrite an existing value with an unexpected result).
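
For example, assuming the file above is saved as variables.json (the filename is hypothetical), the import and a per-key delete would look like this:

# Import (insert or update) all variables from the JSON file
airflow variables -i variables.json

# Delete one variable at a time by key; there is no batch delete
airflow variables -x foo1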

[1] https://airflow.apache.org/cli.html#variables

Chengzhi

Airflow uses SQLAlchemy models for entities like Connection, Variable, and Pool. Furthermore, it doesn't try to hide that from the end user in any way, meaning you are free to manipulate these entities through the underlying SQLAlchemy machinery.
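
For instance, here is a minimal sketch (assuming Airflow 1.10.x) of querying the Variable model directly through an Airflow-managed SQLAlchemy session, just to show these are ordinary SQLAlchemy entities:

from airflow.models import Variable
from airflow.utils.db import create_session

# Open a session against Airflow's metadata database
with create_session() as session:
    # List every Variable currently stored
    for var in session.query(Variable).all():
        print(var.key, var.val)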


If you intend to modify Variables programmatically (from within an Airflow task), take inspiration from here
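
As a rough illustration, a task that inserts or updates a Variable could look like the following minimal sketch (the DAG id, task id, and Variable key are hypothetical; assumes Airflow 1.10.x):

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

def update_bucket_variable():
    # Insert or update the Variable in the metadata DB
    Variable.set("source_bucket", "my-source-bucket")
    # Read it back (pass deserialize_json=True for JSON values)
    print(Variable.get("source_bucket"))

with DAG("variable_update_example",
         start_date=datetime(2019, 8, 1),
         schedule_interval=None) as dag:
    PythonOperator(
        task_id="update_bucket_variable",
        python_callable=update_bucket_variable,
    )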

Other helpful links for reference

y2k-shubham