
I am using `WebHDFSSensor`, which requires a namenode connection. However, the active and standby namenodes can swap roles, so I can't just put the current active namenode's host into `webhdfs_conn_id`; the connection has to cover both hosts. I tried providing the hosts as an array, but that didn't work.

So my question is: suppose I need a connection named `webhdfs_default` that covers the two hosts w.x.y.z and a.b.c.d. How do I create that?
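For context, an Airflow connection holds exactly one host, which is why a list of hosts can't be supplied. A minimal sketch of the URI form Airflow reads from an `AIRFLOW_CONN_<CONN_ID>` environment variable; the `hdfs` scheme and port 50070 are assumptions about the deployment:

```python
from urllib.parse import urlparse

# Sketch, assuming Airflow's AIRFLOW_CONN_<CONN_ID> environment-variable
# convention: one URI, hence one host, per connection. The scheme ("hdfs")
# and the port (50070, the classic WebHDFS default) are assumptions.

def webhdfs_conn_uri(host, port=50070):
    """Build a single-host connection URI for AIRFLOW_CONN_WEBHDFS_DEFAULT."""
    return f"hdfs://{host}:{port}"

# Each connection can name only one of the two namenodes:
for host in ("w.x.y.z", "a.b.c.d"):
    assert urlparse(webhdfs_conn_uri(host)).hostname == host
```

So the choice between the two hosts has to be made somewhere else, at runtime, rather than inside a single static connection.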

Ayush Goyal
  • You can place a `PythonOperator` before your `WebHDFSSensor` that updates the `webhdfs_conn_id`. See [this](https://stackoverflow.com/q/51863881/3679900), [this](https://stackoverflow.com/q/53740885/3679900) and [this](https://www.google.com/search?q=programmatically+create+connections+in+airflow+site:stackoverflow.com). Also note (not a solution in itself) that you can technically have multiple connections defined with `conn_id='webhdfs_conn_id'`, in which case Airflow will [randomly pick](https://airflow.apache.org/docs/stable/concepts.html?highlight=connection#connections) one of them – y2k-shubham Sep 07 '20 at 11:02
  • But we don't know in advance which server will be active and which will be standby. – Ayush Goyal Sep 09 '20 at 06:05
  • If there's a predetermined set of server IPs (and, given an IP, a way to tell whether it is the active or the standby server), then this entire logic of determining the current active server and updating Airflow's connection can be baked into a single `PythonOperator` that precedes the `WebHDFSSensor`. (Alternatively, you can have a separate DAG that runs every 5 min and updates the connection.) Otherwise you'll have to build it into your HDFS deployment so that whenever the active server changes, an event is published (to, say, SNS), from where you can (via a Lambda) trigger that IP-update DAG via Airflow's REST API – y2k-shubham Sep 09 '20 at 07:16
  • We have a predetermined set of IPs, but my question is: how do I determine which namenode is active and which is standby? The only difference between them is that we can read and write from the active namenode. – Ayush Goyal Sep 09 '20 at 08:25
  • `..the only difference between them is we can read and write from active namenode..` couldn't that be the answer? In a `PythonOperator`, you attempt to *"read and write"* to the list of nodes (IPs) one by one; whichever succeeds is the active one (and you set its IP in `webhdfs_conn_id`) – y2k-shubham Sep 09 '20 at 09:53
  • I will try this. – Ayush Goyal Sep 09 '20 at 10:00
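Following the last suggestion in the thread, here is a minimal sketch of that probing logic in plain Python, so it can sit inside the `PythonOperator` that runs before the sensor. The port and the `GETFILESTATUS` probe are assumptions about the cluster: a standby namenode rejects read operations (with a `StandbyException`), so a simple WebHDFS read distinguishes the two.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Hosts from the question; the port (50070, the classic WebHDFS default)
# is an assumption -- newer Hadoop releases serve WebHDFS on 9870.
NAMENODE_HOSTS = ["w.x.y.z", "a.b.c.d"]


def probe_webhdfs(host, port=50070, timeout=5):
    """Return True if `host` answers a WebHDFS read, i.e. is the active namenode.

    A standby namenode rejects read operations, so a GETFILESTATUS on /
    only succeeds against the active one.
    """
    url = f"http://{host}:{port}/webhdfs/v1/?op=GETFILESTATUS"
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False


def find_active_namenode(hosts, probe=probe_webhdfs):
    """Return the first host that accepts reads, or None if none does."""
    for host in hosts:
        if probe(host):
            return host
    return None
```

In the DAG, a `PythonOperator` placed before the `WebHDFSSensor` could call `find_active_namenode(NAMENODE_HOSTS)` and write the returned host into the `webhdfs_default` connection (via Airflow's `Connection` model or its REST API, as discussed in the comments); the exact update mechanism depends on your Airflow version.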

0 Answers