
I have a DAG that creates a cluster, starts computation tasks, and after they complete, tears down the cluster. I want to limit the concurrency of the computation tasks carried out on this cluster to a fixed number. So logically, I need a pool that is exclusive to the cluster created by a task. I don't want interference with other DAGs or with different runs of the same DAG.

I thought I could solve this problem by creating a pool dynamically from a task after the cluster is created and deleting it once the computation tasks have finished. I thought I could template the pool parameter of the computation tasks to make them use this dynamically created pool.

# execute registers a pool and returns with the pool name
create_pool = CreatePoolOperator(
    slots=4,
    task_id='create_pool',
    dag=self
)

# the pool parameter is templated
computation = ComputeOperator(
    task_id=compute_subtask_name,
    pool="{{ ti.xcom_pull(task_ids='create_pool') }}",
    dag=self
)

create_pool >> computation

But this way the computation tasks will never be triggered. So I think the pool parameter is saved in the task instance before it is templated. I would like to hear your thoughts on how to achieve the desired behavior.

Midiparse
  • I have provided you a custom solution. Please tweak the pseudo code with professional, battle-ready code that does not cause a "race condition" :) – Kyle Bridenstine Apr 25 '19 at 18:15
  • Also, I was LITERALLY just in Budapest, Hungary, visiting my work's office there!!! Maybe I ran into you lol. – Kyle Bridenstine Apr 25 '19 at 18:15
  • I don't think we met. Hope you had a great time though! :) – Midiparse Apr 28 '19 at 19:17
  • I've posted an answer below, why can't you know the pool name before you create the pool? Seems like that would solve your problem. – Dave C May 17 '19 at 19:41
  • why can't you know the pool name before you create the pool? Seems like that would solve your problem: because I want a new pool in each DagRun. This requires passing at least the DagRun's id to the name, which requires templating. The pool parameter cannot be templated. – Midiparse May 18 '19 at 14:37

3 Answers


Instead of trying to get a dynamic pool to work, see if the concurrency attribute on airflow.models.DAG will do the trick. It limits the number of task instances that are allowed to run concurrently for that DAG.
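
For illustration, a minimal sketch of that attribute (the dag_id and dates are placeholders; concurrency is the Airflow 1.x name for this parameter):

from datetime import datetime
from airflow.models import DAG

# At most 4 task instances of this DAG may run at the same time,
# regardless of how many are otherwise ready to be scheduled.
dag = DAG(
    dag_id='cluster_computation',
    start_date=datetime(2018, 1, 1),
    schedule_interval='@daily',
    concurrency=4,
)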

joebeeson
  • I know that configuration option already. However I need to control the concurrency of a specific set of tasks, so this is not what I want. – Midiparse Oct 08 '18 at 12:34
  • However I realized that we only want 1 run of the same DAG at a time, so I set `max_active_runs` to 1 and upsert a pool for the DAG that can be used by the tasks (sketched after these comments). Nevertheless, not an optimal solution... – Midiparse Oct 08 '18 at 12:36
  • **@joeb**, **@Midiparse** okay, at least I don't have to create the pool dynamically, so do you think something like [this](https://gist.github.com/y2k-shubham/275d9679ba3dd1e99f3e0ad401f69920) would work? (done only at the time of DAG deployment). Furthermore, can you think of a viable solution for [this](https://stackoverflow.com/questions/53740885/create-and-use-connections-in-airflow-operator-at-runtime)? – y2k-shubham Dec 17 '18 at 07:26
  • Yes, that will work for global pools. However, in that case it would be cleaner to set the pool up with some kind of init script, simply [using the airflow cli](https://airflow.apache.org/cli.html#pool), e.g. from bash, just so you won't run queries against the db needlessly. – Midiparse Dec 19 '18 at 14:36
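
A minimal sketch of the workaround described in the comments above (ComputeOperator is the question's operator; the pool is assumed to already exist, e.g. created once by an init script via the CLI):

from datetime import datetime
from airflow.models import DAG

POOL_NAME = 'my_dag_compute_pool'  # hypothetical, pre-created pool

# Only one active run of this DAG at a time, so a single shared pool is safe.
dag = DAG(
    dag_id='my_dag',
    start_date=datetime(2018, 1, 1),
    max_active_runs=1,
)

computation = ComputeOperator(
    task_id='compute',
    pool=POOL_NAME,
    dag=dag,
)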

This answer will probably aggravate some, but it's one possible path nonetheless, and so it's worth documenting. The core feature that makes Airflow more powerful than its competitors is that everything is defined using code. At the end of the day, if Airflow does not provide us with a feature, we can always just create the feature ourselves using Python.

You want the ability to pool tasks in a DAG, but only for that specific DAG run. So try to just create a custom pool for your tasks. Here's some pseudo code off the top of my head:

import time

from airflow.operators.python_operator import PythonOperator

# Shared "whose turn is it" queue. NOTE: a module-level list like this is
# only shared when all tasks run in the same process; see the comments
# below about making it global in a clustered environment.
tasks_pool_queue = []


def task_ones_function(**context):

    while True:

        if tasks_pool_queue and tasks_pool_queue[0] == 'task_ones_turn':
            print("Do some work, it's your turn")

            # Delete this entry from the list and shift the list over to the
            # left one index, so that the next value is now the first value
            # in the list
            tasks_pool_queue.pop(0)

            return 0

        else:
            time.sleep(10)


def task_twos_function(**context):

    while True:

        if tasks_pool_queue and tasks_pool_queue[0] == 'task_twos_turn':
            print("Do some work, it's your turn")

            # Delete this entry from the list and shift the list over to the
            # left one index, so that the next value is now the first value
            # in the list
            tasks_pool_queue.pop(0)

            return 0

        else:
            time.sleep(10)


def create_logical_ordering_of_task_pool_queue(**context):

    # foobar stands in for whatever condition decides the ordering
    if foobar:
        tasks_pool_queue.extend(['task_ones_turn', 'task_twos_turn'])
    else:
        tasks_pool_queue.extend(['task_twos_turn', 'task_ones_turn'])

    return 0


determine_pool_queue_ordering = PythonOperator(
    task_id='determine_pool_queue_ordering',
    retries=0,
    dag=dag,
    provide_context=True,
    python_callable=create_logical_ordering_of_task_pool_queue,
    op_args=[])

task1 = PythonOperator(
    task_id='task1',
    retries=0,
    dag=dag,
    provide_context=True,
    python_callable=task_ones_function,
    op_args=[])

task2 = PythonOperator(
    task_id='task2',
    retries=0,
    dag=dag,
    provide_context=True,
    python_callable=task_twos_function,
    op_args=[])

determine_pool_queue_ordering.set_downstream(task1)
determine_pool_queue_ordering.set_downstream(task2)
So hopefully everyone can follow my pseudo code. I don't know what the best way of creating a custom pool would be that doesn't introduce a "race condition", so this list-queue idea was what I came up with at first glance. But the main point here is that both task1 and task2 will run at the same time, BUT inside their functions I can make it so that the function doesn't do anything meaningful until it gets past that if statement, preventing it from running the real code.

The first task will dynamically set which tasks run first and in what order using the list. Then have all the functions that need to be in this custom pool reference that list. Since our if statements only evaluate to true when their task name is first in the list, it essentially means that only one task can run at a time. The first task in the list will delete itself from the list once it's done processing whatever it needs to do. Then the other tasks will sleep while they wait for their task name to be first in the list.

So just make some custom logic similar to mine.

Kyle Bridenstine
  • I just realized this answer may be affected by having a clustered environment where the tasks are running on different worker nodes (different servers), so you might need to tweak the tasksPoolQueue so that it's global. You can do a number of things, like use a database to store the information; AWS S3 or SQS might work too. Honestly, there should be a plethora of ways for you to do it. You could even try using the BashOperator to run the Airflow Variable commands and try storing the information in an Airflow Variable (see the sketch after these comments). – Kyle Bridenstine Apr 26 '19 at 14:23
  • I didn't want to create custom pooling because I didn't want to ruin the transparency of the tasks. With your workaround, all tasks will appear on the UI as running (on the Gantt charts as well), when in reality they are blocked. Dynamic pooling should be supported intrinsically by Airflow IMO. – Midiparse Apr 28 '19 at 19:09
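
A minimal sketch of the Airflow Variable variant from the first comment above (QUEUE_KEY and wait_for_turn are hypothetical names; note that the get-then-set below is still not atomic across workers, so the race condition concern remains):

import time

from airflow.models import Variable

QUEUE_KEY = 'tasks_pool_queue'  # hypothetical Variable holding the turn queue


def wait_for_turn(my_turn, **context):
    # Block until my_turn reaches the head of the queue stored in the Variable.
    while True:
        queue = Variable.get(QUEUE_KEY, default_var=[], deserialize_json=True)
        if queue and queue[0] == my_turn:
            # Pop ourselves off the head so the next task may proceed.
            # NOTE: Variable.get followed by Variable.set is not atomic.
            Variable.set(QUEUE_KEY, queue[1:], serialize_json=True)
            return
        time.sleep(10)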

Here is an operator that creates a pool if it doesn't exist.

from airflow.api.common.experimental.pool import get_pool, create_pool
from airflow.exceptions import PoolNotFound
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class CreatePoolOperator(BaseOperator):
    # it's pool blue, get it?
    ui_color = '#b8e9ee'

    @apply_defaults
    def __init__(
            self,
            name,
            slots,
            description='',
            *args, **kwargs):
        super(CreatePoolOperator, self).__init__(*args, **kwargs)
        self.description = description
        self.slots = slots
        self.name = name

    def execute(self, context):
        try:
            # get_pool raises PoolNotFound when the pool does not exist
            pool = get_pool(name=self.name)
            self.log.info(f'Pool exists: {pool}')
        except PoolNotFound:
            # the pool does not exist yet, so create it
            pool = create_pool(name=self.name, slots=self.slots, description=self.description)
            self.log.info(f'Created pool: {pool}')

Deleting the pool could be done in a similar manner.
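
For completeness, a sketch of the delete counterpart, assuming the experimental API's delete_pool is available (same Airflow 1.10-era imports as the operator above):

from airflow.api.common.experimental.pool import delete_pool
from airflow.exceptions import PoolNotFound
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DeletePoolOperator(BaseOperator):
    ui_color = '#b8e9ee'

    @apply_defaults
    def __init__(self, name, *args, **kwargs):
        super(DeletePoolOperator, self).__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        try:
            pool = delete_pool(name=self.name)
            self.log.info(f'Deleted pool: {pool}')
        except PoolNotFound:
            # nothing to delete; treat as a no-op rather than failing the task
            self.log.info(f'Pool does not exist: {self.name}')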

Dave C