
I am looking at several open source workflow schedulers for a DAG of jobs with heterogeneous RAM usage. The scheduler should not only cap the number of concurrently running tasks, but should also keep the total RAM used by all concurrent tasks below the available memory.

In this Luigi Q&A, it was explained that

You can set how many of the resource is available in the config, and then how many of the resource the task consumes as a property on the task. This will then limit you to running n of that task at a time.

in config:

[resources]
api=1

in code for Task:

resources = {"api": 1}

For Airflow, I haven't been able to find the same functionality in its docs. The best that seems possible is to specify a number of available slots in a resource pool, and to also specify that a task instance uses a single slot in a resource pool. However, it appears there is no way to specify that a task instance uses more than one slot in a pool.
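To illustrate what I mean, here is a toy sketch of the accounting I'd like the scheduler to do (pure Python, nothing Airflow-specific; the `Pool` class and its names are my own invention):

```python
# Toy illustration of weighted pool accounting: each task declares how many
# slots (e.g. GB of RAM) it needs, and the pool admits a task only while the
# total claimed slots stay within capacity.
class Pool:
    def __init__(self, slots):
        self.slots = slots  # total capacity, e.g. available RAM in GB
        self.used = 0

    def try_acquire(self, n):
        """Claim n slots; return True on success, False if it would overflow."""
        if self.used + n <= self.slots:
            self.used += n
            return True
        return False

    def release(self, n):
        self.used -= n


ram = Pool(slots=64)            # 64 GB available
assert ram.try_acquire(48)      # a big task fits
assert not ram.try_acquire(32)  # a second big task must wait
assert ram.try_acquire(16)      # a small task still fits exactly
```

Airflow's pools let a task claim what amounts to one slot, but I see no way to make a task claim 48 of 64 slots as above.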

Question: specifically for Airflow, how can I specify a quantitative resource usage of a task instance?

TemplateRex
  • You could limit concurrency if your airflow uses Celery as an executor. Specifically, https://stackoverflow.com/questions/44979811/adding-extra-celery-configs-to-airflow has some details on it and the parameter you're looking for is CELERYD_CONCURRENCY – bartgras Sep 07 '18 at 03:55
  • @bartgras consider making your comment into an answer – TemplateRex Sep 12 '18 at 06:17

1 Answer


Assuming you're using the CeleryExecutor, then starting from Airflow version 1.9.0 you can manage Celery's task concurrency. This is not exactly the memory management you've been asking about, but rather the number of concurrent worker threads executing tasks.

The tweakable parameter is called CELERYD_CONCURRENCY, and here is a very nice explanation of how to manage Celery-related config in Airflow.
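For example, in airflow.cfg it would look roughly like this (a sketch assuming the 1.9-era key name under the [celery] section; later versions renamed it worker_concurrency, so check the docs for your version):

```
[celery]
celeryd_concurrency = 8
```

With this setting, each Celery worker executes at most 8 tasks at a time, regardless of which pools those tasks belong to.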

[Edit]

Actually, Pools can also be used to limit concurrency. Say you want to limit a resource-hungry task_id so that only 2 instances run at the same time. All you need to do is:

  • create a pool (in the UI: Admin -> Pools), give it a name, e.g. my_pool, and define the task's concurrency in the Slots field (in this case 2)

  • when instantiating the Operator that will execute this task_id, pass the defined pool name (pool='my_pool')
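In code the second step would look something like this (a sketch; the DAG name, BashOperator and the command are just placeholders, the pool parameter is what matters):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    "resource_hungry_dag",
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,
)

# Every instance of this task competes for the 2 slots in my_pool,
# so at most 2 run concurrently across all DAG runs.
hungry = BashOperator(
    task_id="resource_hungry_task",
    bash_command="run_heavy_job.sh",
    pool="my_pool",
    dag=dag,
)
```

Note this still counts each task instance as one slot; it limits how many instances run, not how much memory they use.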

bartgras