
Let's say I create a Dagster pipeline with the following solids:

  1. Execute SQL query from file and get results
  2. Write results to a table

I want to do this for, say, 10 different tables in parallel, with each table requiring a different SQL query. What would be the best approach?

Binh Pham

1 Answer


One approach is to use solid factories. run_query_solid_factory and write_results_solid_factory are factories that take inputs (such as the name and the query or table) and return a solid that can run in the pipeline. summary_report waits for all the upstream solids to complete before printing out summary info.

from dagster import pipeline, solid

def run_query_solid_factory(name, query):
    # Returns a solid that runs the given SQL query and emits its result.
    @solid(name=name)
    def _run_query(context):
        context.log.info(query)
        return 'result'

    return _run_query

def write_results_solid_factory(name, table):
    # Returns a solid that writes the upstream query result to the given table.
    @solid(name=name)
    def _write_results(context, query_result):
        context.log.info(table)
        context.log.info(query_result)
        return 'success'

    return _write_results

@solid
def summary_report(context, statuses):
    # Fan-in: receives the collected statuses from every write_results solid.
    context.log.info(' '.join(statuses))

# Named my_pipeline so it doesn't shadow the imported @pipeline decorator.
@pipeline
def my_pipeline():
    solid_output_handles = []
    queries = [('table', 'query'), ('table2', 'query2')]
    for table, query in queries:
        # Solid names must be unique valid identifiers, so key them off the
        # table name rather than the raw SQL text.
        get_data = run_query_solid_factory('run_query_for_{}'.format(table), query)
        write_results = write_results_solid_factory('write_results_to_table_{}'.format(table), table)
        solid_output_handles.append(write_results(get_data()))

    summary_report(solid_output_handles)

(Screenshots in the original answer: DAG structure; DAG execution logs.)
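To actually get the ten table branches running in parallel, the pipeline also needs an executor that supports it; the default in-process executor runs solids one at a time. Here is a minimal sketch of launching a parallel run, assuming a legacy Dagster version where the multiprocess executor and filesystem storage are selected via run config (the exact config keys and the max_concurrent value are assumptions and varied across releases):

from dagster import execute_pipeline, reconstructable

# The multiprocess executor needs a reconstructable pipeline so each worker
# process can re-import it; my_pipeline must be defined at module scope.
result = execute_pipeline(
    reconstructable(my_pipeline),
    run_config={
        # Run up to 10 solids at once in separate processes (illustrative value).
        'execution': {'multiprocess': {'config': {'max_concurrent': 10}}},
        # Multiprocess execution requires persistent intermediate storage.
        'storage': {'filesystem': {}},
    },
)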

Previous Answer:

I would recommend creating a composite_solid that consists of a solid that handles (1) and a solid that handles (2). Then you can alias the composite_solid once for each of the 10 tables, which will let you pass in the SQL query via config (see the tutorial).
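That earlier approach might look roughly like the sketch below. All names here are hypothetical, and it assumes a legacy Dagster version where @solid accepts a config_schema and invocations can be renamed with .alias():

from dagster import composite_solid, pipeline, solid

@solid(config_schema={'query': str})
def run_query(context):
    # The SQL text arrives via config rather than via a factory closure.
    context.log.info(context.solid_config['query'])
    return 'result'

@solid(config_schema={'table': str})
def write_results(context, query_result):
    context.log.info(context.solid_config['table'])
    context.log.info(query_result)
    return 'success'

@composite_solid
def query_and_write():
    # Composite wiring: step (1) feeds step (2).
    return write_results(run_query())

@pipeline
def aliased_pipeline():
    for table in ['table', 'table2']:
        # One alias per table; each alias is configured independently.
        query_and_write.alias('query_and_write_{}'.format(table))()

Each alias then gets its own query and table values under the solids section of the run config.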

Catherine Wu