
I am very new to Dagster and I can't find an answer to my question in the docs.

I have 2 solids: one yields tuples (str, str) parsed from an XML file, and the other consumes those tuples and stores objects in the DB with the corresponding fields set. However, I am running into the error "Core compute for solid returned an output multiple times". I am pretty sure I made a fundamental mistake in my design. Could someone explain how to design this pipeline correctly, or point me to the chapter in the docs that explains this error?


@solid(output_defs=[OutputDefinition(Tuple, 'classification_data')])
def extract_classification_from_file(context, xml_path: String) -> Tuple:
    context.log.info(f"start")
    root = ET.parse(xml_path).getroot()
    for code_node in root.findall('definition-item'):
        context.log.info(f"{code_node.find('classification-symbol').text} {code_node.find('definition-title').text}")
        yield Output((code_node.find('classification-symbol').text, code_node.find('definition-title').text), 'classification_data')


@solid()
def load_classification(context, classification_data):
    cls = CPCClassification.objects.create(code=classification_data[0], description=classification_data[1]).save()

@pipeline
def define_classification_pipeline():
    load_classification(extract_classification_from_file())
alexisdevarennes

2 Answers


I looked up your error in the Dagster codebase (found here), and it confirmed what I had read in the tutorial: "Output names must be unique".

Because you yield an Output with the same name, 'classification_data', on every pass of the for-loop, that output name is emitted more than once, which is what the error reports.
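One way around this, as a minimal sketch rather than an official Dagster recipe, is to collect all the pairs first and emit them as a single output, then let the downstream solid loop over the list. The sketch below uses only the standard library so it runs on its own. `extract_pairs`, `load_pairs`, `store`, and the sample XML are hypothetical names and data introduced for illustration, and it parses a string with `ET.fromstring` instead of a file path.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def extract_pairs(xml_text: str) -> List[Tuple[str, str]]:
    """Collect every (symbol, title) pair so it can be emitted as ONE output."""
    root = ET.fromstring(xml_text)
    return [
        (node.findtext("classification-symbol"), node.findtext("definition-title"))
        for node in root.findall("definition-item")
    ]


def load_pairs(pairs, store):
    """Consume the whole list in one call.

    `store` is a hypothetical stand-in for the database write.
    """
    for code, description in pairs:
        store(code, description)


# Made-up sample data, only to exercise the two functions.
SAMPLE_XML = """
<definitions>
  <definition-item>
    <classification-symbol>A01B</classification-symbol>
    <definition-title>Soil working</definition-title>
  </definition-item>
  <definition-item>
    <classification-symbol>A01C</classification-symbol>
    <definition-title>Planting</definition-title>
  </definition-item>
</definitions>
"""

pairs = extract_pairs(SAMPLE_XML)
saved = []
load_pairs(pairs, lambda code, description: saved.append((code, description)))
print(saved)
```

In the solids from the question, the first solid would yield `Output(pairs, 'classification_data')` exactly once after the loop has finished, and `load_classification` would iterate over `classification_data` and create one `CPCClassification` per pair.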


UPDATE: Following the issue you opened with the Dagster project, I tested creating Outputs dynamically at run time, and it works fine if the dynamic code is defined outside of a @solid. However, when I built the dynamic data inside a @solid, intending to use its output as the solid configuration for a successor @solid, the successor did not pick up the updated structure, and I received a dagster.core.errors.DagsterInvariantViolationError.

Below is the code I used to validate dynamic Output yielding at run time, with the dynamic data generated outside of a solid. I suspect this is a bit of an anti-pattern, though perhaps an acceptable one if Dagster is not yet mature enough to handle the scenario you describe. Also note that I did not handle doing anything with the yielded Output objects.

"""dagit -f dynamic_output_at_runtime.py -n dynamic_output_at_runtime"""
import random

from dagster import (
    Output,
    OutputDefinition,
    execute_pipeline,
    pipeline,
    solid,
    SystemComputeExecutionContext
)

# Create some dynamic OutputDefinition list for each execution
start = 1
stop = 100
limit = random.randint(1, 10)
random_set_of_ints = {random.randint(start, stop) for _ in range(limit)}
output_defs_runtime = [OutputDefinition(
    name=f'output_{num}') for num in random_set_of_ints]


@solid(output_defs=output_defs_runtime)
def ints_for_all(context: SystemComputeExecutionContext):
    for num in random_set_of_ints:
        out_name = f"output_{num}"
        context.log.info(f"output object name: {out_name}")
        yield Output(num, out_name)

@pipeline
def dynamic_output_at_runtime():
    x = ints_for_all()
    print(x)

if __name__ == '__main__':
    result = execute_pipeline(dynamic_output_at_runtime)
    assert result.success

Each run of this pipeline yields a different set of Outputs:

python dynamic_output_at_runtime.py 
_ints_for_all_outputs(output_56=<dagster.core.definitions.composition.InvokedSolidOutputHandle object at 0x7fb899cea160>, output_8=<dagster.core.definitions.composition.InvokedSolidOutputHandle object at 0x7fb899cea198>, output_58=<dagster.core.definitions.composition.InvokedSolidOutputHandle object at 0x7fb899cea1d0>, output_35=<dagster.core.definitions.composition.InvokedSolidOutputHandle object at 0x7fb899cea208>)
2019-11-27 08:33:32 - dagster - DEBUG - dynamic_output_at_runtime - a1273816-16b0-439b-ae32-dbd819f65b9a - PIPELINE_START - Started execution of pipeline "dynamic_output_at_runtime".
2019-11-27 08:33:32 - dagster - DEBUG - dynamic_output_at_runtime - a1273816-16b0-439b-ae32-dbd819f65b9a - ENGINE_EVENT - Executing steps in process (pid: 9456)
 event_specific_data = {"metadata_entries": [["pid", null, ["9456"]], ["step_keys", null, ["{'ints_for_all.compute'}"]]]}
2019-11-27 08:33:32 - dagster - DEBUG - dynamic_output_at_runtime - a1273816-16b0-439b-ae32-dbd819f65b9a - STEP_START - Started execution of step "ints_for_all.compute".
               solid = "ints_for_all"
    solid_definition = "ints_for_all"
            step_key = "ints_for_all.compute"
2019-11-27 08:33:32 - dagster - INFO - system - a1273816-16b0-439b-ae32-dbd819f65b9a - output object name: output_56
               solid = "ints_for_all"
    solid_definition = "ints_for_all"
            step_key = "ints_for_all.compute"
2019-11-27 08:33:32 - dagster - DEBUG - dynamic_output_at_runtime - a1273816-16b0-439b-ae32-dbd819f65b9a - STEP_OUTPUT - Yielded output "output_56" of type "Any". (Type check passed).
 event_specific_data = {"intermediate_materialization": null, "step_output_handle": ["ints_for_all.compute", "output_56"], "type_check_data": [true, "output_56", null, []]}
               solid = "ints_for_all"
    solid_definition = "ints_for_all"
            step_key = "ints_for_all.compute"
2019-11-27 08:33:32 - dagster - INFO - system - a1273816-16b0-439b-ae32-dbd819f65b9a - output object name: output_8
               solid = "ints_for_all"
    solid_definition = "ints_for_all"
            step_key = "ints_for_all.compute"
2019-11-27 08:33:32 - dagster - DEBUG - dynamic_output_at_runtime - a1273816-16b0-439b-ae32-dbd819f65b9a - STEP_OUTPUT - Yielded output "output_8" of type "Any". (Type check passed).
 event_specific_data = {"intermediate_materialization": null, "step_output_handle": ["ints_for_all.compute", "output_8"], "type_check_data": [true, "output_8", null, []]}
               solid = "ints_for_all"
    solid_definition = "ints_for_all"
            step_key = "ints_for_all.compute"
2019-11-27 08:33:32 - dagster - INFO - system - a1273816-16b0-439b-ae32-dbd819f65b9a - output object name: output_58
               solid = "ints_for_all"
    solid_definition = "ints_for_all"
            step_key = "ints_for_all.compute"
2019-11-27 08:33:32 - dagster - DEBUG - dynamic_output_at_runtime - a1273816-16b0-439b-ae32-dbd819f65b9a - STEP_OUTPUT - Yielded output "output_58" of type "Any". (Type check passed).
 event_specific_data = {"intermediate_materialization": null, "step_output_handle": ["ints_for_all.compute", "output_58"], "type_check_data": [true, "output_58", null, []]}
               solid = "ints_for_all"
    solid_definition = "ints_for_all"
            step_key = "ints_for_all.compute"
2019-11-27 08:33:32 - dagster - INFO - system - a1273816-16b0-439b-ae32-dbd819f65b9a - output object name: output_35
               solid = "ints_for_all"
    solid_definition = "ints_for_all"
            step_key = "ints_for_all.compute"
2019-11-27 08:33:32 - dagster - DEBUG - dynamic_output_at_runtime - a1273816-16b0-439b-ae32-dbd819f65b9a - STEP_OUTPUT - Yielded output "output_35" of type "Any". (Type check passed).
 event_specific_data = {"intermediate_materialization": null, "step_output_handle": ["ints_for_all.compute", "output_35"], "type_check_data": [true, "output_35", null, []]}
               solid = "ints_for_all"
    solid_definition = "ints_for_all"
            step_key = "ints_for_all.compute"
2019-11-27 08:33:32 - dagster - DEBUG - dynamic_output_at_runtime - a1273816-16b0-439b-ae32-dbd819f65b9a - STEP_SUCCESS - Finished execution of step "ints_for_all.compute" in 2.17ms.
 event_specific_data = {"duration_ms": 2.166192003642209}
               solid = "ints_for_all"
    solid_definition = "ints_for_all"
            step_key = "ints_for_all.compute"
2019-11-27 08:33:32 - dagster - DEBUG - dynamic_output_at_runtime - a1273816-16b0-439b-ae32-dbd819f65b9a - ENGINE_EVENT - Finished steps in process (pid: 9456) in 3.11ms
 event_specific_data = {"metadata_entries": [["pid", null, ["9456"]], ["step_keys", null, ["{'ints_for_all.compute'}"]]]}
2019-11-27 08:33:32 - dagster - DEBUG - dynamic_output_at_runtime - a1273816-16b0-439b-ae32-dbd819f65b9a - PIPELINE_SUCCESS - Finished execution of pipeline "dynamic_output_at_runtime".

I hope this helps!

  • I feel like I am missing something. Is there no way to make a solid act asynchronously? Do I have to define all my outputs beforehand, or just return a list of values? What's the point of using a yield statement then? – Yevgen Ponomarenko Nov 26 '19 at 21:28
  • See my updated answer above. It doesn't answer your question on the asynchronous aspect of solids, but I think it helps with other aspects. – Tyler Seader Nov 27 '19 at 14:46
  • Wow, thanks for the update. I think this somehow solves the problem of yielding results from solids. I will try it out and see how it works out. – Yevgen Ponomarenko Nov 27 '19 at 14:53
  • Note that I added yet another update... I wasn't exactly following what you were doing. You were attempting to create dynamic output from an upstream `@solid` where I was not. I attempted to align more closely to your approach and faced issues with an upstream `@solid` output being used as a downstream `@solid` output_defs. This behavior might be occurring due to the way a pipeline is built and defined before it actually runs. – Tyler Seader Nov 27 '19 at 15:02
0

What I was doing here is not supported behavior. See the official statement in the issue.