
Is there a way to cache the output of the solids in the pipeline in such a way that if I run the same pipeline but with a slightly different configuration (think hyper-parameter tuning), certain initial steps in the pipelines that are unaffected by the configuration changes will not be executed multiple times?

Raw data -> CPU expensive preprocessing (A) -> model fitting (B) -> model

I want to be able to run A once, but multiple variations of B.

Is there an elegant way to do this in Dagster?
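In plain Python, the behaviour I'm after would look like the sketch below (the function names and numbers are placeholder stand-ins, not Dagster code): the expensive step runs once, and only the configurable step is repeated per hyper-parameter.

```python
def preprocess(raw):
    """Stand-in for solid A: the CPU-expensive preprocessing."""
    return [x * 2 for x in raw]

def fit_model(features, lr):
    """Stand-in for solid B: depends on the run configuration (here, lr)."""
    return {"lr": lr, "score": sum(features) * lr}

raw = [1, 2, 3]
features = preprocess(raw)  # A executes a single time...
# ...while B runs once per configuration variant.
models = [fit_model(features, lr) for lr in (0.1, 0.01)]
```

The question is whether Dagster can give me this "run A once, sweep B" behaviour across separate pipeline runs.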

sophros
moomima
  • https://stackoverflow.com/questions/59050671/core-compute-for-solid-returned-an-output-multiple-times Here are some answers for you. – muTheTechie Dec 06 '20 at 20:56

1 Answer


I'm not aware of this functionality existing.

Dagster can re-run individual solids when intermediate storage is set to the filesystem, but I haven't seen anything on caching like what you're describing. You could submit an issue to Dagster if this doesn't get much traction here, and then report back.

A few possible workarounds:

  1. One option would be to materialize the data and add logic to your solids that checks whether that data already exists at some location. If it does, return it; if it doesn't, re-process and persist it. This pattern puts the burden on you to ensure that only the desired files are persisted. Given the potentially varying areas of mutability in this open-ended scenario, this might be the easiest option.
  2. Alternatively, you could assemble a new pipeline after each experiment, composed only of the solids that need to run again, plus new solids that read the persisted data from files and pass it to the downstream solids. Those read-in-data solids could all be one reusable, aliased solid.
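To make the first workaround concrete, here is a minimal sketch of the check-then-load logic in plain Python. The cache path and function names are placeholders; in Dagster, this body would live inside your preprocessing solid:

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical location where the solid materializes its output.
CACHE = Path(tempfile.gettempdir()) / "preprocessed.pkl"

def expensive_preprocess(raw):
    """Stand-in for the real CPU-heavy preprocessing."""
    return [x * 2 for x in raw]

def preprocess_solid_body(raw):
    """Logic to put inside solid A: load the materialized output
    if it exists, otherwise compute it and persist it."""
    if CACHE.exists():
        return pickle.loads(CACHE.read_bytes())
    result = expensive_preprocess(raw)
    CACHE.write_bytes(pickle.dumps(result))
    return result
```

On the first run the solid computes and writes the file; on every later run (e.g. each hyper-parameter variation of B) it returns the cached result without redoing the work. You are responsible for invalidating the file when the raw data or preprocessing logic changes.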