
Is there a way to cache the output of the solids in the pipeline in such a way that if I run the same pipeline but with a slightly different configuration (think hyper-parameter tuning), certain initial steps in the pipelines that are unaffected by the configuration changes will not be executed multiple times?

Raw data -> CPU expensive preprocessing (A) -> model fitting (B) -> model

I want to be able to run A once, but multiple variations of B.

Is there an elegant way to do this in Dagster?
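In plain Python, the behaviour I'm after would look like the sketch below (the function names and numbers are placeholder stand-ins, not Dagster code): the expensive step runs once, and only the configurable step is repeated per hyper-parameter.

```python
def preprocess(raw):
    """Stand-in for solid A: the CPU-expensive preprocessing."""
    return [x * 2 for x in raw]

def fit_model(features, lr):
    """Stand-in for solid B: depends on the run configuration (here, lr)."""
    return {"lr": lr, "score": sum(features) * lr}

raw = [1, 2, 3]
features = preprocess(raw)  # A executes a single time...
# ...while B runs once per configuration variant.
models = [fit_model(features, lr) for lr in (0.1, 0.01)]
```

The question is whether Dagster can give me this "run A once, sweep B" behaviour across separate pipeline runs.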

sophros
moomima
  • https://stackoverflow.com/questions/59050671/core-compute-for-solid-returned-an-output-multiple-times Here are some answers for you. – muTheTechie Dec 06 '20 at 20:56

1 Answer


I'm not aware of this functionality existing.

Dagster can re-run individual solids when intermediate storage is set to the filesystem, but I haven't seen anything on caching like what you're describing. You could submit an issue to Dagster if this doesn't get much traction here, and then report back.

A few possible workarounds:

  1. One option would be to materialize the data and add logic to your solids that checks whether that data already exists at some location. If it does, return it; if it doesn't, re-process and persist it. This pattern puts the burden on you to ensure that only the desired files are persisted. Given the potentially varying areas of mutability in this open-ended scenario, this might be the easiest option.
  2. Alternatively, you could assemble a new pipeline after each experiment, composed only of the solids that need to run again, plus new solids that read the persisted data from files and pass it to the downstream solids. Those read-in-data solids could all be one reusable, aliased solid.
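To make the first workaround concrete, here is a minimal sketch of the check-then-load logic in plain Python. The cache path and function names are placeholders; in Dagster, this body would live inside your preprocessing solid:

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical location where the solid materializes its output.
CACHE = Path(tempfile.gettempdir()) / "preprocessed.pkl"

def expensive_preprocess(raw):
    """Stand-in for the real CPU-heavy preprocessing."""
    return [x * 2 for x in raw]

def preprocess_solid_body(raw):
    """Logic to put inside solid A: load the materialized output
    if it exists, otherwise compute it and persist it."""
    if CACHE.exists():
        return pickle.loads(CACHE.read_bytes())
    result = expensive_preprocess(raw)
    CACHE.write_bytes(pickle.dumps(result))
    return result
```

On the first run the solid computes and writes the file; on every later run (e.g. each hyper-parameter variation of B) it returns the cached result without redoing the work. You are responsible for invalidating the file when the raw data or preprocessing logic changes.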