I have a workflow that I'll describe as follows:
[ Dump(query) ] -----+
                     |
                     +---> [ Parquet(dump, schema) ] ---> [ Hive(parquet) ]
                     |
[ Schema(query) ] ---+
Where:
- query is a query to an RDBMS
- Dump dumps the result of query to a CSV file, dump
- Schema runs the query and xcoms its schema, schema
- Parquet reads the CSV dump and uses schema to create a Parquet file, parquet
- Hive creates a Hive table based on the Parquet file parquet
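To make the data flow concrete, here is a plain-Python sketch of the wiring; the function bodies are hypothetical stand-ins for the real operators (which would be Airflow tasks), each returning a token describing its output:

```python
# Hypothetical stand-ins for the real stages; each returns a token
# describing its output so the wiring is runnable as-is.
def dump(query):              # dumps the query result to a CSV file
    return f"csv-for-{query}"

def schema(query):            # runs the query and records its schema
    return f"schema-for-{query}"

def parquet(csv_path, sch):   # combines the CSV and schema into a Parquet file
    return f"parquet({csv_path}, {sch})"

def hive(parquet_file):       # creates a Hive table over the Parquet file
    return f"hive-table({parquet_file})"

def run_pipeline(query):
    # Dump and Schema are independent of each other; Parquet needs both,
    # and Hive needs Parquet -- exactly the fan-in from the diagram above.
    return hive(parquet(dump(query), schema(query)))
```

In the actual DAG, each of these would be a task, with Schema handing its result downstream via XCom.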
The reasons behind this somewhat convoluted workflow are constraints that cannot be lifted and that lie outside the scope of the question (but yeah, ideally it would be much simpler than this).
My question is about rolling back the effects of a pipeline in case of failure.
These are the rollbacks that I would like to see happen in different conditions:
- dump should always be deleted, regardless of the end result of the pipeline
- parquet should be deleted if, for whatever reason, the Hive table creation fails
Representing this in a workflow, I'd probably put it down like this:
[ Dump(query) ] -----+
                     |
                     +---> [ Parquet(dump, schema) ] ---> [ Hive(parquet) ]
                     |                 |                        |
[ Schema(query) ] ---+                 v                        v
                          [ DeleteParquetOutput ] ---> [ DeleteDumpOutput ]
Where the transition from Parquet to DeleteParquetOutput is performed only if an error occurs, and the transitions going into DeleteDumpOutput occur regardless of any failure among their dependencies.
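In Airflow terms, those two kinds of transitions correspond to trigger rules: one_failed for DeleteParquetOutput and all_done for DeleteDumpOutput. A rough plain-Python approximation of that decision, assuming upstream task states are reported as "success" or "failed":

```python
def should_run(trigger_rule, upstream_states):
    """Approximate Airflow's trigger-rule decision for a cleanup task."""
    done = all(s in ("success", "failed") for s in upstream_states)
    if trigger_rule == "all_done":     # run once upstreams finish, no matter how
        return done
    if trigger_rule == "one_failed":   # run only if something upstream failed
        return any(s == "failed" for s in upstream_states)
    raise ValueError(f"unsupported trigger rule: {trigger_rule}")

# DeleteParquetOutput fires only when its upstream (Hive) failed:
assert should_run("one_failed", ["failed"])
assert not should_run("one_failed", ["success"])
# DeleteDumpOutput fires either way, once its upstreams are done:
assert should_run("all_done", ["success", "failed"])
```

(The real one_failed rule fires as soon as any upstream fails, without waiting for the rest; this sketch only captures the intent.)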
This should solve it, but I believe that more complex pipelines could suffer greatly from the increased complexity of this error handling logic.
Before moving on to more details, my question: could this be considered a good practice when it comes to dealing with errors in an Airflow pipeline? What could be a different (and possibly more sustainable) approach?
If you are further interested in how I would like to solve this, read on, otherwise feel free to answer and/or comment.
My take on error handling in a pipeline
Ideally, what I'd like to do would be:
- define a rollback procedure for each stage where it's relevant
- for each rollback procedure, define whether it should only happen in case of failure or in any case
- when the pipeline completes, reverse the dependency relationships and, starting from the last successful tasks, traverse the reversed DAG and run the relevant rollback procedures (where applicable)
- errors from rollback procedures should be logged but not taken into account to complete the rollback of the whole pipeline
- for the previous point to hold, each task should define a single effect whose rollback procedure can be described without referencing other tasks
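The traversal described above can be sketched in plain Python; the task names and the "always"/"on_failure" mode flags are hypothetical, and a real implementation would have to hook into the scheduler:

```python
import logging

def topo_order(deps):
    """Topological order of the DAG; deps maps task -> set of upstream tasks."""
    order, seen = [], set()
    def visit(task):
        if task in seen:
            return
        seen.add(task)
        for up in deps[task]:
            visit(up)
        order.append(task)
    for task in deps:
        visit(task)
    return order

def run_rollbacks(deps, rollbacks, mode, pipeline_failed):
    """Walk the reversed DAG (last tasks first) and run each applicable
    rollback; rollback errors are logged but never stop the traversal."""
    ran = []
    for task in reversed(topo_order(deps)):
        undo = rollbacks.get(task)
        if undo is None:
            continue
        if mode[task] == "always" or pipeline_failed:
            try:
                undo()
                ran.append(task)
            except Exception:
                logging.exception("rollback of %s failed", task)
    return ran

# The example pipeline: dump is always cleaned up, parquet only on failure.
deps = {"dump": set(), "schema": set(),
        "parquet": {"dump", "schema"}, "hive": {"parquet"}}
rollbacks = {"dump": lambda: None, "parquet": lambda: None}
mode = {"dump": "always", "parquet": "on_failure"}
```

With pipeline_failed=False, only dump's rollback runs; with pipeline_failed=True, parquet's rollback runs first and then dump's.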
Let's make a couple of examples with the given pipeline.
Scenario 1: Success
We reverse the DAG and fill each task with its mandatory rollback procedure (if any), getting this
                                         +---> [ Dump: UNDO ]
                                         |
[ Hive: None ] ---> [ Parquet: None ] ---+
      ^                                  |
      |                                  +---> [ Schema: None ]
      +--- Start here
Scenario 2: Failure occurs at Hive
                                                 +---> [ Dump: UNDO ]
                                                 |
[ Hive: None ] ---> [ Parquet: UNDO (error) ] ---+
      ^                                          |
      |                                          +---> [ Schema: None ]
      +--- Start here
Is there any way to represent something like this in Airflow? I would also be open to evaluating different workflow automation solutions, should they enable this kind of approach.