In Kedro, we can chain nodes into a pipeline and run only some of them. When running a subset of nodes, the outputs of the earlier nodes need to be saved somewhere so that a later node can access the data they generated. In which file should the code for this go: pipeline.py, run.py or nodes.py?
For instance, I am trying to save a directory path directly to the DataCatalog under the name 'model_path'.
Snippet from pipeline.py:
# A mapping from a pipeline name to a ``Pipeline`` object.
def create_pipelines(**kwargs) -> Dict[str, Pipeline]:
    io = DataCatalog(dict(
        model_path=MemoryDataSet()
    ))

    io.save('model_path', "data/06_models/model_test")
    print('****', io.exists('model_path'))

    pipeline = Pipeline([
        node(
            split_files,
            ["data_csv", "parameters"],
            ["train_filenames", "val_filenames", "train_labels", "val_labels"],
            name="splitting filenames"
        ),
        # node(
        #     create_and_train,
        #     ["train_filenames", "val_filenames", "train_labels", "val_labels", "parameters"],
        #     "model_path",
        #     name="Create Dataset, Train and Save Model"
        # ),
        node(
            validate_model,
            ["val_filenames", "val_labels", "model_path"],
            None,
            name="Validate Model",
        )
    ]).decorate(decorators.log_time, decorators.mem_profile)

    return {
        "__default__": pipeline
    }
However, I get the following error when I run kedro run:
ValueError: Pipeline input(s) {'model_path'} not found in the DataCatalog
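To make the error concrete, here is a simplified, dependency-free model of the check that seems to be failing. It is not Kedro's actual code. The names Node, free_inputs and missing_from_catalog are hypothetical. The contents of run_catalog are an assumption: since the error names only 'model_path', 'data_csv' and 'parameters' are presumably already registered. The sketch shows that with create_and_train commented out, 'model_path' becomes an input that no node produces, so it has to exist in the catalog the run actually uses. The local io object built inside create_pipelines is never handed to the runner, so it does not count.

```python
from typing import Iterable, List, NamedTuple, Set


class Node(NamedTuple):
    """Hypothetical stand-in for a pipeline node: just names, no functions."""
    name: str
    inputs: List[str]
    outputs: List[str]


def free_inputs(nodes: Iterable[Node]) -> Set[str]:
    """Dataset names consumed by some node but produced by no node."""
    nodes = list(nodes)
    produced = {out for n in nodes for out in n.outputs}
    consumed = {inp for n in nodes for inp in n.inputs}
    return consumed - produced


def missing_from_catalog(nodes: Iterable[Node], catalog_names: Iterable[str]) -> Set[str]:
    """Free inputs that the given catalog does not know about."""
    return free_inputs(nodes) - set(catalog_names)


# The two active nodes from the snippet (create_and_train is commented out).
nodes = [
    Node(
        "splitting filenames",
        ["data_csv", "parameters"],
        ["train_filenames", "val_filenames", "train_labels", "val_labels"],
    ),
    Node("Validate Model", ["val_filenames", "val_labels", "model_path"], []),
]

# Assumption: names registered in the catalog that the run uses.
run_catalog = {"data_csv", "parameters"}

missing = missing_from_catalog(nodes, run_catalog)
print(missing)  # {'model_path'}
```

In this model, the missing set becomes empty either when 'model_path' is added to run_catalog or when a node that outputs 'model_path' is put back into the pipeline.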