I am using tf.data.Dataset to prepare a streaming dataset that is used to train a tf.keras model. With Kedro, is there a way to create a node that returns the created tf.data.Dataset so it can be used in the next training node?

The MemoryDataset will probably not work, because a tf.data.Dataset cannot be pickled (a deepcopy isn't possible); see also this SO question. According to issue #91, the deep copy in MemoryDataset is done to prevent the data from being modified by some other node. Can someone please elaborate a bit more on why/how this concurrent modification could happen?
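
My own reading of issue #91, illustrated with purely hypothetical node functions (not Kedro code): without the deep copy, a node that mutates its input in place would change what every later node consuming the same dataset sees.

# Purely illustrative, hypothetical node functions.
def scale_prices(data: dict) -> dict:
  data["prices"] = [p * 2 for p in data["prices"]]  # mutates the shared dict in place
  return data

def average_price(data: dict) -> float:
  return sum(data["prices"]) / len(data["prices"])

shared = {"prices": [1.0, 2.0, 3.0]}
scale_prices(shared)
print(average_price(shared))  # 4.0 instead of the original 2.0 -- the shared input was changed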

From the docs, there seems to be a copy_mode = "assign" option. Would it be possible to use this option when the data is not picklable?

Another solution (also mentioned in issue #91) is to just use a function to generate the streaming tf.data.Dataset inside the training node, without having a preceding dataset-generation node. However, I am not sure what the drawbacks of this approach would be (if any). It would be great if someone could give some examples.
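
For illustration, this is roughly what I mean; the function and parameter names below are just placeholders:

import tensorflow as tf

def train_model(features, labels, model: tf.keras.Model) -> tf.keras.Model:
  # Hypothetical training node: the tf.data.Dataset is built inside the node,
  # so the non-picklable dataset object never crosses a node boundary.
  dataset = (
      tf.data.Dataset.from_tensor_slices((features, labels))
      .shuffle(buffer_size=1024)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE)
  )
  model.fit(dataset, epochs=1)
  return model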

Also, I would like to avoid storing the complete output of the streaming dataset, for example by using TFRecords or tf.data.experimental.save, as these options would use a lot of disk storage.

Is there a way to pass just the created tf.data.Dataset object on to the training node?


1 Answer


Providing a workaround here for the benefit of the community, though it was originally presented on kedro.community by @DataEngineerOne.

According to @DataEngineerOne:

With Kedro, is there a way to create a node that returns the created tf.data.Dataset so it can be used in the next training node?

Yes, absolutely!

Can someone please elaborate a bit more on why/how this concurrent modification could happen?

From the docs, there seems to be a copy_mode = "assign" option. Would it be possible to use this option when the data is not picklable?

I have yet to try this option, but it should theoretically work. All you would need to do is create a new dataset entry in the catalog.yml file that includes the copy_mode option.

Ex:

# catalog.yml
tf_data:
  type: MemoryDataSet
  copy_mode: assign

# pipeline.py
node(
  tf_generator,
  inputs=...,
  outputs="tf_data",
)

I cannot vouch for this solution, but give it a go and let me know if it works for you.
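
For completeness, here is a rough sketch of what the two node functions might look like; the names are just placeholders and I have not run this:

# nodes.py (illustrative only; names are placeholders)
import tensorflow as tf

def tf_generator(params: dict) -> tf.data.Dataset:
  # The returned dataset is stored in the "tf_data" MemoryDataSet; with
  # copy_mode: assign it is handed to the next node without being copied.
  return tf.data.Dataset.from_tensor_slices(params["values"]).batch(2)

def train_model(tf_data: tf.data.Dataset) -> None:
  # Receives the very same tf.data.Dataset object created above.
  for batch in tf_data:
    print(batch.numpy())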

Another solution (also mentioned in issue #91) is to just use a function to generate the streaming tf.data.Dataset inside the training node, without having a preceding dataset-generation node. However, I am not sure what the drawbacks of this approach would be (if any). It would be great if someone could give some examples.

This is also a great alternative solution, and I think (a guess) that the MemoryDataSet will automatically use assign in this case rather than its normal deepcopy, so you should be alright.

# node.py
import tensorflow as tf

def generate_tf_data(...):
  # Return a zero-argument factory instead of the dataset itself, so the
  # non-copyable tf.data.Dataset is only created inside the consuming node.
  tensor_slices = [1, 2, 3]
  def _tf_data():
    dataset = tf.data.Dataset.from_tensor_slices(tensor_slices)
    return dataset
  return _tf_data

def use_tf_data(tf_data_func):
  # Call the factory inside the training node to materialise the dataset.
  dataset = tf_data_func()

# pipeline.py
from kedro.pipeline import Pipeline, node

Pipeline([
  node(
    generate_tf_data,
    inputs=...,
    outputs='tf_data_func',
  ),
  node(
    use_tf_data,
    inputs='tf_data_func',
    outputs=...,
  ),
])

The only drawback here is the additional complexity. For more details you can refer here.
