I want to stack two Dataset objects in TensorFlow (like the rbind function in R). I created dataset A from TFRecord files and dataset B from NumPy arrays. Both have the same variables. Is there a way to stack these two datasets to create a bigger one? Or to create an iterator that reads randomly from these two sources?

Thanks

Kent930

2 Answers

The tf.data.Dataset.concatenate() method is the closest analog of tf.stack() when working with datasets. If you have two datasets with the same structure (i.e. same types for each component, but possibly different shapes):

import tensorflow as tf

dataset_1 = tf.data.Dataset.range(10, 20)
dataset_2 = tf.data.Dataset.range(60, 70)

then you can concatenate them as follows:

combined_dataset = dataset_1.concatenate(dataset_2)
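
As a minimal sketch of the question's scenario, assuming both sources yield elements of the same type: the snippet below builds the second dataset from a NumPy array and checks the order of the concatenated elements. The array `extra_values` is a made-up stand-in, and `Dataset.range` stands in for the TFRecord-backed dataset.

```python
import numpy as np
import tensorflow as tf

# Stand-in for the TFRecord-backed dataset from the question (assumption).
dataset_1 = tf.data.Dataset.range(10, 20)

# Hypothetical NumPy data; its dtype must match dataset_1 (int64),
# otherwise concatenate() rejects the pair.
extra_values = np.arange(60, 70, dtype=np.int64)
dataset_2 = tf.data.Dataset.from_tensor_slices(extra_values)

combined_dataset = dataset_1.concatenate(dataset_2)

# All elements of dataset_1 come first, followed by all elements of dataset_2.
combined_values = [int(x) for x in combined_dataset.as_numpy_iterator()]
print(combined_values)
```

Because the order is preserved, a `shuffle()` on the combined dataset would still be needed if the two sources should be mixed.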
Frodon
mrry
  • Adding to mrry's answer, there's also https://www.tensorflow.org/api_docs/python/tf/data/Dataset#interleave, which allows you to merge datasets instead of concatenating them. You can then use `Dataset.shuffle()` to randomize a batch of interleaved records. – djma Nov 28 '18 at 23:46
  • I don't think that `tf.data.Dataset.concatenate()` has any resemblance to `tf.stack()`. `concatenate()` uses an existing dimension, `stack()` creates a new one. This is exactly the same in `numpy`, compare `np.concatenate()` and `np.stack()`. – bers Jan 13 '20 at 08:19
  • From my tensorboard profiling, it seems like concatenation happens every epoch. Is there a way to perform it only once when pre-processing? – Dr_Zaszuś May 02 '21 at 15:59
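
For the "randomly read from two sources" part of the question, one possible alternative to interleaving plus shuffling is `tf.data.Dataset.sample_from_datasets`. The sketch below is only an illustration: it assumes TensorFlow 2.7 or newer (older releases expose the same function as `tf.data.experimental.sample_from_datasets`), and the equal weights and the seed are arbitrary choices.

```python
import tensorflow as tf

dataset_1 = tf.data.Dataset.range(10, 20)
dataset_2 = tf.data.Dataset.range(60, 70)

# Draw each next element from one of the two datasets at random.
# weights and seed are illustrative values, not recommendations.
mixed_dataset = tf.data.Dataset.sample_from_datasets(
    [dataset_1, dataset_2], weights=[0.5, 0.5], seed=0
)

# With the default stop_on_empty_dataset=False, sampling continues until
# both inputs are exhausted, so every element appears exactly once.
mixed_values = [int(x) for x in mixed_dataset.as_numpy_iterator()]
print(mixed_values)
```

Within each source the original order is kept; only the choice of which source to read from next is random.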

If by stacking you mean what tf.stack() and np.stack() do:

Stacks a list of rank-R tensors into one rank-(R+1) tensor.

https://www.tensorflow.org/api_docs/python/tf/stack

Join a sequence of arrays along a new axis.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.stack.html

then I believe the closest you can come with a tf.data.Dataset is Dataset.zip():

@staticmethod
zip(datasets)

Creates a Dataset by zipping together the given datasets.

https://www.tensorflow.org/api_docs/python/tf/data/Dataset?version=stable#zip

This allows you to iterate through multiple datasets at the same time by iterating over the shared dimension of the original datasets, similarly to a stack()ed tensor or matrix.

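You can then also use .map(lambda *t: tf.stack(t)) or .map(lambda *t: tf.stack(t, axis=-1)) to stack the tensors along a new dimension at the front or back, respectively. (Passing tf.stack directly to .map() does not work here, because .map() unpacks each zipped tuple into separate arguments, so the second tensor would be taken as the axis.)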
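
A small sketch of the zip-then-stack idea, assuming two datasets built from made-up NumPy arrays (`features_a` and `features_b` are hypothetical names) whose elements have shape (3,):

```python
import numpy as np
import tensorflow as tf

# Hypothetical inputs: 4 elements each, every element a vector of length 3.
features_a = np.arange(12, dtype=np.float32).reshape(4, 3)
features_b = features_a + 100.0

ds_a = tf.data.Dataset.from_tensor_slices(features_a)
ds_b = tf.data.Dataset.from_tensor_slices(features_b)

# Each zipped element is a tuple (a_i, b_i).
zipped = tf.data.Dataset.zip((ds_a, ds_b))

# map() unpacks the tuple into positional arguments, hence the *t.
stacked_front = zipped.map(lambda *t: tf.stack(t, axis=0))   # new axis first
stacked_back = zipped.map(lambda *t: tf.stack(t, axis=-1))   # new axis last

front_shape = stacked_front.element_spec.shape.as_list()
back_shape = stacked_back.element_spec.shape.as_list()
num_elements = len(list(stacked_front.as_numpy_iterator()))
print(front_shape, back_shape, num_elements)
```

The number of elements stays the same as in the inputs; only the shape of each element changes, which is the difference from `Dataset.concatenate()`.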

If you instead want to achieve what tf.concat() and np.concatenate() do, then use Dataset.concatenate().

bers