23

I have a long list of lists of integers (representing sentences, each of a different size) that I want to feed using the tf.data library. Each inner list has a different length, and I get an error, which I can reproduce here:

import tensorflow as tf

t = [[4, 2], [3, 4, 5]]
dataset = tf.data.Dataset.from_tensor_slices(t)

The error I get is:

ValueError: Argument must be a dense tensor: [[4, 2], [3, 4, 5]] - got shape [2], but wanted [2, 2].

Is there a way to do this?

EDIT 1: Just to be clear, I don't want to pad the input list of lists (it's a list of over a million sentences, with varying lengths). I want to use the tf.data library to feed, in a proper way, a list of lists with varying lengths.

mrry
Escachator
  • maybe using the mapping function in some way? – Escachator Nov 30 '17 at 21:12
  • Duplicate of [this question](https://stackoverflow.com/questions/40450506/convert-a-list-with-non-fixed-length-elements-to-tensor). – scrpy Dec 04 '17 at 12:33
  • Just to be clear: I don't want to pad the tensor, I want to be able to feed, using the data library, a list of lists with different length. – Escachator Dec 04 '17 at 13:24
  • @landogar note that my specific question is "how to input a list of lists with different sizes in tf.data.Dataset", which is different to the question you link: "Convert a list with non-fixed length elements to tensor". – Escachator Dec 04 '17 at 14:29
  • If you pass the list of sentences (a list of strings) to `tf.data.Dataset.from_tensor_slices` it should work, and you should then be able to transform each sentence to a list of integers using `dataset.map(your_function)`. You can then use `dataset.padded_batch` to automatically add the padding (a sketch of this pipeline follows the comments). – Olivier Moindrot Dec 05 '17 at 04:02
  • This example can be useful: https://github.com/tensorflow/nmt#data-input-pipeline – Olivier Moindrot Dec 05 '17 at 04:12
  • Hi @OlivierMoindrot, I have seen that example. My concern is: do the map functions execute when you run the graph during training (i.e. every time you feed new data to the model), or are they executed over the whole dataset before training, with the result then fed in? The first seems much slower for training than the second, and that's what I wanted to avoid. – Escachator Dec 05 '17 at 09:23
  • This is the whole point of `tf.data`: it uses queues in the background and only processes data as needed. You can "prefetch" data to make sure that your GPU is never waiting for data and is working at 100%. As data is consumed at one end (for training), the queues earlier in the pipeline get filled up with data. You can even have multiple workers with `num_parallel_calls`. – Olivier Moindrot Dec 05 '17 at 09:28
  • that sounds very cool, I'll give it a try. How do you set it up to prefetch data? – Escachator Dec 05 '17 at 09:33
  • `dataset.prefetch` – Olivier Moindrot Dec 05 '17 at 17:24
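
The pipeline Olivier describes, as a minimal sketch (not from the thread; it assumes the sentences are stored as space-separated id strings and uses the TF 1.x API):

import tensorflow as tf

sentences = ["4 2", "3 4 5"]  # hypothetical: each sentence as a space-separated string of ids

def to_ids(sentence):
    # Split on whitespace and convert each token to an int32 id.
    tokens = tf.string_split([sentence]).values
    return tf.string_to_number(tokens, out_type=tf.int32)

dataset = (tf.data.Dataset.from_tensor_slices(sentences)
           .map(to_ids, num_parallel_calls=4)      # runs lazily, element by element
           .padded_batch(2, padded_shapes=[None])  # pad only to the longest sentence in each batch
           .prefetch(1))                           # keep the next batch ready while the model trains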

4 Answers

21

You can use tf.data.Dataset.from_generator() to convert any iterable Python object (like a list of lists) into a Dataset:

import tensorflow as tf

t = [[4, 2], [3, 4, 5]]

dataset = tf.data.Dataset.from_generator(lambda: t, tf.int32, output_shapes=[None])

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  print(sess.run(next_element))  # ==> [4 2]
  print(sess.run(next_element))  # ==> [3 4 5]
mrry
  • @mrry, I am working on the same idea. Can the dataset that came from the generator be batched, I mean split into small batches? (see the sketch below) – Hunar Oct 09 '18 at 08:31
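
A minimal follow-up sketch (not part of the original answer; TF 1.x API assumed): because the sentences have different lengths, `dataset.padded_batch` can pad each batch to its own longest sentence:

import tensorflow as tf

t = [[4, 2], [3, 4, 5], [1], [6, 7, 8, 9]]  # hypothetical extra sentences for illustration

dataset = tf.data.Dataset.from_generator(lambda: t, tf.int32, output_shapes=[None])
# Pad only within each batch, to the longest sentence in that batch.
dataset = dataset.padded_batch(2, padded_shapes=[None])

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
  print(sess.run(next_batch))  # ==> [[4 2 0] [3 4 5]]
  print(sess.run(next_batch))  # ==> [[1 0 0 0] [6 7 8 9]]
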
6

For those working with TensorFlow 2 and looking for an answer, I found the following to work directly with ragged tensors, which should be much faster than a generator as long as the entire dataset fits in memory.

import tensorflow as tf

t = [[[4,2]],
     [[3,4,5]]]

rt = tf.ragged.constant(t)
dataset = tf.data.Dataset.from_tensor_slices(rt)

for x in dataset:
  print(x)

produces

<tf.RaggedTensor [[4, 2]]>
<tf.RaggedTensor [[3, 4, 5]]>

For some reason, it's very particular about having at least 2 dimensions on the individual arrays.
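
If you need padded batches rather than single examples, one option (a sketch, not from the original answer; it assumes TF 2.x and the same nested `t` as above) is to densify each ragged element and then pad across the batch:

import tensorflow as tf

t = [[[4, 2]],
     [[3, 4, 5]]]

dataset = tf.data.Dataset.from_tensor_slices(tf.ragged.constant(t))

# Convert each ragged element to a dense tensor, then pad to the longest one in the batch.
dense = dataset.map(lambda rt: rt.to_tensor())
batched = dense.padded_batch(2, padded_shapes=[None, None])

for batch in batched:
  print(batch.shape)  # ==> (2, 1, 3)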

FlashDD
0

TensorFlow's dense tensors don't support a varying number of elements along a given dimension.

However, a simple solution is to pad the nested lists with trailing zeros (where necessary):

import tensorflow as tf

t = [[4,2], [3,4,5]]
max_length = max(len(lst) for lst in t)
t_pad = [lst + [0] * (max_length - len(lst)) for lst in t]
print(t_pad)
dataset = tf.data.Dataset.from_tensor_slices(t_pad)
print(dataset)

Outputs:

[[4, 2, 0], [3, 4, 5]]
<TensorSliceDataset shapes: (3,), types: tf.int32>

The zeros shouldn't be a big problem for the model: semantically they're just extra padding tokens at the end of each shorter sentence.

scrpy
  • Hi, thanks for the answer, but I cannot pad the whole list of lists due to its size. I will do the padding for every batch, but not for the whole dataset composed of millions of sentences. – Escachator Dec 04 '17 at 13:27
0

In addition to @mrry's answer, the following code is also possible if you would like to create (images, labels) pairs:

import itertools
import tensorflow as tf

# images: list of HxWx3 float arrays, labels: list of 1-D float arrays (assumed to exist)
dataset = tf.data.Dataset.from_generator(lambda: itertools.zip_longest(images, labels),
                                         output_types=(tf.float32, tf.float32),
                                         output_shapes=(tf.TensorShape([None, None, 3]),
                                                        tf.TensorShape([None])))

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    image, label = sess.run(next_element)  # ==> shape: [320, 420, 3], [20]
    image, label = sess.run(next_element)  # ==> shape: [1280, 720, 3], [40]
Dat