23

I have a long list of lists of integers (representing sentences, each of a different size) that I want to feed using the tf.data library. Each inner list has a different length, and I get an error, which I can reproduce here:

import tensorflow as tf

t = [[4, 2], [3, 4, 5]]
dataset = tf.data.Dataset.from_tensor_slices(t)

The error I get is:

ValueError: Argument must be a dense tensor: [[4, 2], [3, 4, 5]] - got shape [2], but wanted [2, 2].

Is there a way to do this?

EDIT 1: Just to be clear, I don't want to pad the input list of lists (it's a list of over a million sentences, with varying lengths). I want to use the tf.data library to feed, in a proper way, a list of lists with varying lengths.

mrry
Escachator
  • maybe using the mapping function in some way? – Escachator Nov 30 '17 at 21:12
  • Duplicate of [this question](https://stackoverflow.com/questions/40450506/convert-a-list-with-non-fixed-length-elements-to-tensor). – scrpy Dec 04 '17 at 12:33
  • Just to be clear: I don't want to pad the tensor, I want to be able to feed, using the data library, a list of lists with different length. – Escachator Dec 04 '17 at 13:24
  • @landogar note that my specific question is "how to input a list of lists with different sizes in tf.data.Dataset", which is different to the question you link: "Convert a list with non-fixed length elements to tensor". – Escachator Dec 04 '17 at 14:29
  • If you pass the list of sentences (a list of strings) to `tf.data.Dataset.from_tensor_slices` it should work, and you should then be able to transform each sentence to a list of integers using `dataset.map(your_function)`. You can then use `dataset.padded_batch` to automatically add the padding (a sketch of this pipeline follows the comments). – Olivier Moindrot Dec 05 '17 at 04:02
  • This example can be useful: https://github.com/tensorflow/nmt#data-input-pipeline – Olivier Moindrot Dec 05 '17 at 04:12
  • Hi @OlivierMoindrot, I have seen that example. My concern is: do the map functions execute when you run the graph during training (i.e. every time you feed new data to the model), or are they executed over the whole dataset before training, with the result then fed in? The first seems much slower for training than the second, and that's what I wanted to avoid. – Escachator Dec 05 '17 at 09:23
  • This is the whole point of `tf.data`: it uses queues in the background and only processes data as needed. You can "prefetch" data to make sure that your GPU is never waiting for data and is working at 100%. As data is consumed at one end (for training), the queues earlier in the pipeline get filled up with data. You can even have multiple workers with `num_parallel_calls`. – Olivier Moindrot Dec 05 '17 at 09:28
  • that sounds very cool, I'll give it a try. How do you set it up to prefetch data? – Escachator Dec 05 '17 at 09:33
  • `dataset.prefetch` – Olivier Moindrot Dec 05 '17 at 17:24
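
The pipeline Olivier describes, as a minimal sketch (not from the thread; it assumes the sentences are stored as space-separated id strings and uses the TF 1.x API):

import tensorflow as tf

sentences = ["4 2", "3 4 5"]  # hypothetical: each sentence as a space-separated string of ids

def to_ids(sentence):
    # Split on whitespace and convert each token to an int32 id.
    tokens = tf.string_split([sentence]).values
    return tf.string_to_number(tokens, out_type=tf.int32)

dataset = (tf.data.Dataset.from_tensor_slices(sentences)
           .map(to_ids, num_parallel_calls=4)      # runs lazily, element by element
           .padded_batch(2, padded_shapes=[None])  # pad only to the longest sentence in each batch
           .prefetch(1))                           # keep the next batch ready while the model trains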

4 Answers

21

You can use tf.data.Dataset.from_generator() to convert any iterable Python object (like a list of lists) into a Dataset:

import tensorflow as tf

t = [[4, 2], [3, 4, 5]]

dataset = tf.data.Dataset.from_generator(lambda: t, tf.int32, output_shapes=[None])

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  print(sess.run(next_element))  # ==> [4 2]
  print(sess.run(next_element))  # ==> [3 4 5]
mrry
  • @mrry, I am working on the same idea. Can the dataset that came from the generator be batched, I mean split into small batches? (see the sketch below) – Hunar Oct 09 '18 at 08:31
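
A minimal follow-up sketch (not part of the original answer; TF 1.x API assumed): because the sentences have different lengths, `dataset.padded_batch` can pad each batch to its own longest sentence:

import tensorflow as tf

t = [[4, 2], [3, 4, 5], [1], [6, 7, 8, 9]]  # hypothetical extra sentences for illustration

dataset = tf.data.Dataset.from_generator(lambda: t, tf.int32, output_shapes=[None])
# Pad only within each batch, to the longest sentence in that batch.
dataset = dataset.padded_batch(2, padded_shapes=[None])

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
  print(sess.run(next_batch))  # ==> [[4 2 0] [3 4 5]]
  print(sess.run(next_batch))  # ==> [[1 0 0 0] [6 7 8 9]]
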
6

For those working with TensorFlow 2 and looking for an answer, I found the following to work directly with ragged tensors, which should be much faster than a generator as long as the entire dataset fits in memory.

import tensorflow as tf

t = [[[4,2]],
     [[3,4,5]]]

rt = tf.ragged.constant(t)
dataset = tf.data.Dataset.from_tensor_slices(rt)

for x in dataset:
  print(x)

produces

<tf.RaggedTensor [[4, 2]]>
<tf.RaggedTensor [[3, 4, 5]]>

For some reason, it's very particular about having at least 2 dimensions on the individual arrays.
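
If you need padded batches rather than single examples, one option (a sketch, not from the original answer; it assumes TF 2.x and the same nested `t` as above) is to densify each ragged element and then pad across the batch:

import tensorflow as tf

t = [[[4, 2]],
     [[3, 4, 5]]]

dataset = tf.data.Dataset.from_tensor_slices(tf.ragged.constant(t))

# Convert each ragged element to a dense tensor, then pad to the longest one in the batch.
dense = dataset.map(lambda rt: rt.to_tensor())
batched = dense.padded_batch(2, padded_shapes=[None, None])

for batch in batched:
  print(batch.shape)  # ==> (2, 1, 3)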

FlashDD
0

TensorFlow's dense tensors don't support a varying number of elements along a given dimension.

However, a simple solution is to pad the nested lists with trailing zeros (where necessary):

import tensorflow as tf

t = [[4,2], [3,4,5]]
max_length = max(len(lst) for lst in t)
t_pad = [lst + [0] * (max_length - len(lst)) for lst in t]
print(t_pad)
dataset = tf.data.Dataset.from_tensor_slices(t_pad)
print(dataset)

Outputs:

[[4, 2, 0], [3, 4, 5]]
<TensorSliceDataset shapes: (3,), types: tf.int32>

The zeros shouldn't be a big problem for the model: semantically they're just extra padding tokens at the end of each shorter sentence.

scrpy
  • Hi, thanks for the answer, but I cannot pad the whole list of lists due to its size. I will do the padding for every batch, but not for the whole dataset composed of millions of sentences. – Escachator Dec 04 '17 at 13:27
0

In addition to @mrry's answer, the following code is also possible if you would like to create (images, labels) pairs:

import itertools
import tensorflow as tf

# images: list of HxWx3 float arrays, labels: list of 1-D float arrays (assumed to exist)
dataset = tf.data.Dataset.from_generator(lambda: itertools.zip_longest(images, labels),
                                         output_types=(tf.float32, tf.float32),
                                         output_shapes=(tf.TensorShape([None, None, 3]),
                                                        tf.TensorShape([None])))

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    image, label = sess.run(next_element)  # ==> shape: [320, 420, 3], [20]
    image, label = sess.run(next_element)  # ==> shape: [1280, 720, 3], [40]
Dat