
I'm trying to read the CIFAR-10 dataset from 6 .bin files and then create an initializable iterator. This is the site I downloaded the data from; it also contains a description of the structure of the binary files. Each file contains 2500 images. The resulting iterator, however, only generates one tensor for each file: a tensor containing all 2500 records at once (2500 × 3073 values). Here is my code:

import tensorflow as tf

filename_dataset = tf.data.Dataset.list_files("cifar-10-batches-bin/*.bin")    
image_dataset = filename_dataset.map(lambda x: tf.decode_raw(tf.read_file(x), tf.float32))

iter_ = image_dataset.make_initializable_iterator()
next_file_data = iter_.get_next()

next_file_data = tf.reshape(next_file_data, [-1,3073])
next_file_img_data, next_file_labels = next_file_data[:,:-1], next_file_data[:,-1]
next_file_img_data = tf.reshape(next_file_img_data, [-1,32,32,3])

init_op = iter_.initializer

with tf.Session() as sess:
    sess.run(init_op)
    print(next_file_img_data.eval().shape) 


_______________________________________________________________________

>> (2500,32,32,3)

The first two lines are based on this answer. I would like to specify the number of images returned by get_next() using batch(), rather than having it be the number of images in each .bin file, which here is 2500.

There is already a question about flattening a dataset here, but the answer is not clear to me. In particular, it seems to contain a code snippet from a class method defined elsewhere, and I am not sure how to implement it.

I have also tried creating the dataset with tf.data.Dataset.from_tensor_slices(), replacing the first line above with

import os

filenames = [os.path.join('cifar-10-batches-bin',f) for f in os.listdir("cifar-10-batches-bin") if f.endswith('.bin')]
filename_dataset = tf.data.Dataset.from_tensor_slices(filenames)

but this didn't solve the problem.

Any help would be very much appreciated. Thanks.

ludog
1 Answer


I am not sure how your .bin files are structured. I am assuming each image contributes 32*32*3 = 3072 values, so the data in each file is a multiple of 3072. For any other structure the operations would be similar, so this can still serve as a guide. You could do a series of mapping operations:

import tensorflow as tf

filename_dataset = tf.data.Dataset.list_files("cifar-10-batches-bin/*.bin")    
image_dataset = filename_dataset.map(lambda x: tf.decode_raw(tf.read_file(x), tf.float32))
image_dataset = image_dataset.map(lambda x: tf.reshape(x, [-1, 32, 32, 3]))  # Reshape each file's data to (2500, 32, 32, 3)
image_dataset = image_dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))  # Slice each file's tensor into 2500 tensors of shape (32, 32, 3) and put them all together
image_dataset = image_dataset.batch(batch_size)  # Now you can define your batch size
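To make the effect of the last two lines concrete, here is a plain-Python analogy (no TensorFlow required) of flat_map followed by batch; the numbers are made up for illustration:

```python
from itertools import chain

# Two hypothetical "files", each decoded into a chunk of 5 "images".
files = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

# flat_map + from_tensor_slices: merge the per-file chunks into one
# flat sequence of individual elements.
flattened = list(chain.from_iterable(files))

# batch(batch_size): regroup the flat sequence into fixed-size batches
# (the final batch may be smaller).
batch_size = 4
batches = [flattened[i:i + batch_size]
           for i in range(0, len(flattened), batch_size)]
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The point is that batching happens after flattening, so batch boundaries no longer coincide with file boundaries.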
kvish
  • Yes thank you, this works. So does `flat_map()` apply some map to every element of a dataset and also flatten the dataset? This is how I understand the docs (https://www.tensorflow.org/api_docs/python/tf/data/Dataset#flat_map), but it seems strange that there is no method for flattening without applying a function. Also, your second last line, and the example from the docs, just has `Dataset.from_tensor_slices()`, whereas I need to use `tf.data.Dataset.from_tensor_slices()` to avoid a NameError. Is this just some alias I'm not using? – ludog Jan 08 '19 at 11:01
  • The data I'm using is downloaded from Alex Krizhevsky's website https://www.cs.toronto.edu/~kriz/cifar.html, which also contains a description of its structure. I'll edit my question to include this. – ludog Jan 08 '19 at 11:06
  • @ludog flat_map basically works on nested data structures, and puts all the elements together into one flat data structure. In the example above, your data would be contained in chunks of 2500, which corresponds to the data in each file. We use tensor_slices to slice this dimension to extract individual 32, 32, 3 data and then we put everything together in one dataset structure. And thank you for pointing out the way I'm calling from_tensor_slices()! It was a mistake on my part and I will fix it now. – kvish Jan 08 '19 at 11:10
  • Refer [importing data](https://www.tensorflow.org/guide/datasets#top_of_page), [flat_map example](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#flat_map) from the documentation to get more details on how they work :) – kvish Jan 08 '19 at 11:12
  • Ok, in that case the same typo is present in the example from the docs you linked. – ludog Jan 08 '19 at 11:40
  • I'm still not fully clear on what is being converted to what. Is there a difference between the following object types: (a) a dataset, (b) a nested structure of tensors? The docs list the input to `from_tensor_slices()` as a nested structure of tensors and the output as a dataset, and then the input to `flat_map()` as a nested structure of tensors and the output as a dataset. (Perhaps I should ask this as a separate question?) – ludog Jan 08 '19 at 11:46
  • @ludog in your case, the first map is now producing data from each file, whose shape is (2500, 32, 32, 3). In the flat_map function, this tensor is the input to tensor_slices. tensor_slices takes this (2500, 32, 32, 3) tensor as input and produces a dataset as output. Each element in this dataset is a tensor of shape (32, 32, 3). The length of this dataset would be 2500. What flat_map does is, it will take all the datasets coming from each file that is mapped this way, and put everything into one dataset, whose elements are of shape (32, 32, 3). – kvish Jan 08 '19 at 11:53
  • It is the meaning of 'a nested structure of tensors' that I am having trouble with. Sometimes it seems to mean a dataset, sometimes it seems to mean a tensor. But I understand the high-level description, and my code is working now, so thank you :) – ludog Jan 08 '19 at 12:10
  • @ludog you're welcome :) Basically a nested structure of tensors means your input could be a dataset, or it could be a tensor, or any class object that has self.output_shapes and self.output_types defined. This flexibility lets it work with a variety of complicated scenarios. Note that the output of the flat_map_func is a dataset. – kvish Jan 08 '19 at 12:28
  • hmm so why does it not work to just run `image_dataset = tf.data.Dataset.from_tensor_slices(image_dataset)` A dataset object has output_shapes and output_types attributes and each of the tensors in image_dataset is of the same shape, right? But it throws the error 'failed to convert MapDataset to tensor'. – ludog Jan 08 '19 at 13:08
  • @ludog if you look at the [from_tensor_slices](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) method, the input argument needs to be a nested structure of tensors, each having the same size in the 0th dimension. A tf.data dataset does not satisfy this requirement, as its elements can each have a different size in the 0th dimension. So you cannot convert it to the required tf.Tensor object. – kvish Jan 08 '19 at 13:24
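The 0th-dimension requirement from the last comment can be illustrated with a NumPy sketch (shapes here are made up for illustration): from_tensor_slices needs a single tensor-like object with one well-defined leading dimension to slice, which each per-file tensor has and a whole dataset does not.

```python
import numpy as np

# A per-file tensor: 4 fake "images" of shape (2, 2, 3). Slicing along
# dimension 0 is well defined, which is what from_tensor_slices relies on.
file_tensor = np.arange(4 * 2 * 2 * 3, dtype=np.float32).reshape(4, 2, 2, 3)
slices = [file_tensor[i] for i in range(file_tensor.shape[0])]
print(len(slices), slices[0].shape)  # 4 (2, 2, 3)

# A dataset-like collection whose elements have DIFFERENT leading sizes
# has no single 0th dimension to slice, which is why converting a
# MapDataset to a tensor fails.
dataset_like = [np.zeros((5, 2, 2, 3)), np.zeros((3, 2, 2, 3))]
leading_sizes = {a.shape[0] for a in dataset_like}
print(leading_sizes)  # {3, 5} -- no common 0th dimension
```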