
TensorFlow has a long-standing 2 GB limit on a single tensor, which means you can't train your model on more than 2 GB of data at a time without jumping through hoops. See Initializing tensorflow Variable with an array larger than 2GB and Use large dataset in Tensorflow.
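
For example, merely converting a large numpy array to a tensor already fails (the shapes here are made up, roughly 2.4 GB of float32; the error is the one I get):

import numpy as np
import tensorflow as tf

data_for_X = np.zeros((600000, 1000), dtype=np.float32)  # ~2.4 GB, just over the limit
X = tf.convert_to_tensor(data_for_X)
# ValueError: Cannot create a tensor proto whose content is larger than 2GB.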

The standard solution referenced in those posts is to use a placeholder and pass the data to the session through feed_dict:

import tensorflow as tf

my_graph = tf.Graph()
with my_graph.as_default():  # ops must live in the same graph the session runs
    X_init = tf.placeholder(tf.float32, shape=(m_input, n_input))
    X = tf.Variable(X_init)
    init = tf.global_variables_initializer()

sess = tf.Session(graph=my_graph)
sess.run(init, feed_dict={X_init: data_for_X})  # data_for_X is the large numpy array

However, this only works with the "old" API (tf.Session(), etc.). The recommended approach nowadays is to use Keras (all the tutorials on tensorflow.org use it), and with Keras there's no tf.Graph(), no tf.Session(), and no run() (at least none that are readily visible to the user).

How do I adapt the above code to work with Keras?

Eugene Smith
  • You don't have this problem in Keras. Did you actually try training a Keras model and see if you got a problem? It uses a completely different API, so I don't see where the problem is. – Dr. Snoopy Dec 26 '18 at 23:15
  • I get a "ValueError: Cannot create a tensor proto whose content is larger than 2GB." in tf.convert_to_tensor(), before I can even call any Keras API functions, when I try to convert my dataset from numpy.ndarray into a tensor. If I feed the data to Keras as numpy.ndarray directly, it bottlenecks inside Python code, with GPU utilization at 10%. – Eugene Smith Dec 26 '18 at 23:30
  • Keras will not get around the 2GB limit. – James Dec 26 '18 at 23:31
  • 1
    Why do you need to convert to tensor? That is not needed at all, you just make a generator and use fit_generator or use fit directly if your data fits into RAM. Make sure to adjust the batch size to maximize performance. I've trained Keras models with a 600 GB dataset (OpenImages) without issues. – Dr. Snoopy Dec 26 '18 at 23:35
  • @MatiasValdenegro https://stackoverflow.com/questions/53931748/keras-tensorflow-numpy-vs-tensor-performance – Eugene Smith Dec 26 '18 at 23:39
  • @EugeneSmith Batch sizes on the order of thousands will bottleneck the CPU and CPU-GPU data transfer; that's why you see those results. – Dr. Snoopy Dec 26 '18 at 23:41
  • What batch size would you propose? – Eugene Smith Dec 26 '18 at 23:43

2 Answers


In Keras, you don't load your entire dataset into a tensor. You load it into numpy arrays.

If the entire dataset fits in a single numpy array:

Thanks to @sebrockm's comment.

The most trivial usage of Keras is simply loading your dataset into a numpy array (not a tf tensor) and calling model.fit(arrayWithInputs, arrayWithOutputs, ...)
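
A minimal sketch of that pattern (the model architecture and array shapes below are made up for illustration; arrayWithInputs and arrayWithOutputs are plain numpy arrays):

import numpy as np
from tensorflow import keras

arrayWithInputs = np.random.rand(10000, 20).astype(np.float32)   # hypothetical features
arrayWithOutputs = np.random.rand(10000, 1).astype(np.float32)   # hypothetical targets

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# Keras slices the numpy arrays into batches internally; no tf tensor is needed
model.fit(arrayWithInputs, arrayWithOutputs, epochs=2, batch_size=256)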

If the entire dataset doesn't fit in a numpy array:

You'd create a generator or a keras.utils.Sequence to load batches one by one, and then train the model with model.fit_generator(generatorOrSequence, ...).

The limitation becomes the batch size, but you'd hardly ever hit 2GB in a single batch. So, go for it.
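
For illustration, a minimal keras.utils.Sequence that serves batches from memory-mapped .npy files (the file names and shapes are placeholders, and a compiled Keras model like the one sketched above is assumed):

import numpy as np
from tensorflow import keras

class DiskBatchSequence(keras.utils.Sequence):
    """Yields one batch at a time from memory-mapped .npy files."""
    def __init__(self, x_path, y_path, batch_size=256):
        # mmap_mode='r' means only the requested slices are read into RAM
        self.x = np.load(x_path, mmap_mode='r')
        self.y = np.load(y_path, mmap_mode='r')
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return np.array(self.x[batch]), np.array(self.y[batch])

# 'features.npy' and 'labels.npy' are placeholder file names
model.fit_generator(DiskBatchSequence('features.npy', 'labels.npy'), epochs=2)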

Daniel Möller
  • But that would entail creating batches in Python code, which is surely something we'd all want to avoid? If I have 2 GB of data to be divided into 1000 batches, I want to hand the entire 2 GB to TF and say 'train on this, divided into 1000 batches'. I don't want TF to do 1000 Python calls to pull batches one by one, since that would be disastrous performance-wise. – Eugene Smith Dec 27 '18 at 00:15
  • @EugeneSmith Batching in python is something we all regularly do, and it works fine. At least in Keras :) – Dr. Snoopy Dec 27 '18 at 00:57
  • 1
    @EugeneSmith if your entire dataset is only 2GB, you don't need `fit_generator` at all, just `fit` will suffice (unless you have only 2GB total RAM). `fit_generator` is only useful if the **entire** dataset (before dividing it into batches) doesn't fit into RAM. – sebrockm Dec 27 '18 at 00:58

Keras doesn't have a 2GB limitation for datasets; I've trained much larger datasets with Keras without issues.

The limitation could come from TensorFlow constants, which do have a 2GB limit, but in any case you should NOT store datasets as constants: constants are saved as part of the graph, and the graph is meant to store the model, not the data.

Keras has the model.fit_generator function, which lets you pass a generator function that loads data on the fly and makes batches. This allows you to train on a large dataset without loading it all at once, and you usually adjust the batch size to maximize performance with acceptable RAM usage. TensorFlow doesn't have a similar API; you have to implement it manually, as you say, with feed_dict.
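
A rough sketch of that idea with a plain Python generator (the file paths, batch size, and steps_per_epoch are made up, and a compiled Keras model is assumed):

import numpy as np

def batch_generator(x_path, y_path, batch_size=256):
    # Memory-map the arrays so only the current batch is read from disk
    x = np.load(x_path, mmap_mode='r')
    y = np.load(y_path, mmap_mode='r')
    while True:  # fit_generator expects the generator to loop indefinitely
        for start in range(0, len(x), batch_size):
            stop = start + batch_size
            yield np.array(x[start:stop]), np.array(y[start:stop])

# steps_per_epoch must be given for a plain generator (total samples / batch size)
model.fit_generator(batch_generator('features.npy', 'labels.npy'),
                    steps_per_epoch=1000, epochs=2)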

Dr. Snoopy