We are running multi-GPU jobs on TensorFlow and evaluating a migration from the queue-based model (using the string_input_producer interface) to the new TensorFlow Dataset API. The latter appears to offer an easier way to switch between training and validation concurrently.

A snippet of code below shows how we are doing this.

    train_dataset, train_iterator = get_dataset(train_files, batch_size, epochs)
    val_dataset, val_iterator = get_dataset(val_files, batch_size, epochs)

    is_validating = tf.placeholder(dtype=bool, shape=())
    next_batch = tf.cond(is_validating,
                         lambda: val_iterator.get_next(),
                         lambda: train_iterator.get_next())

    validation_tower = self.num_gpus - 1
    tower_grads = []

    for i in range(self.num_gpus):
        with tf.variable_scope(tf.get_variable_scope(), reuse=(i > 0)):
            with tf.device('/gpu:%d' % i), tf.name_scope('%s_%d' % ('gpu_', i)) as scope:
                if i == validation_tower:
                    images, labels = next_batch
                    # Loss funcs snipped out
                else:
                    images, labels = next_batch
                    # Loss funcs snipped out

The get_dataset function builds a dataset, applies a map function, and sets a batch size. It also builds an iterator, but doesn't initialize it; the iterator is initialized before the main loop of the session starts.
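
Roughly, get_dataset looks like this (a simplified sketch; the TFRecord format and the parse_example map function are stand-ins for our actual pipeline, not its real names):

    import tensorflow as tf

    def get_dataset(filenames, batch_size, epochs):
        # Build the input pipeline: decode records, batch, and repeat.
        dataset = tf.data.TFRecordDataset(filenames)
        dataset = dataset.map(parse_example)   # parse_example is our map fn
        dataset = dataset.batch(batch_size)
        dataset = dataset.repeat(epochs)
        # The iterator is created here; its initializer is run later.
        iterator = dataset.make_initializable_iterator()
        return dataset, iterator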

The is_validating boolean is supplied while the session is running: every few steps we pass is_validating as True via a feed_dict so that the validation dataset is used.
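
Concretely, the run loop looks something like this (a simplified sketch; train_op, loss, max_steps, and val_interval are placeholders for our actual code):

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run([train_iterator.initializer, val_iterator.initializer])
        for step in range(max_steps):
            if step % val_interval == 0:
                # Every few steps, pull a batch from the validation set.
                sess.run(loss, feed_dict={is_validating: True})
            else:
                sess.run(train_op, feed_dict={is_validating: False})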

The question I have is:

Let's say I have 8 GPUs, so we run training on 7 of them. Does the iterator advance from the same point for each of these 7 GPUs, thereby supplying all 7 GPUs with the same data?

7hacker

1 Answer

At present there are three main options, which have different usability and performance trade-offs:

  1. In the Dataset.batch() transform, create a single large batch containing examples for all of your GPUs. Then use tf.split(..., self.num_gpus) on the output of Iterator.get_next() to create sub-batches for each GPU. This is probably the easiest approach, but it does place the splitting on the critical path (sketched below).

  2. In the Dataset.batch() transform, create a mini-batch that is sized for a single GPU. Then call Iterator.get_next() once per GPU to get multiple different batches (sketched below). (By contrast, in your current code, the same value of next_batch is sent to each GPU, which is probably not what you wanted to happen.)

  3. Create multiple iterators, one per GPU. Shard the data using Dataset.shard() early in the pipeline (e.g. on the list of files if your dataset is sharded), as sketched below. Note that this approach will consume more resources on the host, so you may need to dial down any buffer sizes and/or degrees of parallelism.
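
Rough sketches of the three options follow. In all of them, parse_example, build_tower, train_files, num_gpus, and per_gpu_batch_size are placeholders for your own code (not actual API symbols), and the train/validation switching from your question is omitted for brevity.

Option 1, a single large batch split across GPUs:

    import tensorflow as tf

    num_gpus = 8                 # placeholder
    per_gpu_batch_size = 32      # placeholder

    dataset = tf.data.TFRecordDataset(train_files)
    dataset = dataset.map(parse_example)
    # One large batch that covers all GPUs.
    dataset = dataset.batch(per_gpu_batch_size * num_gpus)
    iterator = dataset.make_one_shot_iterator()

    images, labels = iterator.get_next()
    # Split the large batch into one sub-batch per GPU; this split sits on
    # the critical path of every step.
    image_splits = tf.split(images, num_gpus)
    label_splits = tf.split(labels, num_gpus)

    for i in range(num_gpus):
        with tf.device('/gpu:%d' % i):
            loss = build_tower(image_splits[i], label_splits[i])

Option 2, per-GPU batches with one get_next() call per tower:

    dataset = tf.data.TFRecordDataset(train_files)
    dataset = dataset.map(parse_example).batch(per_gpu_batch_size)
    iterator = dataset.make_one_shot_iterator()

    for i in range(num_gpus):
        # Each call to get_next() yields a different batch, so every tower
        # receives its own data.
        images, labels = iterator.get_next()
        with tf.device('/gpu:%d' % i):
            loss = build_tower(images, labels)

Option 3, one iterator per GPU with the file list sharded early:

    for i in range(num_gpus):
        files = tf.data.Dataset.from_tensor_slices(train_files)
        files = files.shard(num_gpus, i)            # disjoint files per tower
        dataset = files.flat_map(tf.data.TFRecordDataset)
        dataset = dataset.map(parse_example).batch(per_gpu_batch_size)
        iterator = dataset.make_one_shot_iterator()
        images, labels = iterator.get_next()
        with tf.device('/gpu:%d' % i):
            loss = build_tower(images, labels)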

Note that the current tf.data pipelines run on the CPU only, and an important aspect of an efficient pipeline is staging your training input to the GPU while the previous step is still running. See the TensorFlow CNN benchmarks for example code that shows how to stage data to GPUs efficiently. We are currently working on adding this support to the tf.data API directly.
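
As a rough illustration of that staging idea (a hand-written sketch, not the benchmarks code; num_gpus, iterator, and train_op are placeholders from the earlier sketches), a tf.contrib.staging.StagingArea per GPU can hold the next batch on the device while the current step runs:

    import tensorflow as tf

    stage_ops = []
    gpu_batches = []
    for i in range(num_gpus):
        # The tf.data pipeline produces batches on the CPU.
        images, labels = iterator.get_next()
        with tf.device('/gpu:%d' % i):
            area = tf.contrib.staging.StagingArea(
                dtypes=[images.dtype, labels.dtype],
                shapes=[images.get_shape(), labels.get_shape()])
            # put() copies the next batch onto the GPU; get() returns the
            # previously staged batch for this step's computation.
            stage_ops.append(area.put([images, labels]))
            gpu_batches.append(area.get())

    # In the training loop, run the stage ops alongside the train op so the
    # next batch is transferred while the current step computes:
    #     sess.run([train_op] + stage_ops, ...)
    # (Run the stage ops once on their own first to prime the staging areas.)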

mrry
  • Thanks, @mrry! Very helpful and detailed. Here's another question for you: In our current queue-based approach (tf.train.string_input_producer), we have a queue per GPU and each queue has a copy of the entire dataset. Each queue is shuffled and the job runs infinitely over the dataset for a fixed number of steps. We tried a similar Dataset/Iterator-per-GPU approach and it works. However, the main difference we noticed is the number of examples processed: in the former we process around 200 examples per second and in the latter about 20, so 10x lower throughput (continued..) – 7hacker Oct 27 '17 at 18:34
  • Is this because string_input_producers are running on GPUs natively? Is there a Dataset/Iterator approach that can enable this kind of throughput? Any other suggestions on our current approach would be helpful as well. – 7hacker Oct 27 '17 at 18:35
  • I am trying to implement this solution in the benchmarks code, but it seems difficult, not because of the tf.Dataset but because of the threads that fetch the data. The only thing I changed is a tf.cond inside the minibatch function, but the code gets stuck forever on self.cond.wait() – chrisrn Feb 05 '18 at 09:50