
With the recent upgrade to version 1.4, TensorFlow included tf.data in the library core. One "major new feature" described in the version 1.4 release notes is tf.data.Dataset.apply(), which is a "method for applying custom transformation functions". How is this different from the already existing tf.data.Dataset.map()?

GPhilo

4 Answers


The difference is that map will execute one function on every element of the Dataset separately, whereas apply will execute one function on the whole Dataset at once (such as group_by_window, given as an example in the documentation).

The argument of apply is a function that takes a Dataset and returns a Dataset, whereas the argument of map is a function that takes one element and returns one transformed element.
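
A minimal sketch of the two signatures (the function names are illustrative):

import tensorflow as tf

# map: the function receives one element and returns one transformed element.
doubled = tf.data.Dataset.range(5).map(lambda x: x * 2)

# apply: the function receives the whole Dataset and returns a new Dataset.
def take_first_three(dataset):
    return dataset.take(3)

truncated = tf.data.Dataset.range(5).apply(take_first_three)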

Sunreef
  • ...I was wondering where the documentation was for the new functions to use with `apply`. And now I see that the functions are in tf.**contrib**.data, while the Dataset API was moved to tf.data, which is where I was searching. – GPhilo Nov 03 '17 at 13:47
  • I'm afraid I still don't understand the difference in practical terms. `map` is used to transform the values in a dataset, while `apply` operates "on" the dataset itself... so `apply` could also replace `map`? – GPhilo Nov 03 '17 at 13:59
  • `apply` is used when you need to consider several elements at once. For example, if you want to create a dataset with averages of five consecutive elements in your dataset, then you couldn't do that with `map` (see the sketch after these comments) – Sunreef Nov 03 '17 at 14:01
  • I see! Ok, that now makes sense. So, if I got it correctly, `apply` can entirely replace `map` (because if it has access to all items, it can also have access to them one at a time like `map`), but writing functions for `apply` is not as straightforward as writing functions that operate on the values themselves, so we get to keep `map` for practical reasons. Does that make sense? – GPhilo Nov 03 '17 at 14:06
  • That's how I understand it, yes. If you wanted to replace `map` with an `apply` call, then your function inside `apply` would have to do the equivalent of a `map` anyway. – Sunreef Nov 03 '17 at 14:10
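
A hedged sketch of the five-consecutive-element average mentioned in the comments above, written as a whole-Dataset function for apply() (the names are illustrative, and the groups here are non-overlapping):

import tensorflow as tf

# Average non-overlapping groups of five consecutive elements. A plain
# element-wise map() cannot see across element boundaries like this.
def average_of_five(dataset):
    return (dataset.batch(5)
                   .map(lambda group: tf.reduce_mean(tf.cast(group, tf.float32))))

averages = tf.data.Dataset.range(20).apply(average_of_five)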

Sunreef's answer is absolutely correct. You might still be wondering why we introduced Dataset.apply(), and I thought I'd offer some background.

The tf.data API has a set of core transformations—like Dataset.map() and Dataset.filter()—that are generally useful across a wide range of datasets, unlikely to change, and implemented as methods on the tf.data.Dataset object. In particular, they are subject to the same backwards compatibility guarantees as other core APIs in TensorFlow.

However, the core approach is a bit restrictive. We also want the freedom to experiment with new transformations before adding them to the core, and to allow other library developers to create their own reusable transformations. Therefore, in TensorFlow 1.4 we split out a set of custom transformations that live in tf.contrib.data. The custom transformations include some that have very specific functionality (like tf.contrib.data.sloppy_interleave()), and some where the API is still in flux (like tf.contrib.data.group_by_window()). Originally we implemented these custom transformations as functions from Dataset to Dataset, which had an unfortunate effect on the syntactic flow of a pipeline. For example:

dataset = tf.data.TFRecordDataset(...).map(...)

# Method chaining breaks when we apply a custom transformation.
dataset = custom_transformation(dataset, x, y, z)

dataset = dataset.shuffle(...).repeat(...).batch(...)

Since this seemed to be a common pattern, we added Dataset.apply() as a way to chain core and custom transformations in a single pipeline:

dataset = (tf.data.TFRecordDataset(...)
           .map(...)
           .apply(custom_transformation(x, y, z))
           .shuffle(...)
           .repeat(...)
           .batch(...))
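
For concreteness, a minimal "hello world" sketch of such a custom transformation (the arguments and body are purely illustrative): the outer function takes configuration arguments and returns a Dataset-to-Dataset function, which is the signature that Dataset.apply() expects.

import tensorflow as tf

def custom_transformation(x, y, z):
    # Hypothetical configuration arguments. The inner function is the
    # Dataset-to-Dataset transformation that Dataset.apply() receives.
    def _apply_fn(dataset):
        return dataset.map(lambda elem: elem * x + y).skip(z)
    return _apply_fn

dataset = tf.data.Dataset.range(10).apply(custom_transformation(2, 1, 3))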

It's a minor feature in the grand scheme of things, but hopefully it helps to make tf.data programs easier to read, and the library easier to extend.

mrry
  • Thank you!! This is exactly what I hoped to understand, and the piece I was missing to fully grasp the new interface. I'll leave @Sunreef's answer marked as "the" answer, but in truth the two answers complement each other. – GPhilo Nov 03 '17 at 15:38
  • @mrry how are map and apply executed? Are they executed per batch every time you get the next element of the iterator, or applied to the whole dataset once the graph has been initialized? The reason I am asking is that loading a dataset such as ImageNet might take much longer if the transformation is applied on every batch rather than once on the whole dataset – Eliethesaiyan May 31 '18 at 09:03
  • @Eliethesaiyan Most transformations in `tf.data` are stateless and streaming, which means that they can return results immediately and consume little memory when dealing with large datasets that might not fit in memory. This means that, if you have a `Dataset.repeat()` following those transformations, they will execute on each pass through the data. If you want to avoid this, add a `Dataset.cache()` after the transformations that you want to perform once (see the sketch after these comments). – mrry May 31 '18 at 14:50
  • @mrry could you provide a "hello world implementation" of `custom_transformation`? – Marsellus Wallace Nov 30 '18 at 16:17
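
A minimal sketch of the cache() placement described in the comment above (the numbers are illustrative): the map() before cache() runs only on the first pass over the data, while everything after it runs on every epoch.

import tensorflow as tf

dataset = (tf.data.Dataset.range(1000)
           .map(lambda x: x * 2)   # stand-in for an expensive transformation
           .cache()                # results are materialized on the first pass
           .shuffle(100)           # runs on every epoch
           .repeat()
           .batch(32))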

I don't have enough reputation to comment, but I just wanted to point out that you can actually use map to apply a function to multiple elements of a dataset at once, contrary to @Sunreef's comments on his own post.

According to the documentation, map takes as an argument:

map_func: A function mapping a nested structure of tensors (having shapes and types defined by self.output_shapes and self.output_types) to another nested structure of tensors.

The output_shapes are defined by the dataset and can be modified by using API functions like batch. So, for example, you can do a batch normalization using only dataset.batch and .map with:

# Dataset transformations return new datasets, so reassign the result:
dataset = ...
dataset = dataset.batch(batch_size)
dataset = dataset.map(normalize_fn)

It seems like the primary utility of apply() is when you really want to do a transformation across the entire dataset.
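
A hedged sketch of what normalize_fn might look like (the function is hypothetical): because the dataset is batched first, map() receives a whole batch as one "element", so per-batch statistics are available.

import tensorflow as tf

def normalize_fn(batch):
    # Standardize each batch using its own mean and variance.
    batch = tf.cast(batch, tf.float32)
    mean, variance = tf.nn.moments(batch, axes=[0])
    return (batch - mean) / tf.sqrt(variance + 1e-6)

dataset = (tf.data.Dataset.range(100)
           .batch(10)
           .map(normalize_fn))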

zephyrus
  • Your use-case can indeed be built with a combination of batch and map, but map is still operating on the single values of the dataset it's called on (which in this case are of course batches). Apply is more general, because it allows, for example, variable-length batches, reordering, grouping, etc. @Sunreef's example of group_by_window cannot be reproduced with map() (note that this isn't a transformation that needs to be processed on the whole dataset, since windows could be built incrementally) – GPhilo Nov 03 '17 at 21:18
  • I agree -- again, this was more of a comment directed at people new to the Dataset API who might not be familiar with the way "element" is actually a dynamic concept depending on how you define the dataset. It's intuitive once you get it, but it took me a second. – zephyrus Nov 03 '17 at 21:24

Simply put: the argument transformation_func of apply() is a Dataset; the argument map_func of map() is an element.

武状元