
Suppose I want to write a custom optimizer class that conforms to the tf.keras API (TensorFlow version >= 2.0). I am confused about the documented way to do this versus what is actually done in the implementations.

The documentation for tf.keras.optimizers.Optimizer states,

  ### Write a customized optimizer.
  If you intend to create your own optimization algorithm, simply inherit from
  this class and override the following methods:

    - resource_apply_dense (update variable given gradient tensor is dense)
    - resource_apply_sparse (update variable given gradient tensor is sparse)
    - create_slots (if your optimizer algorithm requires additional variables)

However, the current tf.keras.optimizers.Optimizer implementation does not define a resource_apply_dense method, but it does define a private-looking _resource_apply_dense method stub. Similarly, there are no resource_apply_sparse or create_slots methods, but there are a _resource_apply_sparse method stub and a _create_slots method call.

In official tf.keras.optimizers.Optimizer subclasses (using tf.keras.optimizers.Adam as an example), there are _resource_apply_dense, _resource_apply_sparse, and _create_slots methods, and there are no such methods without the leading underscore.

There are similar leading-underscore methods in slightly-less-official tf.keras.optimizers.Optimizer subclasses (e.g., tfa.optimizers.MovingAverage from TensorFlow Addons: _resource_apply_dense, _resource_apply_sparse, _create_slots).

Another confounding point for me is that some of the TensorFlow Addons optimizers also override the apply_gradients method (e.g., tfa.optimizers.MovingAverage), whereas the tf.keras.optimizers optimizers do not.

Moreover, I noticed that the apply_gradients method of tf.keras.optimizers.Optimizer calls _create_slots, but the base tf.keras.optimizers.Optimizer class does not have a _create_slots method. So, it seems that a _create_slots method must be defined in an optimizer subclass if that subclass does not override apply_gradients.
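
For concreteness, this is roughly how I read the control flow (a heavily simplified paraphrase on my part, not the actual TensorFlow implementation):

    import tensorflow as tf

    # Simplified paraphrase of what apply_gradients appears to do (NOT the real code):
    class _SimplifiedOptimizerV2:
        def apply_gradients(self, grads_and_vars, name=None):
            grads_and_vars = list(grads_and_vars)
            var_list = [var for _, var in grads_and_vars]
            self._create_slots(var_list)  # called here, yet never defined on the base class
            for grad, var in grads_and_vars:
                if isinstance(grad, tf.IndexedSlices):  # sparse gradient, e.g. from an Embedding lookup
                    self._resource_apply_sparse(grad.values, var, grad.indices)
                else:
                    self._resource_apply_dense(grad, var)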


Questions

What is the correct way to subclass a tf.keras.optimizers.Optimizer? Specifically,

  1. Does the tf.keras.optimizers.Optimizer documentation listed at the top simply mean to override the leading-underscore versions of the methods they mention (e.g., _resource_apply_dense instead of resource_apply_dense)? If so, are there any API guarantees about these private-looking methods not changing their behavior in future versions of TensorFlow? What are the signatures of these methods?
  2. When would one override apply_gradients in addition to the _resource_apply_[dense|sparse] methods?

Edit. Opened issue on GitHub: #36449

Artem Mavrin
  • This may be something to report as a documentation issue to the devs. It most definitely looks like the methods to override should include the initial underscore in the documentation, but in any case, like you say, there is no information about their signature and exact purpose. It may also be that method names without the underscore (and documented) are planned to be added (like with `get_config`), but then they shouldn't yet appear in the [public documentation](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Optimizer#write_a_customized_optimizer_2). – jdehesa Jan 24 '20 at 10:14
  • For the signatures, you can always look at the declaration of [`_resource_apply_dense`](https://github.com/tensorflow/tensorflow/blob/v2.1.0/tensorflow/python/keras/optimizer_v2/optimizer_v2.py#L863-L874) or [`_resource_apply_sparse`](https://github.com/tensorflow/tensorflow/blob/v2.1.0/tensorflow/python/keras/optimizer_v2/optimizer_v2.py#L905-L923), and see their usage in implemented optimizers. While it may not be, I think, public API with stability guarantees, I'd say it's pretty safe to use them. They just should provide better guidance in this aspect. – jdehesa Jan 24 '20 at 10:27
  • I agree that this is a documentation issue with TensorFlow. Did you create an issue for this in the tf Github repo? If so, could you share the link here? – jpgard Feb 03 '20 at 22:11

2 Answers


Update: TF2.2 forced me to clean up all implementations - so now they can be used as a reference for TF best practices. Also added a section below on _get_hyper vs. _set_hyper.


I've implemented Keras AdamW in all major TF & Keras versions - I invite you to examine optimizers_v2.py. Several points:

  • You should inherit from OptimizerV2, which is in fact the class you linked; it is the current base class for tf.keras optimizers
  • You are correct about (1) - this is a documentation mistake; the methods are private because they aren't meant to be called by the user directly.
  • apply_gradients (or any other method) is only overridden if the default doesn't accomplish what's needed for a given optimizer; in your linked example, it's just a one-line add-on to the original
  • "So, it seems that a _create_slots method must be defined in an optimizer subclass if that subclass does not override apply_gradients" - the two are unrelated; it's coincidental. A minimal end-to-end sketch of the pattern follows this list.

  • What is the difference between _resource_apply_dense and _resource_apply_sparse?

The latter deals with sparse gradient updates - e.g. those coming from an Embedding lookup - and the former with everything else; example. A quick way to see which path a given variable takes is shown below.
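
Gradients flowing out of a gather/embedding lookup arrive as tf.IndexedSlices, which is what gets routed to _resource_apply_sparse (small illustrative check; the layer sizes are arbitrary):

    import tensorflow as tf

    emb = tf.keras.layers.Embedding(1000, 16)
    dense = tf.keras.layers.Dense(1)

    x = tf.constant([[1, 2, 3]])
    with tf.GradientTape() as tape:
        y = dense(tf.reduce_mean(emb(x), axis=1))
    grads = tape.gradient(y, [emb.embeddings, dense.kernel])

    print(type(grads[0]).__name__)  # IndexedSlices -> handled by _resource_apply_sparse
    print(type(grads[1]).__name__)  # EagerTensor   -> handled by _resource_apply_dense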

  • When should I use _create_slots()?

When defining additional per-weight state tf.Variables that the optimizer itself updates; example: the weights' first and second order moments (e.g. in Adam). It uses add_slot().
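
For example, an Adam-style optimizer would create its moment slots roughly like this (a sketch; the slot names "m" and "v" are just conventions):

    import tensorflow as tf

    class AdamLikeSlots(tf.keras.optimizers.Optimizer):
        """Fragment showing only the slot-creation part of an Adam-style optimizer."""

        def _create_slots(self, var_list):
            for var in var_list:
                self.add_slot(var, "m")                       # first moment, zero-initialized by default
                self.add_slot(var, "v", initializer="zeros")  # second moment, initializer spelled out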


_get_hyper vs. _set_hyper: they enable setting and getting Python literals (int, str, etc), callables, and tensors. They exist largely for convenience: anything set via _set_hyper can be retrieved via _get_hyper, avoiding repeating boilerplate code. I dedicated a Q&A to it here.
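
A micro-example of the pattern (the hyperparameter name my_rate is arbitrary, and get_config is omitted for brevity):

    import tensorflow as tf

    class WithHypers(tf.keras.optimizers.Optimizer):
        def __init__(self, my_rate=0.05, name="WithHypers", **kwargs):
            super().__init__(name, **kwargs)
            # Accepts Python scalars, tensors, or callables (e.g. a LearningRateSchedule)
            self._set_hyper("my_rate", my_rate)

        def _create_slots(self, var_list):
            pass  # no per-variable state needed for this toy example

        def _resource_apply_dense(self, grad, var, apply_state=None):
            # Retrieved as a tensor cast to the variable's dtype
            my_rate = self._get_hyper("my_rate", var.dtype.base_dtype)
            return var.assign_sub(my_rate * grad)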

OverLordGoldDragon
  1. Yes, this looks to be a documentation error. The leading-underscore names are the correct methods to override. Related is the non-Keras Optimizer, which has these all declared, but not implemented, in the base class: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/optimizer.py (the corresponding Keras declarations are sketched after this list)
  def _create_slots(self, var_list):
    """Create all slots needed by the variables.
    Args:
      var_list: A list of `Variable` objects.
    """
    # No slots needed by default
    pass

  def _resource_apply_dense(self, grad, handle):
    """Add ops to apply dense gradients to the variable `handle`.
    Args:
      grad: a `Tensor` representing the gradient.
      handle: a `Tensor` of dtype `resource` which points to the variable
       to be updated.
    Returns:
      An `Operation` which updates the value of the variable.
    """
    raise NotImplementedError()

  def _resource_apply_sparse(self, grad, handle, indices):
    """Add ops to apply sparse gradients to the variable `handle`.
    Similar to `_apply_sparse`, the `indices` argument to this method has been
    de-duplicated. Optimizers which deal correctly with non-unique indices may
    instead override `_resource_apply_sparse_duplicate_indices` to avoid this
    overhead.
    Args:
      grad: a `Tensor` representing the gradient for the affected indices.
      handle: a `Tensor` of dtype `resource` which points to the variable
       to be updated.
      indices: a `Tensor` of integral type representing the indices for
       which the gradient is nonzero. Indices are unique.
    Returns:
      An `Operation` which updates the value of the variable.
    """
    raise NotImplementedError()
  2. I don't know about apply_gradients. For one thing, if you do override it, the code mentions that a per-replica DistributionStrategy could be "dangerous":
    # TODO(isaprykin): When using a DistributionStrategy, and when an
    # optimizer is created in each replica, it might be dangerous to
    # rely on some Optimizer methods.  When such methods are called on a
    # per-replica optimizer, an exception needs to be thrown.  We do
    # allow creation per-replica optimizers however, because the
    # compute_gradients()->apply_gradients() sequence is safe.
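
If I read the optimizer_v2.py declarations linked in the comments above correctly (v2.1.0), the Keras counterparts add an apply_state argument; abridged from that source, so double-check against the links:

  def _resource_apply_dense(self, grad, handle, apply_state):
    """Add ops to apply dense gradients to the variable `handle`.
    `apply_state` is a dict of values precomputed once per `apply_gradients`
    call (e.g. the decayed learning rate), shared across variables.
    """
    raise NotImplementedError()

  def _resource_apply_sparse(self, grad, handle, indices, apply_state):
    """Add ops to apply sparse gradients to the variable `handle`,
    restricted to the (de-duplicated) rows given by `indices`.
    """
    raise NotImplementedError()

  # Concrete subclasses such as Adam declare these with `apply_state=None`,
  # so overriding them with that keyword default is the safe pattern.
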
Tyler