RNN in TensorFlow vs Keras, deprecation of tf.nn.dynamic_rnn()

Question

My question is: are tf.nn.dynamic_rnn and keras.layers.RNN(cell) truly identical, as stated in the docs?

I am planning on building an RNN; however, it seems that tf.nn.dynamic_rnn is deprecated in favour of Keras.

In particular, it states that:

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Please use keras.layers.RNN(cell), which is equivalent to this API

But I don't see how the APIs are equivalent, in the case of variable sequence lengths!

In raw TF, we can pass a sequence_length tensor of shape (batch_size,). This way, if our sequence is [0, 1, 2, 3, 4] and the longest sequence in the batch is of size 10, we can pad it with 0s to get [0, 1, 2, 3, 4, 0, 0, 0, 0, 0] and set seq_length=5 so that only [0, 1, 2, 3, 4] is processed.
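To make the setup concrete, here is a minimal sketch in plain Python of how such a padded batch and its lengths could be built. The helper name pad_batch and the second sequence are hypothetical, chosen only for illustration; the resulting lengths are what would be passed as sequence_length.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad variable-length sequences and record their true lengths."""
    max_len = max(len(seq) for seq in sequences)
    lengths = [len(seq) for seq in sequences]
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# First row is the sequence from the text; the second row is made up
# so that the longest sequence in the batch has length 10.
batch = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]]
padded, seq_lengths = pad_batch(batch)
```

Here padded[0] is the zero-padded row and seq_lengths[0] is 5, so the real leading 0 stays part of the data.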

However, in Keras, this is not how it works! What we can do is specify mask_zero=True in a previous layer, e.g. the Embedding layer. This will also mask the 1st zero!
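The problem can be illustrated without Keras. The sketch below mimics the rule that mask_zero=True applies (a timestep is kept only if its token id is not 0); the function name zero_mask is hypothetical and this is not the Keras implementation itself.

```python
def zero_mask(token_ids):
    """Mimic mask_zero=True: keep a timestep only if its token id is non-zero."""
    return [tok != 0 for tok in token_ids]


padded = [0, 1, 2, 3, 4, 0, 0, 0, 0, 0]
mask = zero_mask(padded)
# mask[0] is False: the real leading token 0 is dropped along with the padding.
```

With sequence_length=5 the first element would be processed; with a value-based mask it cannot be told apart from padding.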

I can work around it by adding one to the whole vector, but then that's extra preprocessing that I need to do after processing using tft.compute_vocabulary(), which maps vocabulary words to a 0-indexed vector.
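One possible form of that workaround, sketched in plain Python under the assumption that the true lengths are still known at this point (the helper name reserve_zero_for_padding is hypothetical):

```python
def reserve_zero_for_padding(padded, lengths, pad_id=0):
    """Shift real token ids up by one; positions past each length stay pad_id."""
    return [
        [tok + 1 if t < n else pad_id for t, tok in enumerate(row)]
        for row, n in zip(padded, lengths)
    ]


padded = [[0, 1, 2, 3, 4, 0, 0, 0, 0, 0]]
shifted = reserve_zero_for_padding(padded, lengths=[5])
```

After the shift, id 0 appears only in padded positions, so a zero-based mask no longer hides real tokens. The embedding table would then need one extra row to cover the shifted ids.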

I question, whether you should really care about that (i.e. the previous `seq_lengths`). From the docs *...So it's more for performance than correctness.* — rst, Mar 15 '19 at 10:22
@rst I don't actually understand the issue about correctness. If I input the 0s, the matrix multiplication will also give 0, but then the bias term is added and the result is passed through the activation function. I will most likely get a non-zero output due to the bias term. Hence the bias weight will continue to train? Or is my understanding incorrect? — GRS, Mar 15 '19 at 17:07
@rst Assuming they mean that there is no difference between passing the remaining 'padded' 0s into the RNN or masking them e.g. not training on them. — GRS, Mar 15 '19 at 17:08
For now you can use `tf.keras.layers.Masking()` to deal with it, but the thing is that Masking is not supported by the `CuDNN RNN` layers. The problem will probably be solved in TF 2.0: https://github.com/tensorflow/tensorflow/issues/23269 — Nicolabo, Apr 10 '19 at 11:43

score 7 · Accepted Answer · answered May 01 '19 at 13:45

No, but they are (or can be made to be) not so different either.

TL;DR

tf.nn.dynamic_rnn replaces elements after the sequence end with 0s. This cannot be replicated with tf.keras.layers.* as far as I know, but you can get similar behaviour with the RNN(Masking(...)) approach: it simply stops the computation and carries the last outputs and states forward. You will get the same (non-padding) outputs as those obtained from tf.nn.dynamic_rnn.
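The three behaviours can be sketched in plain Python. This is only an illustration under stated assumptions: the recurrence h_t = 2 * h_{t-1} + x_t is a toy stand-in, not a GRU, and the function name run_toy_rnn, the mode names and the input values are all hypothetical.

```python
def run_toy_rnn(inputs, length, mode):
    """Run the toy recurrence h_t = 2 * h_{t-1} + x_t over one padded sequence.

    mode="none":  padding is processed as if it were data (no masking)
    mode="carry": padded steps are skipped and the last output is repeated
    mode="zero":  padded steps are skipped and the output is set to 0
    """
    if mode not in ("none", "carry", "zero"):
        raise ValueError("unknown mode: %r" % (mode,))
    h = 0.0
    outputs = []
    for t, x in enumerate(inputs):
        if t < length or mode == "none":
            h = 2.0 * h + x          # regular update
            outputs.append(h)
        elif mode == "carry":
            outputs.append(h)        # state and output carried forward
        else:
            outputs.append(0.0)      # state kept, output zeroed
    return outputs


padded = [1.0, 2.0, 1.0, 0.5, 0.5]   # last two steps are padding
no_mask = run_toy_rnn(padded, length=3, mode="none")
carried = run_toy_rnn(padded, length=3, mode="carry")
zeroed = run_toy_rnn(padded, length=3, mode="zero")
```

The first three outputs are the same in all modes; only the padded steps differ, which is the pattern the experiment below shows for the real layers.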

Experiment

Here is a minimal working example demonstrating the differences between tf.nn.dynamic_rnn and tf.keras.layers.GRU, with and without the tf.keras.layers.Masking layer.

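# Note: this script targets TensorFlow 1.x
# (it uses tf.nn.dynamic_rnn and tf.keras.backend.get_session).
import numpy as np
import tensorflow as tf

# Two padded sequences of token ids. Id 0 is the padding value,
# but it also appears as the leading token of the second row.
test_input = np.array([
    [1, 2, 1, 0, 0],
    [0, 1, 2, 1, 0]
], dtype=int)
seq_length = tf.constant(np.array([3, 4], dtype=int))

# Fixed embedding table: id 0 -> 0.37, id 1 -> 1, id 2 -> 2 (in both dimensions).
emb_weights = (np.ones(shape=(3, 2)) * np.transpose([[0.37, 1, 2]])).astype(np.float32)
emb = tf.keras.layers.Embedding(
    *emb_weights.shape,
    weights=[emb_weights],
    trainable=False
)
# Mask every timestep whose features all equal 0.37, i.e. every embedded id 0.
mask = tf.keras.layers.Masking(mask_value=0.37)
# Single-unit GRU with constant initializers and no activations,
# so the outputs are deterministic and easy to compare.
rnn = tf.keras.layers.GRU(
    1,
    return_sequences=True,
    activation=None,
    recurrent_activation=None,
    kernel_initializer='ones',
    recurrent_initializer='zeros',
    use_bias=True,
    bias_initializer='ones'
)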


def old_rnn(inputs):
    # Old API: reuse the cell of the GRU layer above so both paths share weights.
    rnn_outputs, rnn_states = tf.nn.dynamic_rnn(
        rnn.cell,
        inputs,
        dtype=tf.float32,
        sequence_length=seq_length
    )
    return rnn_outputs


x = tf.keras.layers.Input(shape=test_input.shape[1:])
m0 = tf.keras.Model(inputs=x, outputs=emb(x))
m1 = tf.keras.Model(inputs=x, outputs=rnn(emb(x)))
m2 = tf.keras.Model(inputs=x, outputs=rnn(mask(emb(x))))

print(m0.predict(test_input).squeeze())
print(m1.predict(test_input).squeeze())
print(m2.predict(test_input).squeeze())

sess = tf.keras.backend.get_session()
print(sess.run(old_rnn(mask(emb(x))), feed_dict={x: test_input}).squeeze())

The outputs from m0 are there to show the result of applying the embedding layer. Note that there are no zero entries at all:

[[[1.   1.  ]    [[0.37 0.37]
  [2.   2.  ]     [1.   1.  ]
  [1.   1.  ]     [2.   2.  ]
  [0.37 0.37]     [1.   1.  ]
  [0.37 0.37]]    [0.37 0.37]]]

Now here are the actual outputs from the m1, m2 and old_rnn architectures:

m1:  [[  -6.      -50.       -156.     -272.7276  -475.83362]
      [  -1.2876   -9.862801  -69.314  -213.94202 -373.54672]]
m2:  [[  -6.  -50. -156. -156. -156.]
      [   0.   -6.  -50. -156. -156.]]
old: [[  -6.  -50. -156.    0.    0.]
      [   0.   -6.  -50. -156.    0.]]

Summary

The old tf.nn.dynamic_rnn used to mask padding elements with zeros.
The new RNN layers without masking run over the padding elements as if they were data.
The new rnn(mask(...)) approach simply stops the computation and carries the last outputs and states forward. Note that the (non-padding) outputs that I obtained for this approach are exactly the same as those from tf.nn.dynamic_rnn.
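If the zero-filled outputs of the old API are needed, one option is to zero the carried values afterwards using the known sequence lengths. The sketch below does this in plain Python on the m2 values printed above; the helper name zero_after_length is hypothetical and this post-processing step is an assumption of mine, not part of either API.

```python
def zero_after_length(outputs, lengths):
    """Set every output at position t >= length to 0.0, row by row."""
    return [
        [value if t < n else 0.0 for t, value in enumerate(row)]
        for row, n in zip(outputs, lengths)
    ]


# Values copied from the m2 output above, with the seq_length used in the script.
m2_outputs = [
    [-6.0, -50.0, -156.0, -156.0, -156.0],
    [0.0, -6.0, -50.0, -156.0, -156.0],
]
seq_length = [3, 4]
zeroed = zero_after_length(m2_outputs, seq_length)
```

The result matches the old output shown above. In TensorFlow the same per-timestep mask could presumably be built with tf.sequence_mask and multiplied into the RNN output, but I have not run that variant here.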

Anyway, I cannot cover all possible edge cases, but I hope that you can use this script to figure things out further.

I expanded on this [in this answer](https://stackoverflow.com/questions/55264696/tensorflow-dynamic-rnn-deprecation) to show masking without an embedding layer. Great answer, this helped me a lot. — parrowdice, Jun 19 '19 at 09:30
I made an interesting discovery this evening: if you wrap the GRU cell in a Bidirectional layer, it will convert the carried outputs to zero, therefore obtaining identical output to the old implementation without the necessity of predefined sequence lengths. — Ryan Walden, Dec 19 '19 at 04:27
