
I'm trying to use TensorFlow for my deep learning project.
I need to implement my gradient update according to this formula:

[image: gradient update formula with momentum and L2 weight decay]

I have also implemented this part in Theano, and it produced the expected result. But when I use TensorFlow's MomentumOptimizer, the result is really bad. I don't know what is different between them.

Theano:

import theano
import theano.tensor as T

def gradient_updates_momentum_L2(cost, params, learning_rate, momentum, weight_cost_strength):
    # Make sure momentum is a sane value
    assert momentum < 1 and momentum >= 0
    # List of update steps for each parameter
    updates = []
    # Just gradient descent on cost
    for param in params:
        param_update = theano.shared(param.get_value()*0., broadcastable=param.broadcastable)
        updates.append((param, param - learning_rate*(param_update + weight_cost_strength * param_update)))
        updates.append((param_update, momentum*param_update + (1. - momentum)*T.grad(cost, param)))
    return updates

TensorFlow:

l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
cost = cost + WEIGHT_COST_STRENGTH * l2_loss
train_op = tf.train.MomentumOptimizer(LEARNING_RATE, MOMENTUM).minimize(cost)
Peter Yang

1 Answer


If you look at the implementation of the momentum optimizer in TensorFlow [link], you will see that it is implemented as follows:

accum = accum * momentum() + grad;
var -= accum * lr();

As you can see, the formulas are slightly different. Scaling the momentum term by the learning rate should resolve the discrepancy.
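
To make the scaling point concrete, here is a small pure-Python check (my own sketch, not taken from either library) showing that TensorFlow's two-step rule gives exactly the same trajectory as classical momentum once the learning rate is folded into the velocity term:

lr, momentum = 0.1, 0.9
grads = [0.5, 0.4, 0.3]            # made-up gradient sequence for illustration

# TensorFlow-style update: accum = momentum*accum + grad; w -= lr*accum
w_tf, accum = 1.0, 0.0
for g in grads:
    accum = momentum * accum + g
    w_tf -= lr * accum

# Classical momentum with the learning rate inside the velocity:
# v = momentum*v - lr*grad; w += v
w_cl, v = 1.0, 0.0
for g in grads:
    v = momentum * v - lr * g
    w_cl += v

print(w_tf, w_cl)                  # identical values: the two forms are equivalent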

It is also quite easy to implement such an optimizer yourself. The resulting code would look similar to the Theano snippet you included.
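
For example, a rough (untested) sketch of a hand-rolled momentum update in TensorFlow 1.x, reusing the cost, LEARNING_RATE, and MOMENTUM names from your snippet, could look like this:

params = tf.trainable_variables()
grads = tf.gradients(cost, params)

update_ops = []
for param, grad in zip(params, grads):
    # One accumulator (velocity) per parameter, initialised to zero.
    accum = tf.Variable(tf.zeros_like(param.initialized_value()), trainable=False)
    # accum = accum * momentum + grad
    new_accum = tf.assign(accum, accum * MOMENTUM + grad)
    # var -= accum * lr (using new_accum keeps the update order correct)
    update_ops.append(tf.assign_sub(param, LEARNING_RATE * new_accum))

train_op = tf.group(*update_ops)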

Rafał Józefowicz
  • It is not the only difference. The formula posted by OP updates `w(t)` by adding the momentum term `\alpha v(t-1)`, whilst tensorflow code actually subtracts it. According to [this](http://sebastianruder.com/optimizing-gradient-descent/) the tensorflow code seems to be more correct. – Tu Bui Apr 11 '17 at 11:20