3

From this post, we can write a custom loss function. Now assume that the custom loss function depends on a parameter a:

from keras import backend as K

def customLoss(yTrue, yPred):
    return (K.log(yTrue) - K.log(yPred))**2 + a*yPred

How can we update the parameter a at each step via gradient descent, just like the weights?

a_new = a_old - alpha * (derivative of custom loss with respect to a)
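
For concreteness, here is the kind of update I mean, as a minimal standalone sketch (assuming TensorFlow 2 eager execution; the names and values are made up for illustration):

import tensorflow as tf

a = tf.Variable(0.1)         # the loss parameter to be updated
alpha = 0.01                 # learning rate for a

def custom_loss(y_true, y_pred, a):
    return (tf.math.log(y_true) - tf.math.log(y_pred))**2 + a * y_pred

y_true = tf.constant([1.0])
y_pred = tf.constant([0.8])  # would normally come from the model

with tf.GradientTape() as tape:
    loss = custom_loss(y_true, y_pred, a)

grad_a = tape.gradient(loss, a)   # derivative of the loss w.r.t. a
a.assign_sub(alpha * grad_a)      # a_new = a_old - alpha * dL/da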

P.S. The real custom loss is different from the one above. Please give me a general answer that works for any arbitrary custom loss function, not an answer specific to the example above.

OverLordGoldDragon
Albert
  • Very interesting problem. – Sus20200 Oct 25 '19 at 21:54
  • I've never seen a parameter directly in the loss function be updated via gradient descent, nor do I think it viable or feasible; to be of better help, your exact 'actual' loss expression would help. Regardless what it is, however, I can imagine it being updated so to drive loss straight to zero at each iteration - nullifying any 'learning'. – OverLordGoldDragon Oct 25 '19 at 22:05
  • There _is_ a workaround, by treating _a_ as an optimizable hyperparameter, updated via a 'meta-learner' - and while it can be done per-iteration, it'd work a lot better on at least a per-epoch basis. Optimization can be done both w.r.t. train and validation sets. Let me know if interested. – OverLordGoldDragon Oct 25 '19 at 22:16
  • @OverLordGoldDragon Yes, that's what I want to do. I want to find an a that leads to minimal loss. I think your workaround gives the solution to this. – Albert Oct 25 '19 at 22:22
  • @OverLordGoldDragon Why do you think it works better if a is updated after each epoch, and not after each iteration? – Albert Oct 25 '19 at 22:32
  • @Albert `a` is a _hyperparameter_, not a parameter, hence it cannot be optimized via gradient descent (unless via quite complex model definitions). The idea then becomes to _sample the hyperparameter space_ - and you don't have a good idea of how well a given hyperparam combination `H` performs until a sufficient number of iterations has run. A viable process is: (1) obtain a number of `val_loss`-`H` pairs via early-stopping, (2) feed them to the meta-learner, (3) meta-learner suggests new `H` (e.g. just a different `a`); (4) repeat (2,3) - a rough sketch of such a loop is shown after these comments. If such an approach works for you, I'll write the full answer – OverLordGoldDragon Oct 25 '19 at 23:47
  • What kind of answer do you prefer? Hacky keras models with strange workarounds or a custom training loop with eager mode on? – Daniel Möller Oct 27 '19 at 00:11
  • @OverLordGoldDragon Thanks for the explanation. 1) You can consider a as a parameter as well, noting that the final value of the loss is a function of a too. So why can't we do gradient descent on it? 2) By sampling the hyperparameter space, do you mean doing a grid search? 3) In your method, how does the meta-learner suggest a new a? – Albert Oct 28 '19 at 17:02
  • @DanielMöller Thanks. I am not sure what you mean by a custom training loop with eager mode on. But I prefer the simplest method. – Albert Oct 28 '19 at 17:03
  • @Albert You 'can' treat `a` as a parameter, but it may not make much sense, and be highly counterproductive. _Ex_: `loss = a*func(y, y_pred)` -- thus, you can drive all loss, train and validation, to zero, via `a = 0` - with model learning nothing. It depends on `a`'s exact purpose, which you haven't specified - but whatever it is, I doubt per-iteration updates will be involved, for reasons I can explain in my answer. (2): grid search is one way, but not only. (3): [Bayesian Optimization](https://philipperemy.github.io/visualization/), and related methods. – OverLordGoldDragon Oct 28 '19 at 17:12
  • @OverLordGoldDragon Thanks. But I still do not agree that treating it as a parameter is counterproductive. Your example is OK for me, and I will be happy as long as such an a exists (my function is different). The grid search cannot lead to the optimal point, it can only lead to points close to it. Anyway, can you please write the solution which you think is better? – Albert Oct 28 '19 at 17:54
  • @OverLordGoldDragon I have posted a related question here https://datascience.stackexchange.com/questions/62323/grid-search-or-gradient-descent – Albert Oct 28 '19 at 17:54
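
As a rough illustration of the (1)-(4) loop described in the comments above (a hedged sketch, not an exact recipe: build_model, x_train, y_train, x_val, y_val are placeholders, and uniform random sampling stands in for a real meta-learner such as Bayesian optimization):

import numpy as np
from keras.callbacks import EarlyStopping

best_a, best_val_loss = None, np.inf
a_candidate = 0.1                      # initial guess for the hyperparameter

for trial in range(20):
    # build_model is a placeholder that compiles a model whose loss uses a_candidate
    model = build_model(a_candidate)
    hist = model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     epochs=100,
                     callbacks=[EarlyStopping(patience=5)],
                     verbose=0)
    # step (1): obtain a val_loss-H pair via early stopping
    val_loss = min(hist.history['val_loss'])
    if val_loss < best_val_loss:
        best_a, best_val_loss = a_candidate, val_loss
    # steps (2)-(3): a real meta-learner would propose the next a from all
    # (a, val_loss) pairs seen so far; uniform sampling is the simplest stand-in
    a_candidate = np.random.uniform(0.0, 1.0)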

1 Answer

3

Create a custom layer to hold the trainable parameter. This layer will not return its inputs in call; it only takes inputs to comply with how Keras layers are created.

import keras
from keras.layers import Layer

class TrainableLossLayer(Layer):

    def __init__(self, a_initializer, **kwargs):
        super(TrainableLossLayer, self).__init__(**kwargs)
        self.a_initializer = keras.initializers.get(a_initializer)

    # method where the weights are defined
    def build(self, input_shape):
        self.kernel = self.add_weight(name='kernel_a',
                                      shape=(1,),
                                      initializer=self.a_initializer,
                                      trainable=True)
        self.built = True

    # method defining the layer's operation (only returns the weight, ignoring inputs)
    def call(self, inputs):
        return self.kernel

    # output shape
    def compute_output_shape(self, input_shape):
        return (1,)

Use the layer in your model to get a with any inputs (this is not compatible with a Sequential model):

a = TrainableLossLayer(a_init, name="somename")(anyInput)

Now, you can try to define your loss in a sort of ugly way:

def customLoss(yTrue, yPred):
    # 'a' here is the tensor returned by TrainableLossLayer, captured via closure
    return (K.log(yTrue) - K.log(yPred))**2 + a*yPred

If this works, then it's ready.
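
Putting the pieces together, a minimal end-to-end sketch of this first approach could look like the following (the input shape, layer sizes, and names are illustrative; as noted above, verify that the weight of TrainableLossLayer actually gets updated, otherwise use the second approach below):

from keras import backend as K
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(10,))
a = TrainableLossLayer('zeros', name="loss_param_a")(inputs)
outputs = Dense(1, activation='softplus')(inputs)  # positive outputs, since the loss takes K.log(yPred)

model = Model(inputs, outputs)

def customLoss(yTrue, yPred):
    return (K.log(yTrue) - K.log(yPred))**2 + a*yPred  # 'a' captured via closure

model.compile(optimizer='adam', loss=customLoss)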


You can also try a more complicated model (useful if you don't want to use a in the loss by jumping over the layers like that, since that might cause problems in model saving/loading).

In this case, you will need y_train to go in as an input instead of an output:

y_true_inputs = Input(...)

Your loss function will go into a Lambda layer taking all parameters properly:

def lambdaLoss(x):
    yTrue, yPred, alpha = x
    return (K.log(yTrue) - K.log(yPred))**2 + alpha*yPred

loss = Lambda(lambdaLoss)([y_true_inputs, original_model_outputs, a])

Your model will output this loss:

model = Model([original_model_inputs, y_true_inputs], loss)

You will have a dummy loss function (since the model's output already is the computed loss, the compiled loss just passes the prediction through):

def dummyLoss(true, pred):
    return pred

model.compile(loss = dummyLoss, ...)

And train as:

model.fit([x_train, y_train], anything_maybe_None_or_np_zeros, ...)
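
For reference, the full wiring of this second approach could look like the following (shapes and names are illustrative; x_train and y_train are your data):

import numpy as np
from keras import backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model

original_model_inputs = Input(shape=(10,))
original_model_outputs = Dense(1, activation='softplus')(original_model_inputs)
y_true_inputs = Input(shape=(1,))

# 'a' is produced by the trainable layer defined above
a = TrainableLossLayer('zeros', name="loss_param_a")(original_model_inputs)

loss = Lambda(lambdaLoss)([y_true_inputs, original_model_outputs, a])

model = Model([original_model_inputs, y_true_inputs], loss)
model.compile(optimizer='adam', loss=dummyLoss)
model.fit([x_train, y_train], np.zeros((len(x_train), 1)))

# after training, read back the learned value of a
a_value = model.get_layer("loss_param_a").get_weights()[0]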
Daniel Möller
  • I cannot think of a single practical example where this is a good idea, nor can I see how the gradient is even defined; how would you write the chain rule from loss to `da`? Whatever the case, a per-iteration update on a parameter directly in the loss function - I'm quite curious to see a useful application; @Albert , have a reference paper? – OverLordGoldDragon Nov 25 '19 at 15:06
  • Regardless, interesting solution, D. Möller – OverLordGoldDragon Nov 25 '19 at 15:07
  • How can't you see? It's clear. The loss is `pred`, and pred is a calculated loss inside the model, which uses everything a regular loss would use, with a few more things. – Daniel Möller Nov 26 '19 at 22:57
  • I didn't go scientifically about the usefulness of it, I just answered "how to do it". Mathematically there isn't a problem. The gradients are calculated exactly like any other gradient: Keras automatically goes from the loss propagating to the trainable weights. – Daniel Möller Nov 26 '19 at 23:00
  • About trainable inputs, once I used it for style transfer, worked great. – Daniel Möller Nov 26 '19 at 23:02
  • What I meant more by "how to get `da`" is how to define it meaningfully w.r.t. model - e.g., lambda l2-penalty is tied back to source weights. But admittedly, for some reason I've limited context to "standard" weight training - style transfer, activation maximization, and the like, which operate on _non-weight_ parameters, may indeed use it. -- Alright, arrow up – OverLordGoldDragon Nov 27 '19 at 06:05
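
For the example loss in the question, the chain rule from the loss to a is in fact a single step, since a enters the loss linearly:

L = (K.log(yTrue) - K.log(yPred))**2 + a*yPred
dL/da = yPred

so backpropagation can treat the kernel of TrainableLossLayer like any other trainable weight.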