
I am currently implementing a CNN in plain numpy and have a brief question regarding a special case of the backpropagation for a max-pool layer:

While it is clear that the gradient with respect to non-maximum values vanishes, I am not sure about the case where several entries of a slice are equal to the maximum value. Strictly speaking, the max function is not differentiable at such a point. However, I would assume that one can pick a subgradient from the corresponding subdifferential (similar to choosing the subgradient 0 for the ReLU function at x = 0).

Hence, I am wondering whether it would be sufficient to simply form the gradient with respect to one of the maximum values and treat the remaining maximum values as non-maximum values.

If that is the case, would it be advisable to randomize the selection of the maximum value to avoid bias or is it okay to always pick the first maximum value?
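
To make the option concrete, here is a minimal NumPy sketch of the "route the gradient to the first maximum" variant (a single 2D slice with non-overlapping windows is assumed; the function name and shapes are only illustrative, not my actual layer code):

```python
import numpy as np

def maxpool_backward_first(x, grad_out, pool=2, stride=2):
    """Backward pass of 2D max pooling over a single (H, W) slice.

    Each upstream gradient is routed to the first maximal entry of its
    window (np.argmax breaks ties in row-major order), i.e. one
    particular subgradient is picked whenever several entries tie.
    """
    grad_x = np.zeros_like(x, dtype=float)
    out_h = (x.shape[0] - pool) // stride + 1
    out_w = (x.shape[1] - pool) // stride + 1
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + pool,
                       j * stride:j * stride + pool]
            # argmax returns the flat index of the *first* maximum,
            # so ties are resolved deterministically
            r, c = np.unravel_index(np.argmax(window), window.shape)
            grad_x[i * stride + r, j * stride + c] += grad_out[i, j]
    return grad_x

# tiny example with a tie: both entries in the top row equal 1
x = np.array([[1.0, 1.0],
              [0.0, 0.5]])
print(maxpool_backward_first(x, np.array([[1.0]])))
# gradient lands only on x[0, 0], the first maximum
```

Randomizing the tie-break would amount to replacing the np.argmax call with a random choice among the window's maxima, at the cost of determinism between runs.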

x3t2h
  • This is a brilliant question; I haven't actually thought about it before, and I think it remains largely unaddressed for a couple of reasons. 1. When weights are initialised properly, output values tend to have quite a few decimal places, making the chance of them actually being equal very close to 0, so it barely happens "in nature". 2. When it does happen, distributing the gradient vs. just giving it all to a single node makes minimal difference, since it happens so rarely. That being said, I would discourage any sort of random distribution, as it removes the determinism from a CNN. – Recessive Aug 16 '19 at 06:02
  • For consistency, TensorFlow passes the gradient to the first of the equal values. – Nihar Karve Jun 02 '20 at 06:35
  • Similar question at [https://datascience.stackexchange.com/questions/11699/backprop-through-max-pooling-layers](https://datascience.stackexchange.com/questions/11699/backprop-through-max-pooling-layers) – mr_mo May 14 '21 at 18:47

0 Answers