
I am trying to wrap my head around back-propagation in a neural network with a Softmax classifier, which uses the Softmax function:

\begin{equation} p_j = \frac{e^{o_j}}{\sum_k e^{o_k}} \end{equation}
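(For concreteness, here is a minimal NumPy sketch of what I mean; the shift by $\max(o)$ is just the usual numerical-stability trick and does not change the output:)

```python
import numpy as np

def softmax(o):
    """Compute p_j = exp(o_j) / sum_k exp(o_k) for a score vector o."""
    e = np.exp(o - np.max(o))  # shift by max(o) for numerical stability
    return e / e.sum()
```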

This is used in a loss function of the form

\begin{equation}L = -\sum_j y_j \log p_j,\end{equation}

where $o$ is the vector of scores going into the softmax and $y$ is the one-hot target vector. I need the derivative of $L$ with respect to $o$. Now, if my derivatives are right,

\begin{equation} \frac{\partial p_j}{\partial o_i} = p_i(1 - p_i),\quad i = j \end{equation}

and

\begin{equation} \frac{\partial p_j}{\partial o_i} = -p_i p_j,\quad i \neq j. \end{equation}
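(Both cases can be sanity-checked numerically against the full Jacobian $\operatorname{diag}(p) - pp^T$; a sketch with made-up scores, reusing the softmax snippet above:)

```python
# Check dp_j/do_i = p_i (1 - p_i) for i = j and -p_i p_j for i != j,
# i.e. the Jacobian diag(p) - p p^T, by central differences.
o = np.array([1.0, -0.5, 2.0])            # made-up scores
p = softmax(o)
analytic = np.diag(p) - np.outer(p, p)

eps = 1e-6
numeric = np.zeros((3, 3))
for i in range(3):
    d = np.zeros(3); d[i] = eps
    numeric[i] = (softmax(o + d) - softmax(o - d)) / (2 * eps)  # row i = dp/do_i

print(np.allclose(numeric, analytic))     # True
```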

Using this result we obtain

\begin{eqnarray} \frac{\partial L}{\partial o_i} &=& -\left( y_i (1 - p_i) + \sum_{k\neq i} -p_k y_k \right)\\ &=& p_i y_i - y_i + \sum_{k\neq i} p_k y_k\\ &=& \left( \sum_k p_k y_k \right) - y_i. \end{eqnarray}

According to the slides I am using, however, the result should be

\begin{equation} \frac{\partial L}{\partial o_i} = p_i - y_i. \end{equation}
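(And indeed, a central-difference check with made-up values, again reusing the softmax snippet above, reproduces the slides' $p_i - y_i$ rather than my expression:)

```python
# Numerical gradient of L = -sum_j y_j log p_j versus the slides' p - y.
o = np.array([1.0, -0.5, 2.0])
y = np.array([0.0, 1.0, 0.0])             # one-hot target
L = lambda o: -np.sum(y * np.log(softmax(o)))

eps = 1e-6
numeric = np.array([(L(o + eps * np.eye(3)[i]) - L(o - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])
print(np.allclose(numeric, softmax(o) - y))   # True: dL/do = p - y
```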

Can someone please tell me where I'm going wrong?

– Moos Hueting
  • For others who end up here: this thread is about computing the derivative of the cross-entropy function, which is the cost function often used with a softmax layer (though the derivative of the cross-entropy uses the derivative of the softmax, the $-p_k y_k$ term in the equation above). Eli Bendersky has an awesome derivation of the softmax and its associated cost function here: https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/ – duhaime Jan 01 '18 at 17:52

1 Answer


Your derivatives $\frac{\partial p_j}{\partial o_i}$ are indeed correct; however, there is an error when you differentiate the loss function $L$ with respect to $o_i$.

We have the following (the step where you have gone wrong is highlighted in $\color{red}{\text{red}}$):
$$\begin{align}
\frac{\partial L}{\partial o_i} &= -\sum_k y_k \frac{\partial \log p_k}{\partial o_i} = -\sum_k y_k \frac{1}{p_k} \frac{\partial p_k}{\partial o_i}\\
&= -y_i(1 - p_i) - \sum_{k\neq i} y_k \frac{1}{p_k}({\color{red}{-p_k p_i}})\\
&= -y_i(1 - p_i) + \sum_{k\neq i} y_k\,({\color{red}{p_i}})\\
&= -y_i + \color{blue}{y_i p_i + \sum_{k\neq i} y_k\,p_i}\\
&= \color{blue}{p_i\left(\sum_k y_k\right)} - y_i = p_i - y_i,
\end{align}$$
given that $\sum_k y_k = 1$ from the slides (as $y$ is a vector with only one non-zero element, which is $1$, i.e. a one-hot vector).
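Since this holds for every component $i$, the gradient with respect to the whole vector is simply $p - y$. As a self-contained sketch (NumPy; the function name is mine, and the max-shift is just the usual stability trick):

```python
import numpy as np

def softmax_xent_grad(o, y):
    """Gradient dL/do of L = -sum_k y_k log softmax(o)_k.
    For any y with sum(y) == 1 (e.g. one-hot), this is exactly p - y."""
    e = np.exp(o - np.max(o))   # shift by max(o) for numerical stability
    p = e / e.sum()
    return p - y
```

With a one-hot `y`, `softmax_xent_grad(o, y)` returns exactly the $p_i - y_i$ derived above, componentwise over the whole vector.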

– Alijah Ahmed
  • Ah, yes, I see. And I'm not even tired - no one to blame but me! Thanks for your help, Alijah. – Moos Hueting Sep 25 '14 at 17:30
  • Moos, you are most welcome. Glad to be of help. – Alijah Ahmed Sep 25 '14 at 17:54
  • I am unsure how to get to the last line from the previous line in this answer. Would it be possible to post more information? – FatalMojo Sep 19 '15 at 05:21
  • @FatalMojo I have added an extra line between the last and the penultimate lines, and highlighted some terms in blue. – Alijah Ahmed Sep 19 '15 at 09:10
  • @AlijahAhmed Can I ask how you got the first line, and how it leads to the second line? – aceminer May 25 '17 at 14:20
  • @aceminer For the first line, the $y_k$ do not depend on $o_i$, so they are constants. This leads to $\frac{\partial L}{\partial o_i}=-\sum_k\color{red}{y_k}\frac{\partial \log p_k}{\partial o_i}$. Then you use the identity $\frac{\partial \log f(x)}{\partial x}=\frac{1}{f(x)}\frac{\partial f(x)}{\partial x}$, which gives $-\sum_k y_k\frac{1}{p_k}\frac{\partial p_k}{\partial o_i}$ at the end of the first line. For the second line, we use the results $\frac{\partial p_j}{\partial o_i} = p_i(1 - p_i)$ for $i = j$ and $\frac{\partial p_j}{\partial o_i} = -p_i p_j$ for $i \neq j$. – Alijah Ahmed Jun 04 '17 at 11:51
  • Awesome question, awesome answer - now I feel calmness inside. Thanks! – MIRMIX Nov 14 '17 at 22:33
  • Can someone explain how this result generalizes to $\frac{\partial L}{\partial o}$, i.e. the derivative of the loss with respect to the whole vector rather than a single entry? – amj Nov 28 '19 at 21:10
  • Here is a more stretched-out version, for those who, like me, get confused by the indices: https://alexcpn.github.io/html/NN/ml/7_cnn_network/ – Alex Punnen Feb 16 '22 at 12:32