I am trying to wrap my head around back-propagation in a neural network with a Softmax classifier, which uses the Softmax function:

\begin{equation} p_j = \frac{e^{o_j}}{\sum_k e^{o_k}} \end{equation}

This is used in a loss function of the form

\begin{equation}L = -\sum_j y_j \log p_j,\end{equation}

where $o$ is a vector. I need the derivative of $L$ with respect to $o$. Now if my derivatives are right,

\begin{equation} \frac{\partial p_j}{\partial o_i} = p_i(1 - p_i),\quad i = j \end{equation}


\begin{equation} \frac{\partial p_j}{\partial o_i} = -p_i p_j,\quad i \neq j. \end{equation}
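As a quick sanity check of these two cases, here is a minimal sketch that compares them against central finite differences. The helper names (`softmax_jacobian`, `numeric_jacobian`) and the example logits are made up for illustration and are not from the slides; both cases are folded into the single expression $p_j(\delta_{ij} - p_i)$.

```python
import math

def softmax(o):
    """Softmax with the max subtracted for numerical stability."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(o):
    """Analytic Jacobian J[j][i] = dp_j/do_i = p_j * (delta_ij - p_i)."""
    p = softmax(o)
    n = len(o)
    return [[p[j] * ((1.0 if i == j else 0.0) - p[i]) for i in range(n)]
            for j in range(n)]

def numeric_jacobian(o, eps=1e-6):
    """Central finite differences, perturbing one input coordinate at a time."""
    n = len(o)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        hi = list(o)
        lo = list(o)
        hi[i] += eps
        lo[i] -= eps
        p_hi, p_lo = softmax(hi), softmax(lo)
        for j in range(n):
            J[j][i] = (p_hi[j] - p_lo[j]) / (2 * eps)
    return J

o = [0.5, -1.2, 2.0]  # arbitrary example logits (assumed values)
J_a = softmax_jacobian(o)
J_n = numeric_jacobian(o)

# Largest entrywise disagreement between the analytic and numeric Jacobians.
max_err = max(abs(J_a[j][i] - J_n[j][i]) for j in range(3) for i in range(3))
print("max abs difference:", max_err)
```

If the two formulas are right, the printed difference should be tiny (on the order of the finite-difference error), which is what I see.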

Using this result we obtain

\begin{eqnarray} \frac{\partial L}{\partial o_i} &=& - \left (y_i (1 - p_i) + \sum_{k\neq i}-p_k y_k \right )\\ &=&p_i y_i - y_i + \sum_{k\neq i} p_k y_k\\ &=& \left (\sum_k p_k y_k \right ) - y_i \end{eqnarray}

According to the slides I'm using, however, the result should be

\begin{equation} \frac{\partial L}{\partial o_i} = p_i - y_i. \end{equation}

Can someone please tell me where I'm going wrong?

Moos Hueting
  • For others who end up here, this thread is about computing the derivative of the cross-entropy function, which is the cost function often used with a softmax layer (though the derivative of the cross-entropy function uses the derivative of the softmax, $-p_k y_k$, in the equation above). Eli Bendersky has an awesome derivation of the softmax and its associated cost function here: https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/ – duhaime Jan 01 '18 at 17:52

1 Answer


Your derivatives $\large \frac{\partial p_j}{\partial o_i}$ are indeed correct; however, there is an error when you differentiate the loss function $L$ with respect to $o_i$.

We have the following (where I have highlighted in $\color{red}{red}$ where you have gone wrong):
$$\begin{aligned}
\frac{\partial L}{\partial o_i}&=-\sum_ky_k\frac{\partial \log p_k}{\partial o_i}=-\sum_ky_k\frac{1}{p_k}\frac{\partial p_k}{\partial o_i}\\
&=-y_i(1-p_i)-\sum_{k\neq i}y_k\frac{1}{p_k}({\color{red}{-p_kp_i}})\\
&=-y_i(1-p_i)+\sum_{k\neq i}y_k({\color{red}{p_i}})\\
&=-y_i+\color{blue}{y_ip_i+\sum_{k\neq i}y_k({p_i})}\\
&=\color{blue}{p_i\left(\sum_ky_k\right)}-y_i=p_i-y_i
\end{aligned}$$
given that $\sum_ky_k=1$ from the slides (as $y$ is a vector with only one non-zero element, which is $1$).
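The result $p_i - y_i$ can also be checked numerically. Below is a minimal sketch that compares it with a central finite-difference gradient of the loss. The function names and the example values of $o$ and $y$ are assumptions chosen for illustration; $y$ is taken to be one-hot, as in the slides.

```python
import math

def softmax(o):
    """Softmax with the max subtracted for numerical stability."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def loss(o, y):
    """Cross-entropy L = -sum_j y_j * log(p_j), with p = softmax(o)."""
    p = softmax(o)
    return -sum(yj * math.log(pj) for yj, pj in zip(y, p))

def grad_analytic(o, y):
    """Gradient from the derivation: dL/do_i = p_i - y_i."""
    p = softmax(o)
    return [pi - yi for pi, yi in zip(p, y)]

def grad_numeric(o, y, eps=1e-6):
    """Central finite-difference approximation of dL/do_i."""
    g = []
    for i in range(len(o)):
        hi = list(o)
        lo = list(o)
        hi[i] += eps
        lo[i] -= eps
        g.append((loss(hi, y) - loss(lo, y)) / (2 * eps))
    return g

o = [0.5, -1.2, 2.0]  # arbitrary example logits (assumed values)
y = [0.0, 1.0, 0.0]   # one-hot target, so sum(y) == 1

g_a = grad_analytic(o, y)
g_n = grad_numeric(o, y)

# Largest disagreement between the analytic and numeric gradients.
grad_err = max(abs(a - b) for a, b in zip(g_a, g_n))
print("max abs difference:", grad_err)
```

Since the formula holds for every index $i$, stacking the entries gives the vector form $\frac{\partial L}{\partial o} = p - y$. Note that the entries of this gradient sum to zero whenever $\sum_k y_k = 1$.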

Alijah Ahmed
  • Ah, yes, I see. And I'm not even tired - no one to blame but me! Thanks for your help, Alijah. – Moos Hueting Sep 25 '14 at 17:30
  • Moos, you are most welcome. Glad to be of help. – Alijah Ahmed Sep 25 '14 at 17:54
  • I am unsure how to get to the last line from the previous line in this answer. Would it be possible to post more information? – FatalMojo Sep 19 '15 at 05:21
  • @FatalMojo I have added an extra line between the last and the penultimate lines, and highlighted some terms in blue. – Alijah Ahmed Sep 19 '15 at 09:10
  • @AlijahAhmed Can I ask how you got the first line, and how it leads to the second line? – aceminer May 25 '17 at 14:20
  • @aceminer For the first line, the $y_k$ do not depend on $o_j$, so they are constants. This leads to $\frac{\partial L}{\partial o_i}=-\sum_k\color{red}{y_k}\frac{\partial \log p_k}{\partial o_i}$. Then, you use the differential identity $\frac{\partial \log{f(x)}}{\partial x}=\frac{1}{f(x)}\frac{\partial f(x)}{\partial x}$, leading to result $-\sum_ky_k\frac{1}{p_k}\frac{\partial p_k}{\partial o_i}$, at the end of the first line. For the second line, we use result $\frac{\partial p_j}{\partial o_i} = p_i(1 - p_i),\quad i = j$ and $\frac{\partial p_j}{\partial o_i} = -p_i p_j,\quad i \neq j$. – Alijah Ahmed Jun 04 '17 at 11:51
  • Awesome question, awesome answer. Now I feel calmness inside, thanks. – MIRMIX Nov 14 '17 at 22:33
  • Can someone explain how this result can be generalized to $\frac{\partial L}{\partial o}$, i.e., the derivative of the loss with respect to the whole vector instead of a single entry of the vector? – amj Nov 28 '19 at 21:10
  • Here is a more stretched-out version for those like me who get confused by indices: https://alexcpn.github.io/html/NN/ml/7_cnn_network/ – Alex Punnen Feb 16 '22 at 12:32