I am trying to wrap my head around back-propagation in a neural network with a Softmax classifier, which uses the Softmax function:
\begin{equation} p_j = \frac{e^{o_j}}{\sum_k e^{o_k}} \end{equation}
This is used in a loss function of the form
\begin{equation}L = -\sum_j y_j \log p_j,\end{equation}
where $o$ is the vector of inputs to the softmax (the logits). I need the derivative of $L$ with respect to $o$. Now, if my derivatives are right,
\begin{equation} \frac{\partial p_j}{\partial o_i} = p_i(1 - p_i),\quad i = j \end{equation}
and
\begin{equation} \frac{\partial p_j}{\partial o_i} = -p_i p_j,\quad i \neq j. \end{equation}
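Before chaining these into $\partial L / \partial o_i$, I sanity-checked the two Jacobian cases numerically. A minimal Python sketch (the helper names are mine, and the test vector is arbitrary):

```python
import math

def softmax(o):
    # shift by the max for numerical stability; does not change p
    m = max(o)
    exps = [math.exp(x - m) for x in o]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_jacobian(o):
    # analytic Jacobian: J[j][i] = p_j*(1 - p_j) if i == j, else -p_i*p_j
    p = softmax(o)
    n = len(o)
    return [[p[j] * (1 - p[j]) if i == j else -p[i] * p[j]
             for i in range(n)] for j in range(n)]

def numeric_jacobian(o, eps=1e-6):
    # central finite differences of each p_j with respect to each o_i
    n = len(o)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        op, om = o[:], o[:]
        op[i] += eps
        om[i] -= eps
        pp, pm = softmax(op), softmax(om)
        for j in range(n):
            J[j][i] = (pp[j] - pm[j]) / (2 * eps)
    return J

o = [1.0, 2.0, 0.5]
J_a, J_n = softmax_jacobian(o), numeric_jacobian(o)
max_err = max(abs(a - b)
              for ra, rb in zip(J_a, J_n)
              for a, b in zip(ra, rb))
print(max_err)  # tiny, so both Jacobian cases check out
```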
Using this result we obtain
\begin{eqnarray} \frac{\partial L}{\partial o_i} &=& - \left (y_i (1 - p_i) + \sum_{k\neq i}-p_k y_k \right )\\ &=&p_i y_i - y_i + \sum_{k\neq i} p_k y_k\\ &=& \left (\sum_i p_i y_i \right ) - y_i \end{eqnarray}
According to the slides I'm using, however, the result should be
\begin{equation} \frac{\partial L}{\partial o_i} = p_i - y_i. \end{equation}
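For what it's worth, a quick finite-difference check does agree with the slides' formula (a minimal Python sketch, assuming a one-hot $y$ so that $\sum_j y_j = 1$; the names are mine), so the problem must be somewhere in my algebra:

```python
import math

def softmax(o):
    # shift by the max for numerical stability; does not change p
    m = max(o)
    exps = [math.exp(x - m) for x in o]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(o, y):
    # L = -sum_j y_j * log(p_j)
    p = softmax(o)
    return -sum(yj * math.log(pj) for yj, pj in zip(y, p))

def numeric_grad(o, y, eps=1e-6):
    # central finite differences of L with respect to each o_i
    g = []
    for i in range(len(o)):
        op, om = o[:], o[:]
        op[i] += eps
        om[i] -= eps
        g.append((cross_entropy(op, y) - cross_entropy(om, y)) / (2 * eps))
    return g

o = [1.0, 2.0, 0.5]
y = [0.0, 1.0, 0.0]  # one-hot target
p = softmax(o)
analytic = [pi - yi for pi, yi in zip(p, y)]  # the slides' formula
numeric = numeric_grad(o, y)
print(max(abs(a - n) for a, n in zip(analytic, numeric)))  # tiny
```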
Can someone please tell me where I'm going wrong?