In linear regression, the loss function is expressed as
$$\frac1N \left\|XW-Y\right\|_{\text{F}}^2$$
where $X, W, Y$ are matrices. Taking the derivative w.r.t. $W$ yields
$$\frac 2N \, X^T(XW-Y)$$
Why is this so?
Let
$$\begin{array}{rl} f (\mathrm W) &:= \| \mathrm X \mathrm W - \mathrm Y \|_{\text{F}}^2 = \mbox{tr} \left( (\mathrm X \mathrm W - \mathrm Y)^{\top} (\mathrm X \mathrm W - \mathrm Y) \right)\\ &\,= \mbox{tr} \left( \mathrm W^{\top} \mathrm X^{\top} \mathrm X \mathrm W - \mathrm Y^{\top} \mathrm X \mathrm W - \mathrm W^{\top} \mathrm X^{\top} \mathrm Y + \mathrm Y^{\top} \mathrm Y \right)\end{array}$$
Differentiating with respect to $\mathrm W$,
$$\nabla_{\mathrm W} f (\mathrm W) = 2 \, \mathrm X^{\top} \mathrm X \mathrm W - 2 \, \mathrm X^{\top} \mathrm Y = \color{blue}{2 \, \mathrm X^{\top} \left( \mathrm X \mathrm W - \mathrm Y \right)}$$
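The differentiation step above can be unpacked term by term using standard trace-derivative identities. The matrices $\mathrm A$ and $\mathrm B$ below are placeholders introduced only for this illustration; they do not appear in the original answer.
$$\nabla_{\mathrm W}\, \mbox{tr} \left( \mathrm W^{\top} \mathrm A \mathrm W \right) = \left( \mathrm A + \mathrm A^{\top} \right) \mathrm W, \qquad \nabla_{\mathrm W}\, \mbox{tr} \left( \mathrm B^{\top} \mathrm W \right) = \nabla_{\mathrm W}\, \mbox{tr} \left( \mathrm W^{\top} \mathrm B \right) = \mathrm B$$
Taking $\mathrm A = \mathrm X^{\top} \mathrm X$ (which is symmetric) gives $2 \, \mathrm X^{\top} \mathrm X \mathrm W$ for the quadratic term, and taking $\mathrm B = \mathrm X^{\top} \mathrm Y$ gives $\mathrm X^{\top} \mathrm Y$ for each of the two cross terms, while $\mbox{tr} \left( \mathrm Y^{\top} \mathrm Y \right)$ does not depend on $\mathrm W$. Multiplying by the constant $\frac 1N$ from the question then yields $\frac 2N \, \mathrm X^{\top} \left( \mathrm X \mathrm W - \mathrm Y \right)$.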
Let $X=(x_{ij})_{ij}$ and similarly for the other matrices. We are trying to differentiate $$ \|XW-Y\|^2=\sum_{i,j}\Big(\sum_{k}x_{ik}w_{kj}-y_{ij}\Big)^2\qquad (\star) $$ with respect to $W$. The result will be a matrix whose $(i,j)$ entry is the derivative of $(\star)$ with respect to the variable $w_{ij}$.
So think of $(i,j)$ as being fixed now, and rename the summation indices in $(\star)$ to avoid a clash. Only the terms coming from column $j$ of $XW-Y$ depend on $w_{ij}$. Taking their derivative with the chain rule gives $$ \frac{\partial\|XW-Y\|^2}{\partial w_{ij}}=\sum_{k}2x_{ki}\Big(\sum_{l}x_{kl}w_{lj}-y_{kj}\Big)=\sum_{k}2x_{ki}(XW-Y)_{kj}=\left[2X^T(XW-Y)\right]_{i,j}. $$
Here is the same process in more detail. Denote $X = [x_{ij}], W = [w_{ij}], Y = [y_{ij}]$. Then $$ \left \| XW - Y \right \|^{2} = \sum_{k, j} \Big(\sum_{l} x_{kl} w_{lj} - y_{kj}\Big)^{2}. $$ This is a scalar, and taking its derivative w.r.t. the matrix $W$ gives a matrix of the same shape as $W$. Fix $i, j$. Only the terms in column $j$ depend on $w_{ij}$, and the inner sum $\sum_{l} x_{kl} w_{lj}$ has derivative $x_{ki}$ with respect to $w_{ij}$, so by the chain rule $$ \begin{aligned} \frac{\partial \left \| XW - Y \right \|^{2}}{\partial w_{ij}} &= \sum_{k} 2x_{ki} \Big(\sum_{l} x_{kl} w_{lj} - y_{kj}\Big)\\ &= \sum_{k} 2x_{ki} (XW - Y)_{kj} \\ &= [2 X^{T} (XW - Y)]_{ij}. \end{aligned} $$ Thus we have $$ \frac{\partial \left \| XW - Y \right \|^{2}}{\partial W} = 2 X^{T} (XW - Y). $$
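A quick numerical sanity check of the entrywise formula, as a minimal sketch: it perturbs one entry $w_{ij}$ at a time and compares the central finite difference with $2X^T(XW-Y)$. The function names, the random test matrices, their shapes, and the step size `eps` are all assumptions made for illustration; nothing here comes from the original answers.

```python
import numpy as np

def loss(X, W, Y):
    """Squared Frobenius norm ||XW - Y||_F^2 (without the 1/N factor)."""
    R = X @ W - Y
    return float(np.sum(R * R))

def analytic_grad(X, W, Y):
    """Closed-form gradient 2 X^T (XW - Y) derived above."""
    return 2.0 * X.T @ (X @ W - Y)

def numeric_grad(X, W, Y, eps=1e-6):
    """Central finite differences, perturbing one entry w_ij at a time."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = eps
            G[i, j] = (loss(X, W + E, Y) - loss(X, W - E, Y)) / (2 * eps)
    return G

# Arbitrary illustrative shapes: X is 5x3, W is 3x2, Y is 5x2.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
W = rng.standard_normal((3, 2))
Y = rng.standard_normal((5, 2))

max_err = np.max(np.abs(analytic_grad(X, W, Y) - numeric_grad(X, W, Y)))
print("largest entrywise discrepancy:", max_err)
```

Because the loss is quadratic in $W$, the central difference has no truncation error, so the printed discrepancy should be at the level of floating-point rounding.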
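Roughly speaking, the $\textbf{Jacobian}$ of $f$ at the point $x$ is the linear map (matrix/tensor) $B$ such that \begin{equation}f(x+\delta)=f(x) + B\delta+ o(\|\delta\|).\end{equation} So, if $$f(W)=\|XW-Y\|_F^2,$$ then \begin{equation} f(W+\delta)=\|X(W+\delta)-Y\|_F^2=\|XW-Y+X\delta\|_F^2=\|XW-Y\|_F^2+2\langle XW-Y,X\delta \rangle +\|X\delta\|_F^2. \end{equation} Here $\langle A,B\rangle=\operatorname{tr}(A^TB)$ is the Frobenius inner product, so $\langle XW-Y,X\delta\rangle=\langle X^T(XW-Y),\delta\rangle$, and the last term satisfies $\|X\delta\|_F^2\le\|X\|_F^2\|\delta\|_F^2=o(\|\delta\|)$. We then have \begin{equation} f(W+\delta)=f(W)+2\langle X^T( XW-Y),\delta \rangle +o(\|\delta\|)= f(W)+2\operatorname{tr}\left(\left(X^T( XW-Y)\right)^T\delta\right) +o(\|\delta\|). \end{equation} So the Jacobian of $f$ is the linear map $\delta\mapsto 2\operatorname{tr}\left(\left(X^T( XW-Y)\right)^T\delta\right)$; when $W$ and $\delta$ are column vectors this is the row vector $2\left(X^T( XW-Y)\right)^T$. The gradient is its transpose, $2X^T(XW-Y)$.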
This Taylor expansion idea is a smart trick to make your life easier while taking derivatives.
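The expansion can also be checked numerically. The sketch below assumes random test matrices and a hypothetical helper `taylor_remainder` (neither is from the original answer). It compares $f(W+\delta)$ with the first-order model $f(W)+2\langle X^T(XW-Y),\delta\rangle$ and prints the leftover next to $\|X\delta\|_F^2$.

```python
import numpy as np

# Arbitrary illustrative shapes: X is 6x4, W is 4x3, Y is 6x3.
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))
W = rng.standard_normal((4, 3))
Y = rng.standard_normal((6, 3))
D = rng.standard_normal(W.shape)  # fixed perturbation direction

def f(W):
    """f(W) = ||XW - Y||_F^2."""
    return np.linalg.norm(X @ W - Y, "fro") ** 2

def taylor_remainder(t):
    """Return (f(W + tD) - first-order model, ||X tD||_F^2)."""
    delta = t * D
    G = 2.0 * X.T @ (X @ W - Y)              # candidate gradient
    first_order = f(W) + np.sum(G * delta)   # f(W) + <G, delta>
    remainder = f(W + delta) - first_order
    return remainder, np.linalg.norm(X @ delta, "fro") ** 2

for t in (1e-1, 1e-2, 1e-3):
    remainder, quad = taylor_remainder(t)
    print(t, remainder, quad)
```

The two printed columns should agree up to rounding, and both shrink like $t^2$. That is the sense in which the remainder is $o(\|\delta\|)$.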