24

I've been thinking about this question for a long time, and I've just encountered it again in the following lemma:

$$f(x) = g(Ax + b) \implies \nabla f(x) = A^T \nabla g(Ax + b) $$

This lemma makes intuitive sense if you think of it as taking $x$ to the point $Ax + b$, calculating the gradient of $g$ there, and then taking the result back to the original space. But why is "taking the result back" realised as $A^T$ and not $A^{-1}$?

By doing the calculations you get $A^T$, no doubt, but I always expect an inverse. In general, when should I expect a transpose and when an inverse? Where are they similar and where do they differ?
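For concreteness, here is a quick numerical sketch (not from the original post; the particular $g$, the dimensions, and the variable names are invented for illustration). It compares a finite-difference gradient of $f(x) = g(Ax+b)$ against $A^T \nabla g(Ax+b)$, with a non-square $A$, so an inverse is not even available:

```python
import numpy as np

# Illustrative sanity check: g and the dimensions are made up. A is 3 x 5,
# so A^{-1} does not exist, yet the A^T formula matches finite differences.
rng = np.random.default_rng(0)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def g(y):                          # arbitrary smooth function R^m -> R
    return np.sin(y).sum() + 0.5 * (y ** 2).sum()

def grad_g(y):                     # its gradient, computed by hand
    return np.cos(y) + y

def f(x):                          # f(x) = g(Ax + b)
    return g(A @ x + b)

x = rng.standard_normal(n)
eps = 1e-6
fd_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])            # central differences

print(np.allclose(fd_grad, A.T @ grad_g(A @ x + b), atol=1e-5))  # True
```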

Rodrigo de Azevedo
LionCoder
  • Think about the 1-dimensional case: if $f(x) = g(ax+b)$, then $f'(x) = g'(ax+b)a$, not $g'(ax+b)/a$. So it can't be the inverse; it must be the transpose. – Toby Bartels May 15 '18 at 18:47
  • Also, the row matrix $\mathrm{D}f = (\nabla{f})^\top$ is more fundamental than the column matrix $\nabla{f}$. So you should be thinking $\mathrm{D}f(x) = \mathrm{D}g(Ax+b)A$ to begin with; then you can explicitly take the transpose of each side, if you must, to get $\nabla{f}(x) = A^\top\nabla{g}(Ax+b)$. But fundamentally, it's not about the transpose *or* the inverse. – Toby Bartels May 15 '18 at 18:52

5 Answers

19

We usually see matrices as linear transformations. The inverse of $A$, when it exists, means simply "reversing" what $A$ does as a function. The transpose originates in a different point of view.

So we have vector spaces $X,Y$, and $A:X\to Y$ is linear. For many reasons we often look at the linear functionals on these spaces; that way we get the dual $$ X^*=\{f:X\to\mathbb R:\ f\ \text{ is linear}\}, $$ and correspondingly $Y^*$. Now the map $A$ induces a natural map $A^*:Y^*\to X^*$, by $$ (A^*g)(x)=g(Ax). $$ In the particular case where $X=\mathbb R^n$, $Y=\mathbb R^m$, one can check that $X^*\cong X$ and $Y^*\cong Y$, in the sense that every linear functional $f:\mathbb R^n\to\mathbb R$ is of the form $f(x)=y^Tx$ for some fixed $y\in\mathbb R^n$. In this situation $A$ is an $m\times n$ matrix, and the matrix of $A^*$ is the transpose of $A$: if $g(y)=w^Ty$, then $(A^*g)(x)=w^TAx=(A^Tw)^Tx$, so $A^*$ sends $w$ to $A^Tw$.
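A tiny numerical illustration of that last statement (my own sketch; the vectors and dimensions are invented): represent a functional $g\in Y^*$ by a vector $w$, pull it back along $A$ via $(A^*g)(x)=g(Ax)$, and check that the pulled-back functional is represented by $A^Tw$.

```python
import numpy as np

# Illustrative only: w represents the functional g(y) = w^T y on R^m. Its
# pullback along A : R^n -> R^m, (A* g)(x) = g(Ax), is represented by A^T w.
rng = np.random.default_rng(1)
m, n = 3, 5
A = rng.standard_normal((m, n))
w = rng.standard_normal(m)
x = rng.standard_normal(n)

pullback_value = w @ (A @ x)                       # (A* g)(x) = g(Ax) = w^T (Ax)
print(np.isclose(pullback_value, (A.T @ w) @ x))   # True: the matrix of A* is A^T
```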

Martin Argerami
  • You're the only one who answered my question. I wasn't asking about how to derive the formula, but everyone tried to derive it for me. You actually gave me insight into where the transpose comes from. Thank you – LionCoder May 15 '18 at 23:25
  • Glad I could help. The whole thing is more visible in more abstract spaces than $\mathbb R^n$. In particular, the adjoint plays a big role when dealing with Hilbert spaces and their operators. And the star in "C$^*$-algebra", for instance, comes from the notation for the adjoint. – Martin Argerami May 15 '18 at 23:28
  • Just to add a little bit: you can then treat $\nabla f(x)$ as a linear functional, whose value $[\nabla f(x)](y)$ at $y$ is the dot product of $y$ and the gradient. In particular, the gradient is the functional for which $x \mapsto f(x_0) + [\nabla f(x_0)](x - x_0)$ is the best affine approximation of $f$ at $x_0$. Then $f(Ax_0) + [\nabla f(Ax_0)](A(x - x_0)) = f(Ax_0) + [A^* \nabla f(Ax_0)](x - x_0)$ is the best affine approximation of $x \mapsto f(Ax)$ at $x_0$. The formula in the question is just the matrix representation of this. – Sasho Nikolov May 16 '18 at 13:20
10

Something weird is going on here. I'm assuming $g: \mathbb R^m \to \mathbb R$ and that $A$ is an $m\times n$ matrix. Let $a: \mathbb R^n \to \mathbb R^m,\ x \mapsto Ax + b$ be the corresponding affine transformation, so that $f = g \circ a$. The chain rule says $Df(x) = Dg(a(x))\, Da(x)$.

The Jacobian realization of $Dg$ is $\nabla g$, a $1\times m$ matrix (a row vector), while the Jacobian of $a$ is $A$, an $m \times n$ matrix. The dimensions all agree: this makes $\nabla f$ a $1\times n$ matrix, which matches the fact that the derivative of $f$ is a linear map $\mathbb R^n \to \mathbb R$.

So what I suspect is happening is some identification of $\mathbb R^n$ with its dual space under the Euclidean inner product; that is, you're realizing the gradient as a column vector instead of a row vector. The transpose is precisely the way this is done. If $T: V \to W$ is a linear transformation, then its adjoint is $T^\dagger: W^* \to V^*$. But under the Euclidean inner product, you can identify $\mathbb R^n \cong (\mathbb R^n)^*$, so $$ (\nabla g(a(x)) A)^T = A^T [\nabla g(a(x))]^T = A^T \nabla g(a(x))$$ where we're abusing notation by identifying the row vector $\nabla g$ with the column vector $\nabla g$. This hidden identification is likely what is confusing you.
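To make the shape bookkeeping concrete, here is a small sketch (the particular $g$ and the names are mine, not part of the answer): the row Jacobian $Dg(a(x))$ is $1\times m$, multiplying by $A$ gives a $1\times n$ row for $Df(x)$, and transposing that row reproduces $A^T\nabla g(a(x))$ as a column.

```python
import numpy as np

# Shape check for Df(x) = Dg(a(x)) A, with an invented g(y) = sum(sin y) + ||y||^2 / 2.
rng = np.random.default_rng(2)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

y = A @ x + b
Dg_row = (np.cos(y) + y).reshape(1, m)   # 1 x m row Jacobian of g at a(x)
Df_row = Dg_row @ A                      # 1 x n row Jacobian of f at x

print(Df_row.shape)                               # (1, 5)
print(np.allclose(Df_row.T, A.T @ Dg_row.T))      # True: the column form is A^T grad g
```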

S.Micheals
  • I really wish that we could just teach that the gradient is a row vector in multivariable calculus from the beginning. This confusion is just not necessary, and it makes learning about the Jacobian (which is basically a "column vector of gradients") harder than it needs to be. I understand why this isn't practical, though. – Ian May 15 '18 at 15:38
  • @Ian I got yelled at once by a very senior faculty member for defining the derivative of the map $g: \mathbb R^n \to \mathbb R$ as a row vector, rather than distinguishing the linear transformation from its Jacobian representation. This was in a non-advanced course :S – S.Micheals May 15 '18 at 15:43
  • @Ian Just curious. Why isn't that practical? – user1551 May 15 '18 at 22:46
  • @user1551 There's no way to require all multivariable students to have enough familiarity with linear algebra to really understand the difference between row vectors and column vectors. – Ian May 16 '18 at 01:23
  • @Ian I wish we could stop talking about _row or column vectors entirely_, and instead start from abstract vector spaces and their dual spaces. – leftaroundabout May 16 '18 at 12:00
  • @leftaroundabout This is ideal, but dealing with higher order derivatives becomes a bit nasty don't you think? If $\mathcal L^n(V,W)$ are the $n$-linear maps from $V$ to $W$, then even the second derivative requires that students are familiar with the isomorphism $\mathcal L^1(V, \mathcal L^1(V,W)) \cong \mathcal L^2(V,W)$, and that's probably asking a bit much when most students have little linear algebra, let alone multilinear algebra. – S.Micheals May 16 '18 at 17:44
8

Notice that, by the chain rule, $$D_pf(v)=\langle\nabla g(Ap+b),Av\rangle=\langle A^T\nabla g(Ap+b),v\rangle.$$ Now compare with $D_pf(v)=\langle\nabla f(p),v\rangle$ to read off $\nabla f(p)=A^T\nabla g(Ap+b)$.

Michael Hoppe
4

Here you are not "taking the result back to the original space"; you are chaining transformations.

If you think of a linear transformation applied to a vector, it is a bunch of dot products of the rows of the matrix with the column vector, and

$$\vec x\cdot\vec y\equiv x^Ty.$$

4

Taking the directional derivative of $f (\mathrm x) := g (\mathrm A \mathrm x + \mathrm b)$ in the direction of $\rm v$ at $\rm x$,

$$\lim_{h \to 0} \frac{f (\mathrm x + h \mathrm v) - f (\mathrm x)}{h} = \langle \nabla g (\mathrm A \mathrm x + \mathrm b), \mathrm A \mathrm v \rangle = \langle \mathrm A \mathrm v, \nabla g (\mathrm A \mathrm x + \mathrm b) \rangle = \langle \mathrm v, \mathrm A^\top \nabla g (\mathrm A \mathrm x + \mathrm b) \rangle$$

and, thus, the gradient of $f$ is

$$\nabla f (\mathrm x) = \mathrm A^\top \nabla g (\mathrm A \mathrm x + \mathrm b)$$
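The same computation can be checked numerically (a sketch of my own, with an arbitrary smooth $g$ and invented data): the symmetric difference quotient along a direction $\mathrm v$ should agree with $\langle \mathrm v, \mathrm A^\top \nabla g(\mathrm A\mathrm x + \mathrm b)\rangle$.

```python
import numpy as np

# Directional-derivative check; g, A, b, x, v are all invented for illustration.
rng = np.random.default_rng(3)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
v = rng.standard_normal(n)

g = lambda y: np.sin(y).sum() + 0.5 * (y ** 2).sum()
grad_g = lambda y: np.cos(y) + y          # gradient of g, computed by hand
f = lambda z: g(A @ z + b)

h = 1e-6
quotient = (f(x + h * v) - f(x - h * v)) / (2 * h)   # symmetrized limit quotient
print(np.isclose(quotient, v @ (A.T @ grad_g(A @ x + b)), atol=1e-5))  # True
```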

Rodrigo de Azevedo