46

Given a matrix $A$ and a column vector $x$, what is the derivative of $Ax$ with respect to $x^T$, i.e. $\frac{d(Ax)}{d(x^T)}$, where $x^T$ is the transpose of $x$?

Side note - my goal is to get the known derivative formula $\frac{d(x^TAx)}{dx} = x^T(A^T + A)$ from the above rule and the chain rule.

Asaf R

4 Answers

50

Let $f(x) = x^TAx$ and you want to evaluate $\frac{df(x)}{dx}$. This is nothing but the gradient of $f(x)$.

There are two ways to represent the gradient: as a row vector or as a column vector. From what you have written, your representation of the gradient is as a row vector.

First make sure to get the dimensions of all the vectors and matrices in place.

Here $x \in \mathbb{R}^{n \times 1}$, $A \in \mathbb{R}^{n \times n}$, and $f(x) \in \mathbb{R}$.

This will help you to make sure that your arithmetic operations are performed on vectors of appropriate dimensions.

Now let's move on to the differentiation.

All you need to know is the following rule for vector differentiation.

$$\frac{d(x^Ta)}{dx} = \frac{d(a^Tx)}{dx} = a^T$$ where $x,a \in \mathbb{R}^{n \times 1}$.

Note that $x^Ta = a^Tx$ since it is a scalar, and the rule above is easy to derive.

(Some people follow a different convention, i.e. treating the derivative as a column vector instead of a row vector. Stick to one convention and you will reach the same conclusion.)

Make use of the above rule to get:

$$\frac{d(x^TAx)}{dx} = x^T A^T + x^T A$$ Use the product rule to get the above result: first treat $Ax$ as a constant (which gives $(Ax)^T = x^TA^T$), then treat $x^TA$ as a constant (which gives $x^TA$).

So, $$\frac{df(x)}{dx} = x^T(A^T + A)$$
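As a quick numerical sanity check of this formula (a minimal sketch, assuming NumPy; the matrix $A$ and point $x$ below are just random examples, not from the question):

```python
import numpy as np

np.random.seed(0)
n = 4
A = np.random.randn(n, n)   # arbitrary, not necessarily symmetric
x = np.random.randn(n)

f = lambda v: v @ A @ v     # f(x) = x^T A x

# Central finite differences for each partial derivative df/dx_i
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

grad_formula = x @ (A.T + A)   # the row vector x^T (A^T + A)

print(np.allclose(grad_fd, grad_formula))  # expected: True
```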

Pro Q
  • 2
    I think you've got an x too much at the end of your last line (and a superfluous pair of parentheses). Other than that, I agree with everything, but it's not an answer to the question as I understand it. – joriki Feb 06 '11 at 18:57
  • @joriki: Thanks for pointing that out. –  Feb 06 '11 at 19:05
  • 1
    I think my problem was failing to notice that Ax is a vector. How small errors lead to a big waste of time... Thank you for the detailed answer. – Asaf R Feb 08 '11 at 00:32
  • 9
    Is this really called the chain rule? I've always called this the product rule. $\frac{d(u(x)\cdot v(x))}{dx} = \frac{du}{dx}(x)v(x)+\frac{dv}{dx}(x)u(x)$ (And the chain rule would be that $\frac{d(u(v(x)))}{dx} = \frac{du}{dx}\left(v(x)\right)\cdot \frac{dv}{dx}\left(x\right)$) – lucidbrot Feb 23 '18 at 07:19
  • @lucidbrot: You're totally right. It's the **product rule**. – Catbuilts Feb 05 '21 at 06:07
8

I think there is no such thing. $\mbox{d}(x^\mbox{T}Ax)/\mbox{d}x$ is something that, when multiplied by the change $\mbox{d}x$ in $x$, yields the change $\mbox{d}(x^\mbox{T}Ax)$ in $x^\mbox{T}Ax$. Such a thing exists and is given by the formula you quote. $\mbox{d}(Ax)/\mbox{d}(x^\mbox{T})$ would have to be something that, when multiplied by the change $\mbox{d}x^\mbox{T}$ in $x^\mbox{T}$, yields the change $\mbox{d}Ax$ in $Ax$. No such thing exists, since $x^\mbox{T}$ is a $1 \times n$ row vector and $Ax$ is an $n \times 1$ column vector.

If your main goal is to derive the derivative formula, here's a derivation:

$(x^\mbox{T} + \mbox{d}x^\mbox{T})A(x + \mbox{d}x) = x^\mbox{T}Ax + \mbox{d}x^\mbox{T}Ax + x^\mbox{T}A\mbox{d}x + \mbox{d}x^\mbox{T}A\mbox{d}x =$

$=x^\mbox{T}Ax + x^\mbox{T}A^\mbox{T}\mbox{d}x + x^\mbox{T}A\mbox{d}x + O (\lVert \mbox{d}x \rVert^2) = x^\mbox{T}Ax + x^\mbox{T}(A^\mbox{T} + A)\mbox{d}x + O (\lVert \mbox{d}x \rVert^2)$

(Here $\mbox{d}x^\mbox{T}Ax = x^\mbox{T}A^\mbox{T}\mbox{d}x$ because a scalar equals its own transpose.)
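A small numerical illustration of this expansion (a sketch assuming NumPy; $A$, $x$ and the direction are arbitrary random examples): after subtracting the linear term $x^\mbox{T}(A^\mbox{T}+A)\mbox{d}x$, the remainder shrinks quadratically as $\mbox{d}x$ shrinks.

```python
import numpy as np

np.random.seed(1)
n = 4
A = np.random.randn(n, n)
x = np.random.randn(n)
v = np.random.randn(n)              # a fixed direction for dx

f = lambda z: z @ A @ z

for t in [1e-1, 1e-2, 1e-3]:        # shrink the perturbation dx = t*v
    dx = t * v
    remainder = f(x + dx) - f(x) - x @ (A.T + A) @ dx
    print(t, remainder)             # remainder = dx^T A dx, scales like t**2
```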

joriki
  • @joriki : It is strong to say that there is no such thing. Usually this is interpreted as the gradient of a function. –  Feb 06 '11 at 18:23
  • I think one of us misunderstood the question. In your answer, you answered the question what df/dx = d(x^T A x)/dx is. Yes, this exists and is the gradient of f. But if I understand the question correctly, it asked what d(Ax)/d(x^T) is, where A is a matrix. I still think that no such thing exists. – joriki Feb 06 '11 at 18:54
  • @joriki: The interpretation of $\frac{d(Ax)}{dx^T}$ is the Hessian right? which in this case is nothing but $A$. –  Feb 06 '11 at 19:10
  • One way to think of this is as the gradient of the vector function, where each row corresponds to the gradient of each component of this vector function. –  Feb 06 '11 at 19:19
  • 2
    I don't think so. $\mbox{d}(Ax)/\mbox{d}x$ is $A$, and I don't see how it could make sense to have $\mbox{d}(Ax)/\mbox{d}x = \mbox{d}(Ax)/\mbox{d}x^\mbox{T}$. See http://en.wikipedia.org/wiki/Matrix_derivative#Derivative_of_linear_functions. All formulas there are well-defined in the sense I discuss in my answer, and they have $\mbox{d}(Ax)/\mbox{d}x$ and $\mbox{d}(x^\mbox{T}A)/\mbox{d}x^\mbox{T}$, but not $\mbox{d}(Ax)/\mbox{d}x^\mbox{T}$. – joriki Feb 06 '11 at 19:19
  • It is more of a question of notation. If I define $df/dx^T$ as the column vector then these things will interchange. To argue along the lines of your answer, you have $(x^\mbox{T} + \mbox{d}x^\mbox{T})A(x + \mbox{d}x) = x^\mbox{T}Ax + \mbox{d}x^\mbox{T}Ax + x^\mbox{T}A\mbox{d}x + \mbox{d}x^\mbox{T}A\mbox{d}x = x^\mbox{T}Ax + \mbox{d}x^\mbox{T}Ax + \mbox{d}x^\mbox{T}A^\mbox{T}x + \mbox{d}x^\mbox{T}A\mbox{d}x = x^TAx + dx^TAx + dx^TA^Tx + \mathcal{O}(||dx||^2)$. So if we define $df/dx^T$ as a column vector then $d(x^TAx)/dx^T = (A+A^T)x$ –  Feb 06 '11 at 19:37
  • It is certainly a question of notation, and notation can always be defined as one likes, but two arguments in favour of my notation are a) Wikipedia uses it and b) it seems very desirable that quite generally $\mbox{d}\alpha/\mbox{d}\beta \cdot \mbox{d}\beta = \mbox{d}\alpha$, and the notation for vector and matrix derivatives shouldn't violate that principle without good reason. Certainly, $\mbox{d}(x^\mbox{T}Ax)/\mbox{d}x^\mbox{T} = (A + A^\mbox{T})x$, as you wrote, but that is entirely consistent with my arguments and doesn't imply that $\mbox{d}(Ax)/\mbox{d}x^\mbox{T}$ is well-defined. – joriki Feb 06 '11 at 19:43
  • To be clear if $x$ is a column vector and $f$ is a scalar then, $df = \text{grad} \times dx$ in this case where grad is a row vector. If $df = dx^T \times \text{grad}$, then grad is a column vector. If $x$ is a row vector and $f$ is a scalar then, $df = \text{grad} \times dx^T$ in this case where grad is a row vector. If $df = dx \times \text{grad}$, then grad is a column vector. –  Feb 06 '11 at 19:44
  • Yes, you can decide which side of the multiplication you want to put the derivative on, but the problem with $\mbox{d}(Ax)/\mbox{d}x^\mbox{T}$ goes beyond that -- this doesn't make sense no matter what order of multiplication you choose, since $x^\mbox{T}$ is a $1\times n$ row vector and $Ax$ is an $n\times1$ column vector -- there's nothing you can multiply with on either side that will turn the one into the other. – joriki Feb 06 '11 at 19:50
  • http://matrixcookbook.com/ Equations 89 and 90 of the matrix cookbook seem to clarify this. –  Feb 06 '11 at 19:51
  • Interesting, but I don't see how it clarifies this :-) Equation 89 uses different notation from the one thing that we both agreed on, namely $\mbox{d}f/\mbox{d}x = x^\mbox{T}(A^\mbox{T} + A)$ -- they use the transpose, which is OK if you define the notation that way -- but both (89) and (90) define things that I believe exist, even if I'd write them the other way around; nothing I can find in the cookbook indicates that it makes sense to write $\mbox{d}(Ax)/\mbox{d}x^\mbox{T}$. – joriki Feb 06 '11 at 20:22
  • I understand the issue you have. So essentially our argument boils down, in some sense, to what is $dx/dx$ and $dx/dx^T$, when $dx$ is a column vector. My definition is $dx/dx = 1$ i.e. $dx = dx$ and $dx/dx^T = I$ i.e. $dx = I (dx^T)^T$. Will that take care of the issues? –  Feb 06 '11 at 20:54
6

Mathematicians kill each other about derivatives and gradients. Do not be surprised if students do not understand one word about this subject. Much of the preceding havoc is partly caused by the Matrix Cookbook, a book that should be blacklisted. Everyone has their own definition. $\dfrac{d(f(x))}{dx}$ means either a derivative or a gradient (scandalous). We could write $D_xf$ for the derivative and $\nabla _xf$ for the gradient. The derivative is a linear map and the gradient is a vector if we accept the following definition: let $f:E\rightarrow \mathbb{R}$ where $E$ is a Euclidean space. Then, for every $h\in E$, $D_xf(h)=<\nabla_x(f),h>$. In particular $x\rightarrow x^TAx$ has a gradient but $x\rightarrow Ax$ does not! Using the previous definitions, one has (up to unintentional mistakes):

Let $f:x\rightarrow Ax$ where $A\in M_n$; then $D_xf=A$ (no problem). On the other hand, $x\rightarrow x^T$ is a bijection (a simple change of variable!); then we can give meaning to the derivative of $Ax$ with respect to $x^T$: consider the function $g:x^T\rightarrow A(x^T)^T$; the required function is $D_{x^T}g:h^T\rightarrow Ah$ where $h$ is a vector; note that $D_{x^T}g$ is a constant. EDIT: if we choose the bases $e_1^T,\cdots,e_n^T$ and $e_1,\cdots,e_n$ (the second one is the canonical basis), then the matrix associated with $D_{x^T}g$ is $A$ again.

Let $\phi:x\rightarrow x^TAx$; $D_x\phi:h\rightarrow h^TAx+x^TAh=x^T(A+A^T)h$. Moreover $<\nabla_x(\phi),h>=x^T(A+A^T)h$, that is, ${\nabla_x(\phi)}^Th=x^T(A+A^T)h$. By identification, $\nabla_x(\phi)=(A+A^T)x$, a vector (formula (89) in the detestable Matrix Cookbook!); in particular, the solution above, $x^T(A+A^T)$, is not a vector!
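To illustrate the distinction numerically (a minimal sketch assuming NumPy, with arbitrary random $A$, $x$, $h$): the derivative $D_x\phi$ is a linear map that eats a direction $h$ and returns a scalar, and that scalar agrees with the inner product of the gradient $(A+A^T)x$ with $h$.

```python
import numpy as np

np.random.seed(2)
n = 4
A = np.random.randn(n, n)
x = np.random.randn(n)
h = np.random.randn(n)

# The derivative D_x(phi) applied to the direction h: a scalar
D_phi_of_h = h @ A @ x + x @ A @ h        # h^T A x + x^T A h

# The gradient is a vector, paired with h through the inner product
grad_phi = (A + A.T) @ x
print(np.isclose(D_phi_of_h, grad_phi @ h))  # expected: True
```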

  • 1
    Can you recommend a nice textbook explaining the difference between $D$ and $\nabla$? – becko May 26 '19 at 23:02
3

As Sivaram points out, you must define your convention about row/column derivatives and just be consistent.

For example, you could define the derivative of a column vector with respect to a row vector (assuming the letters represent column vectors) as the matrix:

$\displaystyle \frac{d(y)}{dx^T} = D$ with $d_{i,j} = \frac{d(y_i)}{dx_j}$

And that will work (it will be consistent). For example, you get $\displaystyle \frac{d(Ax)}{dx^T} = A$

But it's not so simple to apply this - and the product rule of differentiation - to deduce your identity, because you get two different derivatives: a row with respect to a row and a column with respect to a row, and you can't (at least without further justification) mix them.

Of course, if the matrix is symmetric, everything is simpler.
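As an illustration of this convention (a small sketch assuming NumPy; the random $A$ and $x$ are just example data): assembling $D$ entrywise, $d_{i,j} = \frac{d(y_i)}{dx_j}$, via finite differences for $y = Ax$ recovers $A$.

```python
import numpy as np

np.random.seed(3)
n = 4
A = np.random.randn(n, n)
x = np.random.randn(n)

y = lambda v: A @ v      # y(x) = Ax, a column vector
eps = 1e-6

# d_{i,j} = d(y_i)/d(x_j), assembled column by column via finite differences
D = np.column_stack([(y(x + eps * e) - y(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])

print(np.allclose(D, A))  # expected: True, i.e. d(Ax)/dx^T = A in this convention
```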

leonbloy
  • OK, to make my point of view more precise, I should say: You are both right that if you want to define the notation this way you can do it; but in the notation that Asaf himself used for $\mbox{d}(x^\mbox{T}Ax)/\mbox{d}x$, and that is used on Wikipedia, it doesn't make sense to write $\mbox{d}(Ax)/\mbox{d}x^\mbox{T}$. Can we agree on that? – joriki Feb 06 '11 at 20:28