31

For the quadratic form $X^TAX$, with $X\in\mathbb{R}^n$ and $A\in\mathbb{R}^{n \times n}$ (which expands to $\sum_{i=1}^n\sum_{j=1}^nA_{ij}x_ix_j$), I tried to take the derivative w.r.t. $X$ ($\nabla_X\, X^TAX$) and ended up with the following:

The $k^{th}$ element of the derivative can be represented as

$\nabla_{X_k}X^TAX=\Big[\sum_{i=1}^n(A_{ik}x_k+A_{ki})x_i\Big] + A_{kk}x_k(1-x_k)$

Does this result look right? Is there an alternative form?
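(A formula like this can be sanity-checked against finite differences on random data before trusting it; below is a minimal sketch of such a check, assuming NumPy. The helper names are mine.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def candidate_grad(A, x):
    """k-th component, transcribed from the formula above (hypothetical helper)."""
    n = len(x)
    g = np.empty(n)
    for k in range(n):
        g[k] = sum((A[i, k] * x[k] + A[k, i]) * x[i] for i in range(n)) \
               + A[k, k] * x[k] * (1 - x[k])
    return g

def fd_grad(A, x, eps=1e-6):
    """Central finite differences of Q(x) = x^T A x."""
    Q = lambda v: v @ A @ v
    g = np.empty(len(x))
    for k in range(len(x)):
        e = np.zeros(len(x))
        e[k] = eps
        g[k] = (Q(x + e) - Q(x - e)) / (2 * eps)
    return g

# If the two disagree on random data, the candidate formula is wrong.
print(candidate_grad(A, x))
print(fd_grad(A, x))
```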

I'm trying to get to the $\mu_0$ of Gaussian Discriminant Analysis by maximizing the log-likelihood, and I need to take the derivative of a quadratic form. Either the result I mentioned above is wrong (it shouldn't be, because I went over my arithmetic several times) or the form I arrived at above is not terribly useful for my problem (because I'm unable to proceed).

I can give more details about the problem or the steps I took to arrive at the above result, but I didn't want to clutter the question to start with. Please let me know if more details are necessary.

Any link to related material is also much appreciated.

Rodrigo de Azevedo
Praveen

5 Answers

57

Let $Q(x) = x^T A x$. Then, expanding $Q(x+h)-Q(x)$ and dropping the higher-order term, we get $DQ(x)(h) = x^TAh+h^TAx = x^TAh+x^TA^Th = x^T(A+A^T)h$, or, as it is more typically written, $\frac{\partial Q(x)}{\partial x} = x^T(A+A^T)$.

Notice that the derivative with respect to a column vector is a row vector!
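A quick numerical illustration of this expansion (a minimal sketch, assuming NumPy; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

Q = lambda v: v @ A @ v            # Q(v) = v^T A v
grad = x @ (A + A.T)               # the row vector x^T (A + A^T)

h = 1e-6 * rng.standard_normal(n)  # small perturbation
# Q(x+h) - Q(x) agrees with the linear term DQ(x)(h) up to O(||h||^2)
print(Q(x + h) - Q(x))
print(grad @ h)
```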

user541686
copper.hat
  • Could you comment on the difference expansion, please? – user191919 Feb 06 '14 at 21:00
  • 1
    What do you mean? Just compute $Q(x+h)-Q(x)$ explicitly. The only term missing above is $h^T A h$, and we have $|h^T A h| \le \|A\| \|h \|^2$, so the term is $O(\|h\|^2)$. – copper.hat Feb 06 '14 at 21:15
  • Can't see how $(x+h)^T A (x+h)$ would be obvious; I was hoping to avoid opening the matrix. Any hint? – user191919 Feb 07 '14 at 02:32
  • I still don't understand what you are asking. Computing the derivative is much like computing the derivative of $x \mapsto x^2$ from first principles. I don't understand what you mean by 'opening the matrix'. – copper.hat Feb 07 '14 at 03:11
  • 1
    I don't see how I can expand $(x+h)^T A (x+h)$ so trivially. I mean literally, why $(x+h)^T A (x+h) = x^T A x + h^TAx+x^TAh + h^T A h$, and how can you see that so quickly? It just looks like a messy summation to me. – user191919 Feb 07 '14 at 10:26
  • 4
    There is no need to explicitly compute the sums. Matrix multiplication is associative and distributive, so we can treat them like 'numbers' in this regard. We have $A(x+h) = Ax + Ah$, $(x+h)^TA = (x^T +h^T) A = x^TA + h^T A$, etc. – copper.hat Feb 07 '14 at 16:22
  • How did you get $x^TA^Th$ from $h^TAx$? – tmaric Jul 07 '16 at 10:08
  • In general, $(AB)^T=B^T A^T$. – copper.hat Jul 07 '16 at 13:26
  • 1
    And for a scalar $x^T = x$. – copper.hat Jul 07 '16 at 13:27
  • @copper.hat: thanks! – tmaric Jul 11 '16 at 09:27
  • 1
    "the derivative with respect to a column vector is a row vector!" --this is of course assuming you're using [numerator layout](https://en.wikipedia.org/wiki/Matrix_calculus#Layout_conventions) – Yibo Yang Oct 14 '16 at 23:52
  • @YiboYang: I think the above notation is fairly standard, I believe. – copper.hat Oct 14 '16 at 23:56
  • @copper.hat, thank you for your answers all over math stack exchange - I have used many of them. Now, I am wondering, how can you say that $\vert h^TAh \vert \le \vert\vert A \vert\vert \vert\vert h \vert\vert^2$? Is it somehow related to Cauchy Schwarz? I tried to derive it from Cauchy Schwarz but was unable to, because the L2-norm of a matrix is the largest singular value, and this threw me off a bit. Thanks again for your answers on stackexchange. Furthermore, how is the fact that it is of order $\vert h \vert^2$ enough to justify removing it? – makansij Sep 12 '17 at 15:12
  • 1
    @Sother: Cauchy Schwarz gives $|\langle h, Ah \rangle | \le \|h\| \|Ah\|$ and (if we use the Euclidean norm) we have $\|Ah\| \le \|A\| \|h\|$. – copper.hat Sep 12 '17 at 15:16
  • Thanks. Also, I do not know if you saw my additional question - I edited it later. How is the fact that it is of order $\vert h \vert^2$ enough to justify removing it? – makansij Sep 13 '17 at 03:52
  • Also, what you are referring to when you say $\vert\vert A h \vert\vert \le \vert\vert A \vert\vert \, \vert\vert h\vert\vert$, you actually do not need it to be the Euclidean norm. *Hölder's Inequality* allows that property to hold true for **any** norm! I just did not know that you could mix matrices and vectors when using Cauchy Schwarz, such as we have done here with matrix $A$ and vector $h$. – makansij Sep 13 '17 at 04:22
  • 1
    @Sother: The expression $\|Ax\| \le \|A\| \|x\|$ works for induced norms. However, it is irrelevant here in that it is always the case that $\|Ax\| \le K \|x\|$ for some $K$. – copper.hat Sep 13 '17 at 06:59
  • What is $K$ ? Is it a member of the set of real numbers? What set is it a member of? – makansij Sep 13 '17 at 15:48
  • 1
    It is a real constant. If the norm is induced it would be the norm of A. – copper.hat Sep 13 '17 at 15:59
  • 1
    I just learned a new trick for when your independent variable appears in more than one place within your formula: introduce a new (fake) parameter which will then disappear: $$\frac{\partial}{\partial x} y^TAx = [Ax]^T\frac{\partial y}{\partial x}+y^TA $$ The transpose was to make the vector a row vector. Nothing deep there! Now, if $y=x$ then $$ \frac{d}{dx} x^TAx = x^TA^T+x^TA = x^T(A+A^T) \ . $$ – Behnam Esmayli Sep 18 '17 at 21:38
  • @copper.hat Could you please elaborate on how $\frac{x^T(A + A^T)h}{h} = x^T(A + A^T)$? I understand we are canceling common terms ($h$) from the numerator and the denominator, however, what rules dictate such an operation to be legal in the general case? – Ankur Roy Chowdhury Mar 16 '19 at 20:33
  • 2
    @AnkurRoyChowdhury: I don't understand your question. $h$ is a vector, you can't divide by $h$. I didn't divide by $h$ anywhere. Note that the map $DQ(x)$ is a map $\mathbb{R}^n \to \mathbb{R}^n$, which can be represented by a matrix ${\partial Q(x) \over \partial x}$. In other words, $DQ(x)h = {\partial Q(x) \over \partial x}h$. – copper.hat Mar 16 '19 at 20:39
  • Why is $||Ax|| \le ||A||\,||x||$ true? Or, since $||Ax||$ is finite, $||Ax|| \le K||x||$ for some $K \ge 0$. Thus why is the smallest $K$ equal to $||A||$ (I guess $||A||$ is the smallest $K$)? – Spaceship222 Sep 06 '19 at 03:16
  • The stated solution is for numerator notation of vectors, but it seems arbitrary. How would one find the solution using denominator notation (derivative wrt a column vector is a column vector)? In that case would it be that $h^T DQ(x) = h^T \frac{\partial Q(x)}{\partial x}$ ? – DarkLink Oct 15 '21 at 13:13
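The operator-norm bound used in the comments above, $|h^TAh| \le \|A\|\,\|h\|^2$ with $\|A\|$ the induced 2-norm, can also be checked numerically; a minimal sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
spec = np.linalg.norm(A, 2)  # induced 2- (spectral) norm of A

for _ in range(1000):
    h = rng.standard_normal(n)
    # |h^T A h| <= ||A|| ||h||^2 (small tolerance for floating point)
    assert abs(h @ A @ h) <= spec * (h @ h) + 1e-9
```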
8

You could also take the derivative of the scalar sum. \begin{equation} \begin{aligned} {\bf x^TAx} = \sum\limits_{j=1}^{n}x_j\sum\limits_{i=1}^{n}x_iA_{ji} \end{aligned} \end{equation} The derivative with respect to the $k$-th variable is then (product rule): \begin{equation} \begin{aligned} \frac{d {\bf x^TAx}}{d x_k} & = \sum\limits_{j=1}^{n}\frac{dx_j}{dx_k}\sum\limits_{i=1}^{n}x_iA_{ji} + \sum\limits_{j=1}^{n}x_j\sum\limits_{i=1}^{n} \frac{dx_i}{dx_k}A_{ji} \\ & = \sum\limits_{i=1}^{n}x_iA_{ki} + \sum\limits_{j=1}^{n}x_jA_{jk} \end{aligned} \end{equation}

If then you arrange these derivatives into a column vector, you get: \begin{equation} \begin{aligned} \begin{bmatrix} \sum\limits_{i=1}^{n}x_iA_{1i} + \sum\limits_{j=1}^{n}x_jA_{j1} \\ \sum\limits_{i=1}^{n}x_iA_{2i} + \sum\limits_{j=1}^{n}x_jA_{j2} \\ \vdots \\ \sum\limits_{i=1}^{n}x_iA_{ni} + \sum\limits_{j=1}^{n}x_jA_{jn} \\ \end{bmatrix} = {\bf Ax} + ({\bf x}^T{\bf A})^T = ({\bf A} + {\bf A}^T){\bf x} \end{aligned} \end{equation}

or if you choose to arrange them in a row, then you get: \begin{equation} \begin{aligned} \begin{bmatrix} \sum\limits_{i=1}^{n}x_iA_{1i} + \sum\limits_{j=1}^{n}x_jA_{j1} & \sum\limits_{i=1}^{n}x_iA_{2i} + \sum\limits_{j=1}^{n}x_jA_{j2} & \dots & \sum\limits_{i=1}^{n}x_iA_{ni} + \sum\limits_{j=1}^{n}x_jA_{jn} \end{bmatrix} \\ = ({\bf Ax} + ({\bf x}^T{\bf A})^T)^T = (({\bf A} + {\bf A}^T){\bf x})^T = {\bf x}^T({\bf A} + {\bf A}^T) \end{aligned} \end{equation}
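The componentwise sums above translate directly into code; a minimal sketch, assuming NumPy, that compares the loop form against $({\bf A} + {\bf A}^T){\bf x}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# k-th derivative: sum_i x_i A[k,i] + sum_j x_j A[j,k]
g = np.array([A[k, :] @ x + x @ A[:, k] for k in range(n)])

print(np.allclose(g, (A + A.T) @ x))  # expected: True
```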


wos
6

It is easier using index notation with the Einstein convention (repeated dummy indices are summed). That is, we can write the $i$th component of $Ax$ as $a_{ij} x_j$, and $f({\bf x}) = x^T A x = x_i a_{ij} x_j = a_{ij} x_i x_j$. Then take the derivative of $f({\bf x})$ with respect to a component $x_k$. We find \begin{eqnarray} \partial f/\partial x_k = f,_k = a_{ij} x_{i,k} x_j + a_{ij} x_i x_{j,k} = a_{ij} \delta_{ik} x_j + a_{ij} x_i \delta_{jk} = a_{kj} x_j + a_{ik} x_i, \end{eqnarray} which in matrix notation is the $k$th component of ${\bf{x}}^T A + {\bf{x}}^T A^T$.
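The index gymnastics map directly onto `np.einsum`; a minimal sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = np.einsum('ij,i,j->', A, x, x)  # a_ij x_i x_j = x^T A x
# df/dx_k = a_kj x_j + a_ik x_i
grad = np.einsum('kj,j->k', A, x) + np.einsum('ik,i->k', A, x)

print(np.isclose(f, x @ A @ x))          # expected: True
print(np.allclose(grad, (A + A.T) @ x))  # expected: True
```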

Herman Jaramillo
4

Yet another approach using the Frobenius product notation.

For a column vector $x \in \mathbb{R}^n$, and a matrix $A \in \mathbb{R}^{n \times n}$ we can write:

$$ x^TAx = \operatorname{Tr}(x^TAx) = x:Ax, $$ where $B:C = \operatorname{Tr}(B^TC)$ denotes the Frobenius product.

Then we take the differential and derivative as

\begin{align} d(x:Ax) & = dx:Ax + x:Adx\\ & = Ax:dx + A^Tx:dx\\ & = (Ax + A^Tx):dx\\ \frac{\partial (x^TAx)}{\partial x} &= (Ax + A^Tx) = (A + A^T)x \end{align}
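The trace identity behind the Frobenius product can be verified numerically as well; a minimal sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# Frobenius product B : C = Tr(B^T C); for column vectors, x : Ax = x^T A x
frob = np.trace(np.outer(x, A @ x))  # Tr(x (Ax)^T) = x . (Ax), by cyclicity
print(np.isclose(frob, x @ A @ x))   # expected: True
```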

MathLearner
1

I just learned a new trick for when your independent variable appears in more than one place within your formula: introduce a new (fake) parameter which will then disappear:

$$\frac{\partial}{\partial x} y^TAx = [Ax]^T\frac{\partial y}{\partial x}+y^TA $$ The transpose was to make the vector a row vector. Nothing deep there!

Now, if $y=x$ then $$ \frac{d}{dx} x^TAx = x^TA^T+x^TA = x^T(A+A^T) \ . $$

Behnam Esmayli