
It's stated that the gradient of:

$$\frac{1}{2}x^TAx - b^Tx +c$$

is

$$\frac{1}{2}A^Tx + \frac{1}{2}Ax - b$$

How do you grind out this equation? Or specifically, how do you get from $x^TAx$ to $A^Tx + Ax$?

victor (edited by Rodrigo de Azevedo)

5 Answers


The only thing you need to remember/know is that $$\dfrac{\partial (x^Ty)}{\partial x} = y$$ and the chain rule, which here takes the form $$\dfrac{d\,f(x,y)}{d x} = \dfrac{\partial f(x,y)}{\partial x} + \dfrac{d\, y(x)^T}{d x} \dfrac{\partial f(x,y)}{\partial y}$$ where $y$ is itself a function of $x$. Hence, $$\dfrac{d(b^Tx)}{d x} = \dfrac{d (x^Tb)}{d x} = b$$

$$\dfrac{d (x^TAx)}{d x} = \dfrac{\partial (x^Ty)}{\partial x} + \dfrac{d (y(x)^T)}{d x} \dfrac{\partial (x^Ty)}{\partial y}$$ where $y = Ax$. Working this out term by term,

$$\dfrac{d (x^TAx)}{d x} = \dfrac{\partial (x^Ty)}{\partial x} + \dfrac{d( y(x)^T)}{d x} \dfrac{\partial (x^Ty)}{\partial y} = y + \dfrac{d (x^TA^T)}{d x} x = y + A^Tx = (A+A^T)x$$
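As a quick sanity check of this result, here is a minimal NumPy sketch (the random non-symmetric $A$ and the random $x$ are assumptions chosen purely for the test) comparing a central-difference gradient of $x^TAx$ against $(A+A^T)x$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # deliberately non-symmetric
x = rng.standard_normal(n)

def f(z):
    return z @ A @ z              # the quadratic form z^T A z as a scalar

eps = 1e-6
# central differences, one coordinate direction at a time
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
grad_closed = (A + A.T) @ x

print(np.allclose(grad_fd, grad_closed, atol=1e-6))   # True
```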

LIBayes
  • To help future generations: the full specification of the chain rule used here is $$ \frac{df(g,h)}{dx} = \frac{d(g(x)^T)}{dx} \frac{\partial f(g,h)}{\partial g} + \frac{d(h(x)^T)}{dx} \frac{\partial f(g,h)}{\partial h} $$ The order of multiplication is very important since we're dealing with vectors! – Neil Traft Sep 23 '14 at 09:58
  • The first statement seems wrong to me. Isn't the right statement $ \nabla_x(x^Ty) = y$? @NeilTraft – Charlie Parker Oct 21 '17 at 23:22
  • For example, how does the answerer know whether $\dfrac{\partial y^T}{\partial x}$ goes on the left or on the right, or whether there is a transpose or not? Or maybe I'm just unfamiliar with the chain rule using gradients and only know it using partial derivatives. – Charlie Parker Oct 21 '17 at 23:37
  • Also notice that the derivative with respect to a column vector is a row vector, and vice versa. (I learned this from @copper.hat https://math.stackexchange.com/questions/189434/derivative-of-quadratic-form). However, the _gradient_ is represented as a column vector. – makansij May 20 '18 at 15:34
  • Where can someone learn about these differentiation rules? In my standard analysis and calculus courses, we didn't see the differentiation of matrices or vectors, only of multivariate functions. – Euler_Salter Oct 15 '18 at 09:44
  • @Euler_Salter just write it as a multivariate function of the components of the vector $x$ and take the gradient. You will get the same result. – j. kookalinski Jan 20 '19 at 21:49
  • That is actually very helpful! – Euler_Salter Jan 21 '19 at 12:23
  • @Euler_Salter http://sites.science.oregonstate.edu/math/home/programs/undergrad/CalculusQuestStudyGuides/vcalc/chain/chain.html – Learning stats by example Oct 17 '20 at 21:31
  • @NeilTraft I am confused: since $f$ is a scalar function, $\partial f / \partial g$ should also be a scalar function, so the order shouldn't matter. What am I missing? – Makogan Jan 25 '22 at 06:50
  • I don't remember what's going on here super well, but I think in the chain rule I wrote, $g$ and $h$ are vector functions (for example $x^TA^T$), so the partial w.r.t. $g$ is also a vector? – Neil Traft Jan 26 '22 at 17:09

Let $f : \mathbb R^n \to \mathbb R$ be defined by

$$f (\mathrm x) := \mathrm x^\top \mathrm A \, \mathrm x$$

Hence,

$$f (\mathrm x + h \mathrm v) = (\mathrm x + h \mathrm v)^\top \mathrm A \, (\mathrm x + h \mathrm v) = f (\mathrm x) + h \, \mathrm v^\top \mathrm A \,\mathrm x + h \, \mathrm x^\top \mathrm A \,\mathrm v + h^2 \, \mathrm v^\top \mathrm A \,\mathrm v$$

Thus, the directional derivative of $f$ in the direction of $\rm v$ at $\rm x$ is

$$\lim_{h \to 0} \frac{f (\mathrm x + h \mathrm v) - f (\mathrm x)}{h} = \mathrm v^\top \mathrm A \,\mathrm x + \mathrm x^\top \mathrm A \,\mathrm v = \langle \mathrm v , \mathrm A \,\mathrm x \rangle + \langle \mathrm A^\top \mathrm x , \mathrm v \rangle = \left\langle \mathrm v , \color{blue}{\left(\mathrm A + \mathrm A^\top\right) \,\mathrm x} \right\rangle$$

Lastly, the gradient of $f$ with respect to $\rm x$ is

$$\nabla_{\mathrm x} \, f (\mathrm x) = \color{blue}{\left(\mathrm A + \mathrm A^\top\right) \,\mathrm x}$$
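The limit above is easy to watch converge numerically; the following minimal sketch (random $\mathrm A$, $\mathrm x$, and direction $\mathrm v$, chosen only for illustration) prints the difference quotient approaching $\langle \mathrm v, (\mathrm A + \mathrm A^\top)\,\mathrm x\rangle$ as $h \to 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
v = rng.standard_normal(n)

def f(z):
    return z @ A @ z              # quadratic form z^T A z

target = v @ (A + A.T) @ x        # <v, (A + A^T) x>
for h in (1e-1, 1e-3, 1e-5):
    quotient = (f(x + h * v) - f(x)) / h
    print(f"h={h:.0e}  quotient={quotient:+.8f}  target={target:+.8f}")
```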


Rodrigo de Azevedo

I am writing this answer for future reference and for clarity, because the accepted answer is not completely correct and may cause confusion.

I will give a simple proof to better explain the multiplication rule used in the calculation of $\nabla\mathbf{x^TAx}$.

For $\mathbf{x}\in \mathbb{R}^n$ and $\mathbf{A}\in \mathbb{R}^{n\times n}$ let: $$f(g(\mathbf{x}),h(\mathbf{x}))=\langle g(\mathbf{x}),h(\mathbf{x})\rangle=g^T(\mathbf{x})h(\mathbf{x})$$

Where: \begin{equation} \begin{split} &g(\mathbf{x})=\mathbf{x}\\ &h(\mathbf{x})=\mathbf{Ax} \end{split} \end{equation}

From the definition of $f$, it is obvious that $f(g(\mathbf{x}),h(\mathbf{x}))=\mathbf{x^TAx}$.

In order to calculate the derivative, we will use the following fundamental properties, where $\mathbf{I}$ is the identity matrix:

\begin{equation} \begin{split} &\dfrac{\partial \mathbf{A^Tx}}{\partial\mathbf{x}}=\dfrac{\partial \mathbf{x^TA}}{\partial\mathbf{x}}=\mathbf{A^T}\\ &\dfrac{\partial \mathbf{x}}{\partial\mathbf{x}}=\mathbf{I} \end{split} \end{equation}

Hence, from the multiplication rule (see the product rule for matrix calculus on Wikipedia), we get: \begin{equation} \begin{split} \dfrac{df(g(\mathbf{x}),h(\mathbf{x}))}{d\mathbf{x}}&=g^T(\mathbf{x})\dfrac{\partial h(\mathbf{x})}{\partial\mathbf{x}}+h^T(\mathbf{x})\dfrac{\partial g(\mathbf{x})}{\partial\mathbf{x}}=\\ &=\mathbf{x^T}\dfrac{\partial \mathbf{Ax}}{\partial\mathbf{x}}+(\mathbf{Ax})^T\dfrac{\partial \mathbf{x}}{\partial\mathbf{x}}=\\ &=\mathbf{x^TA}+\mathbf{x^TA^TI}=\\ &=\mathbf{x^TA}+\mathbf{x^TA^T}=\\ &=\mathbf{x^T}(\mathbf{A+A^T}) \end{split} \end{equation}

As a result, from the definition of the gradient, we get: $$ \nabla f=\Bigg(\dfrac{df}{d\mathbf{x}}\Bigg)^T=(\mathbf{x^T}(\mathbf{A+A^T}))^T=(\mathbf{A^T+A})\mathbf{x} $$

Note: The reason I did the proof this way is to make it more generalizable: you can plug in arbitrary functions $g$ and $h$ and use the above multiplication rule to derive the result.
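Since the layout conventions are exactly where the confusion tends to arise, here is a minimal numerical sketch of the last two steps (the random $\mathbf{A}$ and $\mathbf{x}$ are illustrative assumptions): keeping $\mathbf{x}$ as an explicit $n \times 1$ column makes it visible that the row of partials is $\mathbf{x^T(A+A^T)}$, while its transpose is the gradient $(\mathbf{A^T+A})\mathbf{x}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n))
x = rng.standard_normal((n, 1))   # keep x as an explicit column vector

def f(z):
    return (z.T @ A @ z).item()   # scalar value of z^T A z

eps = 1e-6
E = np.eye(n)
# df/dx as a 1 x n row of central-difference partials
row = np.array([[(f(x + eps * E[:, [k]]) - f(x - eps * E[:, [k]])) / (2 * eps)
                 for k in range(n)]])

print(np.allclose(row, x.T @ (A + A.T)))    # df/dx     = x^T (A + A^T)
print(np.allclose(row.T, (A.T + A) @ x))    # gradient  = (A^T + A) x
```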

G.Margaritis
  • Thank you. To date this is the simplest of all answers. However, reading [this first](https://en.wikipedia.org/wiki/Matrix_calculus#Layout_conventions) might be helpful. – Ombrophile Aug 03 '20 at 07:56

There is another way to calculate the most complex term, $\frac{\partial}{\partial x_k} \mathbf{x}^T A \mathbf{x}$. It requires nothing but the partial derivative with respect to a single variable instead of a vector.

This answer is for those who are not very familiar with partial derivatives and the chain rule for vectors (for example, me). It may look long, but only because I write down all the details. :)

Firstly, expanding the quadratic form yields: $$ \begin{align} f := \frac{\partial}{\partial x_k} \mathbf{x}^T A \mathbf{x} = \frac{\partial}{\partial x_k} \sum_{i=1}^N \sum_{j=1}^N a_{ij} x_i x_j = \sum_{i=1}^N \sum_{j=1}^N a_{ij}\frac{\partial}{\partial x_k}(x_i x_j) \end{align} $$ Since $$ \frac{\partial}{\partial x_k}(x_i x_j) = \begin{cases} 2x_k, && \text{if } k = i = j \\ x_j, && \text{if } k = i \neq j \\ x_i, && \text{if } k = j \neq i \\ 0, && \text{otherwise} \end{cases} $$ the double sum collapses to $$ f = \sum_{j=1}^N a_{kj} x_j + \sum_{i=1}^N a_{ik} x_i $$ (the $k = i = j$ term contributes $a_{kk} x_k$ to each of the two sums). Almost done! Now we only need some simplification. Recall the very simple rule that $$ \sum_{i=1}^N x_i y_i = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}^T \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} = \mathbf{x}^T \mathbf{y} $$ Thus $$ \begin{align} f &= \text{(k-th row of A) } \mathbf{x} + \text{(k-th column of A)}^T \mathbf{x} \end{align} $$ Now it is time to assemble the gradient from these partial derivatives! $$ \begin{align} \nabla_\mathbf{x} \mathbf{x}^T A \mathbf{x} & = \begin{bmatrix} \frac{\partial \mathbf{x}^T A \mathbf{x}}{\partial x_1} \\ \vdots \\ \frac{\partial \mathbf{x}^T A \mathbf{x}}{\partial x_k} \\ \vdots \\ \frac{\partial \mathbf{x}^T A \mathbf{x}}{\partial x_N} \\ \end{bmatrix} = \begin{bmatrix} \vdots \\ \text{(k-th row of A) } \mathbf{x} + \text{(k-th column of A)}^T \mathbf{x} \\ \vdots \end{bmatrix} \\ &= \left( \begin{bmatrix} \vdots \\ \text{(k-th row of A) } \\ \vdots \end{bmatrix} + \begin{bmatrix} \vdots \\ \text{(k-th column of A) }^T \\ \vdots \end{bmatrix} \right) \mathbf{x} \\ &= (A + A^T)\mathbf{x} \end{align} $$ So we are done! The answer is: $$ \nabla_\mathbf{x} \mathbf{x}^T A \mathbf{x} = (A + A^T)\mathbf{x} $$
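The component-wise formula above is also straightforward to test; in this minimal sketch (the random $A$ and $\mathbf{x}$ are assumed only for the check), each finite-difference partial is compared against (k-th row of $A$)$\,\mathbf{x}$ plus (k-th column of $A$)$^T\,\mathbf{x}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def f(z):
    return z @ A @ z

eps = 1e-6
for k in range(n):
    e = np.zeros(n)
    e[k] = 1.0                              # k-th coordinate direction
    fd = (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    closed = A[k, :] @ x + A[:, k] @ x      # k-th row of A times x, plus
                                            # k-th column of A (transposed) times x
    print(k, np.isclose(fd, closed))
```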

ch271828n

Yet another approach.

We will utilize the following identities:

  • Trace and Frobenius product relation $$\left\langle A, B \right\rangle={\rm tr}(A^TB) = A:B$$ or $$\left\langle A^T, B \right\rangle ={\rm tr}(AB) = A^T:B$$
  • Cyclic property of Trace/Frobenius product \begin{align} \left\langle A, B C \right\rangle \equiv A:BC &= AC^T:B\\ &= B^TA:C\\ &= BC : A\\ &= {\text{etc.}} \cr \end{align}

Let $f(x) := \left( \frac{1}{2} x^T A x - b^T x + c \right) $.

We obtain the differential first, and then the gradient subsequently. \begin{align} d\,f(x) &= d\left( \frac{1}{2} x^T A x - b^T x + c \right) \\ &= d\left( \frac{1}{2} \left( x: A x \right) - \left( b : x \right) + c \right) \\ &= \frac{1}{2} \left[ \left( dx: A x \right) + \left( x: A dx \right) \right] - \left( b : dx \right) \\ &= \frac{1}{2} \left[ \left( A x : dx \right) + \left( A^Tx: dx \right) \right] - \left( b : dx \right) \\ &= \frac{1}{2} \left[ \left(A + A^T \right)x: dx \right] - \left( b : dx \right) \\ &= \left( \frac{1}{2} \left[ \left(A + A^T \right)x\right] - b \right): dx \\ \end{align}

Thus the gradient is $$ \eqalign { \frac { \partial} {\partial x}f(x) &= \frac{1}{2} \left[ \left(A + A^T \right)x\right] - b \\ &= \frac{1}{2} A^T x + \frac{1}{2} A x - b . \cr } $$
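To close the loop on the original question, here is a minimal sketch that checks this final gradient numerically (the random $A$, $b$, $c$, and $x$ are assumptions made for the test only):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
c = rng.standard_normal()

def f(x):
    return 0.5 * x @ A @ x - b @ x + c    # (1/2) x^T A x - b^T x + c

x = rng.standard_normal(n)
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
grad_closed = 0.5 * (A + A.T) @ x - b     # (1/2)(A + A^T) x - b

print(np.allclose(grad_fd, grad_closed, atol=1e-6))   # True
```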

user550103