I'm studying the EM algorithm, and at one point my reference takes a derivative of a function with respect to a matrix. Could someone explain how one takes the derivative of a function with respect to a matrix? I don't understand the idea. For example, let's say we have a multidimensional Gaussian density:

$$f(\textbf{x}, \Sigma, \boldsymbol \mu) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\exp\left( -\frac{1}{2}(\textbf{x}-\boldsymbol \mu)^T\Sigma^{-1}(\textbf{x}-\boldsymbol \mu)\right),$$

where $\textbf{x} = (x_1, ..., x_n)$, $\;\;x_i \in \mathbb R$, $\;\;\boldsymbol \mu = (\mu_1, ..., \mu_n)$, $\;\;\mu_i \in \mathbb R$ and $\Sigma$ is the $n\times n$ covariance matrix.

How would one calculate $\displaystyle \frac{\partial f}{\partial \Sigma}$? What about $\displaystyle \frac{\partial f}{\partial \boldsymbol \mu}$ or $\displaystyle \frac{\partial f}{\partial \textbf{x}}$ (aren't these two actually just special cases of the first one?)

Thanks for any help. If you're wondering where I got this question, it came from reading this reference: (page 14)



I added the particular part from my reference here in case someone is interested :) I highlighted the parts where I got confused, namely where the author takes the derivative with respect to a matrix (the $\Sigma$ in the picture is also a covariance matrix; the author is estimating the optimal parameters of a Gaussian mixture model using the EM algorithm):

$Q(\theta|\theta_n)\equiv E_Z\{\log p(Z,X|\theta)|X,\theta_n\}$

[screenshot from the reference showing the derivative taken with respect to $\Sigma$]

  • Possibly helpful: http://math.stackexchange.com/questions/94562/matrix-vector-derivative – dreamer Dec 30 '13 at 12:18
  • How is the function $Q(\theta|\theta_n)$ defined in the screenshot? – dreamer Dec 30 '13 at 12:24
  • @Dreamer I'll add it into the post. One sec; you can also see it in the reference, pages 7-8 – jjepsuomi Dec 30 '13 at 12:25
  • I added the definition of $Q(\theta|\theta_n)$. However, I only need to understand how the author is doing the calculations in the reference. A simple example using a multidimensional Gaussian is enough :) I can then calculate it myself for my specific problem (the EM algorithm). – jjepsuomi Dec 30 '13 at 12:29

2 Answers


It's not really a derivative with respect to a matrix. It's the derivative of $f$ with respect to each element of the matrix, and the results are collected into a matrix.

Although the calculations are different, it is the same idea as a Jacobian matrix. Each entry is a derivative with respect to a different variable.

The same goes for $\frac{\partial f}{\partial \mu}$: it is a vector made of the derivatives with respect to each element of $\mu$.

You could think of them as $$\bigg[\frac{\partial f}{\partial \Sigma}\bigg]_{i,j} = \frac{\partial f}{\partial \sigma^2_{i,j}} \qquad \text{and}\qquad \bigg[\frac{\partial f}{\partial \mu}\bigg]_i = \frac{\partial f}{\partial \mu_i}$$ where $\sigma^2_{i,j}$ is the $(i,j)$th covariance in $\Sigma$ and $\mu_i$ is the $i$th element of the mean vector $\mu$.
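To make the elementwise definition concrete, here is a small numerical sanity check (my own sketch, not from the reference): it compares finite-difference derivatives of the Gaussian density, entry by entry, against the standard closed forms $\frac{\partial f}{\partial \mu} = f\,\Sigma^{-1}(x-\mu)$ and $\frac{\partial f}{\partial \Sigma} = \frac{f}{2}\left(\Sigma^{-1}(x-\mu)(x-\mu)^T\Sigma^{-1} - \Sigma^{-1}\right)$, treating all $n^2$ entries of $\Sigma$ as independent variables (no symmetry constraint).

```python
import numpy as np

def gaussian(x, Sigma, mu):
    """Multivariate Gaussian density f(x, Sigma, mu)."""
    n = len(x)
    d = x - mu
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm

rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)   # symmetric positive definite
mu = rng.normal(size=n)
x = rng.normal(size=n)

f = gaussian(x, Sigma, mu)
Sinv = np.linalg.inv(Sigma)
d = x - mu

# closed-form derivatives (entries of Sigma treated as independent)
grad_mu = f * Sinv @ d
grad_Sigma = 0.5 * f * (Sinv @ np.outer(d, d) @ Sinv - Sinv)

# central finite differences, one entry at a time
eps = 1e-6
num_mu = np.zeros(n)
for i in range(n):
    e = np.zeros(n); e[i] = eps
    num_mu[i] = (gaussian(x, Sigma, mu + e) - gaussian(x, Sigma, mu - e)) / (2 * eps)

num_Sigma = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = eps
        num_Sigma[i, j] = (gaussian(x, Sigma + E, mu) - gaussian(x, Sigma - E, mu)) / (2 * eps)

assert np.allclose(grad_mu, num_mu)
assert np.allclose(grad_Sigma, num_Sigma)
```

Each entry of the numerical matrix matches the corresponding entry of the closed form, which is exactly what the definition $\big[\frac{\partial f}{\partial \Sigma}\big]_{i,j} = \frac{\partial f}{\partial \sigma^2_{i,j}}$ says.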


You can view this the same way you would view a function of any vector. A matrix is just a vector in a normed space, where the norm can be chosen in any number of ways. One possible norm is the root-mean-square of the entries; another is the sum of the absolute values of the entries; yet another is the norm of the matrix as a linear operator on a vector space with its own norm.

What is significant is that the invertible matrices form an open set, so a derivative can make sense. What you have to do is find a way to approximate $$ f(x,\Sigma + \Delta\Sigma,\mu)-f(x,\Sigma,\mu)$$ as a linear function of $\Delta\Sigma$. I would use a power series to find a linear approximation. For example, $$ (\Sigma+\Delta\Sigma)^{-1}=\Sigma^{-1}(I+(\Delta\Sigma) \Sigma^{-1})^{-1} =\Sigma^{-1} \sum_{n=0}^{\infty}(-1)^{n}\{ (\Delta\Sigma)\Sigma^{-1}\}^{n} \approx \Sigma^{-1}(I-(\Delta\Sigma)\Sigma^{-1}).$$ Such a series converges for $\|\Delta\Sigma\|$ small enough (in whatever norm you choose). In the language of derivatives, $$ \left(\frac{d}{d\Sigma} \Sigma^{-1}\right)\Delta\Sigma = -\Sigma^{-1}(\Delta\Sigma)\Sigma^{-1}. $$ Remember that the derivative is a linear operator acting on $\Delta\Sigma$; if you squint you can almost see the classical formula $\frac{d}{dx}x^{-1} =-x^{-2}$. Chain rules for derivatives apply, so that's how you can handle the exponential composed with matrix inversion.
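The first-order expansion above is easy to check numerically (my own sketch): the error of the linear approximation $(\Sigma+\Delta\Sigma)^{-1} \approx \Sigma^{-1} - \Sigma^{-1}(\Delta\Sigma)\Sigma^{-1}$ should shrink like $\|\Delta\Sigma\|^2$, since the dropped terms of the series are second order and higher.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)   # positive definite, hence invertible
Sinv = np.linalg.inv(Sigma)
D0 = rng.normal(size=(n, n))      # an arbitrary perturbation direction

errs = []
for t in [1e-1, 1e-2, 1e-3]:
    D = t * D0
    exact = np.linalg.inv(Sigma + D)
    linear = Sinv - Sinv @ D @ Sinv   # first-order approximation
    errs.append(np.linalg.norm(exact - linear))

# shrinking t by a factor of 10 should shrink the error by roughly 100,
# confirming the remainder is second order in ||Delta Sigma||
assert errs[1] < errs[0] / 50
assert errs[2] < errs[1] / 50
```

This quadratic decay of the remainder is precisely what it means for $\Delta\Sigma \mapsto -\Sigma^{-1}(\Delta\Sigma)\Sigma^{-1}$ to be the derivative of the inversion map at $\Sigma$.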

Disintegrating By Parts