I'm learning Principal Component Analysis (PCA) and have learned that the eigenvectors of the covariance matrix of the data are the principal components, which maximize the variance of the projected data. I understand the intuition behind wanting the variance of the projected data to be as large as possible.

From this answer, I don't understand the following line:

The unit vector $u$ which maximizes the variance $u^T\Sigma u$ is nothing but the eigenvector with the largest eigenvalue.

I know from this answer how the variance of the projected data points is $u^T\Sigma u$. But I don't understand why this is maximized when $u$ is chosen to be the eigenvector of $\Sigma$ with the largest eigenvalue.

Intuitively, I see eigenvectors as the vectors whose direction stays fixed under the given linear transformation (they may be scaled, and the scaling factors are the eigenvalues). Source: this answer and this video.

I can't see why a vector whose direction is fixed under the given linear transformation should give the highest variance. Any intuitive explanation will help! Thanks.

Rodrigo de Azevedo
  • This comes from the Rayleigh Quotient. http://www.svcl.ucsd.edu/courses/ece271B-F09/handouts/Dimensionality3.pdf look on pg 12. –  May 02 '19 at 22:19
  • The SVD algorithm decomposes $X = V^\top S U$ by iteratively subtracting the best approximation $s_j v_j u_j^\top$ from $X$. After all, $X^\top X = U^\top D U$ with $D = S^\top S$. – reuns May 03 '19 at 01:18

4 Answers


Let the spectral decomposition of $\Sigma$ be $\Sigma=PDP^T,$ where $P$ is orthonormal and $D$ is diagonal. Then $u^T\Sigma u=\displaystyle\sum_{i=1}^d\lambda_i(p_i^Tu)^2,$ where $p_i$ is the $i^{\text{th}}$ column of $P$, in other words, the $i^\text{th}$ eigenvector of $\Sigma.$

We want to find $u$ such that $\displaystyle\sum_{i=1}^d\lambda_i(p_i^Tu)^2$ is maximized. Since the $p_i$'s form an orthonormal basis, $\displaystyle\sum_{i=1}^d(p_i^Tu)^2=1.$ Consider the optimization problem: $$\text{Maximize }\displaystyle\sum_{i=1}^d\lambda_iz_i^2\text{ subject to }\sum_{i=1}^dz_i^2=1.$$ Suppose $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_d.$ The maximum is attained at $z_1=\pm 1,$ $z_i=0$ for the rest, because shifting any weight from $z_1^2$ to another $z_i^2$ cannot increase the objective function. That would mean $$p_1^Tu=\pm 1,\text{ and }p_i^Tu=0\text{ for all }i\neq 1.$$ By the equality condition in the Cauchy–Schwarz inequality, $|p_1^Tu|=1\iff u=c\,p_1$ for some constant $c,$ and the norm-$1$ constraint then gives $u=\pm p_1.$
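The two steps above (the spectral identity and the constrained maximum) are easy to check numerically. A minimal NumPy sketch, using a randomly generated symmetric positive semi-definite matrix in place of a real covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive semi-definite matrix playing the role of Sigma
# (an illustrative stand-in, not real data).
A = rng.standard_normal((5, 5))
Sigma = A @ A.T

# Spectral decomposition: columns of P are orthonormal eigenvectors;
# eigh returns eigenvalues in ascending order.
lam, P = np.linalg.eigh(Sigma)

# For any unit vector u, u' Sigma u = sum_i lambda_i (p_i' u)^2.
u = rng.standard_normal(5)
u /= np.linalg.norm(u)
lhs = u @ Sigma @ u
rhs = np.sum(lam * (P.T @ u) ** 2)
assert np.isclose(lhs, rhs)

# The top eigenvector attains the largest eigenvalue as its variance,
# and no random unit vector exceeds it.
p_top = P[:, -1]
assert np.isclose(p_top @ Sigma @ p_top, lam[-1])
for _ in range(1000):
    v = rng.standard_normal(5)
    v /= np.linalg.norm(v)
    assert v @ Sigma @ v <= lam[-1] + 1e-10
```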

Arnab Auddy

Take $A$ as the covariance matrix, $x$ an eigenvector, and $\lambda$ the corresponding eigenvalue, so that $Ax=\lambda x$. There is an intuitive way to think about it. What we are actually comparing is how much $A$ stretches each direction: the more a unit direction is stretched, the more variance that direction represents. The image of the unit vector $x/\lVert x\rVert$ under $A$ is $Ax/\lVert x\rVert = \lambda x/\lVert x\rVert$, whose length is $|\lambda|$. Since $x/\lVert x\rVert$ is a unit vector, we only need to compare the $\lambda$'s. Therefore, the eigenvector with the largest $\lambda$ retains the largest variance.
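As a sanity check of this idea, one can sample data with a known covariance and verify that the variance of the data projected onto each unit eigenvector equals the corresponding eigenvalue. The covariance matrix below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample correlated 2-D data with an arbitrary illustrative covariance.
true_cov = np.array([[3.0, 1.5], [1.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=50_000)

Sigma = np.cov(X, rowvar=False)      # sample covariance matrix
lam, vecs = np.linalg.eigh(Sigma)    # eigenvalues in ascending order

# Var(x' u) = u' Sigma u = lambda for each unit eigenvector u,
# so the last (largest) eigenvalue is the maximal projected variance.
for i in range(2):
    u = vecs[:, i]
    proj_var = np.var(X @ u, ddof=1)
    assert np.isclose(proj_var, lam[i])
```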


I think the previous answers already contain the meat of the matter but I just want to add the probability point of view so that it's a bit more obvious what the question is asking and why that's the right answer.

Suppose I have some data drawn from some distribution, $x_1,x_2,\ldots\sim\mathcal D$, where $x_i\in \mathbb R^n$. I want to find a 1-dimensional representation of the $x_i$ in which the variance of the data is maximized. All projections onto 1-dimensional subspaces can be written as $$ P_c(x) = c^Tx $$ for some $c$ with $c^Tc = 1$.

Your question is basically, what is the choice of $c$ such that the variance $$ \mathrm{Var}(P_c(x)) = \mathbb E\left[\left(c^Tx-c^T\bar x\right)^2\right] = c^T\underbrace{(\mathbb E[xx^T] - \bar x \bar x^T)}_{=\Sigma}c $$ is maximized, where $\bar x = \mathbb E[x]$.

So, you just need to find the normalized vector $c$ such that $c^T\Sigma c$ is as large as possible. This is $c$ = the eigenvector corresponding to the largest eigenvalue of $\Sigma$. To see this, take the eigenvalue decomposition $\Sigma = PD P^T$ where $P = (p_1,...,p_n)$ and $D = \mathrm{diag}(d_1,...,d_n)$. Without loss of generality, we assume $d_1 \geq d_2 \geq \cdots \geq d_n$. Then taking a change of variables $v = P^Tc$, we can see that

$$ c^T\Sigma c = \sum_{i=1}^n v_i^2 d_i \leq d_1 \underbrace{v^Tv}_{=1} $$ since $v^Tv = c^TP^TPc=c^Tc = 1$. This maximum is achieved with equality only if $v = (\pm 1,0,\ldots,0)$ (assuming $d_1 > d_2$), which corresponds to $c = \pm p_1$.
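The change of variables $v = P^Tc$ can also be verified numerically; here is a small sketch with a random symmetric matrix standing in for $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Random symmetric positive semi-definite stand-in for Sigma.
B = rng.standard_normal((4, 4))
Sigma = B @ B.T

d, P = np.linalg.eigh(Sigma)        # ascending eigenvalues d_i, orthonormal P

c = rng.standard_normal(4)
c /= np.linalg.norm(c)              # unit vector c

# Change of variables v = P' c: c' Sigma c = sum_i v_i^2 d_i <= d_max.
v = P.T @ c
quad = c @ Sigma @ c
assert np.isclose(quad, np.sum(v ** 2 * d))
assert np.isclose(v @ v, 1.0)       # orthogonal change of variables keeps the norm
assert quad <= d.max() + 1e-12

# Equality is attained when c is the top eigenvector.
c_star = P[:, -1]
assert np.isclose(c_star @ Sigma @ c_star, d.max())
```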

Y. S.

An intuitive alternative (similar to Arnab's answer): by the spectral theorem we can write $$u^TAu = u^TQDQ^Tu,$$ where $D$ is a diagonal matrix containing the eigenvalues of $A$ and $Q$ is orthogonal. (I write $Q$ for the orthogonal factor to avoid a clash with the covariance matrix $\Sigma$.) Expanding the product gives $$u^TQDQ^Tu=\sum_{i=1}^{p} \lambda_i \langle u,q_i \rangle^2,$$ where the $\lambda_i$ are the eigenvalues of $A$, the $q_i$ are the column vectors of $Q$, and $p$ is the length of $u$. From Axler's Linear Algebra Done Right, Thm 6.30, we get that $$\sum_{i=1}^{p} \langle u,q_i\rangle^2=\lVert u\rVert^2=1.$$ The last equality comes from the fact that we are only considering unit vectors. We can now interpret $\sum_{i=1}^{p} \lambda_i \langle u,q_i\rangle^2$ as a weighted average of the $\lambda_i$'s, with non-negative weights summing to $1$. Such a weighted average is at most the largest eigenvalue, so the best value the objective function can attain is the largest eigenvalue, and (when that eigenvalue is simple) this is achieved exactly when $u$ is the corresponding unit eigenvector, up to sign.
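The weighted-average interpretation can be illustrated in a few lines of NumPy (the symmetric matrix below is chosen arbitrarily): for every unit vector $u$, the weights $\langle u,q_i\rangle^2$ sum to $1$, so $u^TAu$ lands between the smallest and largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)

# Arbitrary symmetric matrix playing the role of A.
M = rng.standard_normal((6, 6))
A = M @ M.T

lam, Q = np.linalg.eigh(A)          # columns of Q are the eigenvectors q_i

u = rng.standard_normal(6)
u /= np.linalg.norm(u)              # unit vector

# Weights <u, q_i>^2 are non-negative and sum to ||u||^2 = 1, so
# u' A u is a weighted average (convex combination) of the eigenvalues.
w = (Q.T @ u) ** 2
assert np.isclose(w.sum(), 1.0)
assert np.isclose(u @ A @ u, np.sum(w * lam))
assert lam.min() - 1e-12 <= u @ A @ u <= lam.max() + 1e-12
```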