
Singular value decomposition (SVD) and principal component analysis (PCA) are two eigenvalue methods used to reduce a high-dimensional data set into fewer dimensions while retaining important information. Online articles say that these methods are 'related' but never specify the exact relation.

What is the intuitive relationship between PCA and SVD? As PCA uses the SVD in its calculation, clearly there is some 'extra' analysis done. What does PCA 'pay attention' to differently than the SVD? What kinds of relationships do each method utilize more in their calculations? Is one method 'blind' to a certain type of data that the other is not?

Rodrigo de Azevedo
wickedchicken
  • SVD and PCA and "total least-squares" (and several other names) are the same thing. It computes the orthogonal transform that decorrelates the variables and keeps the ones with the largest variance. There are two numerical approaches: one by SVD of the (centered) data matrix, and one by eigendecomposition of this matrix "squared" (covariance). –  Jun 10 '14 at 08:21
  • Here is a link to a very similar thread on CrossValidated.SE: [Relationship between SVD and PCA. How to use SVD to perform PCA?](http://stats.stackexchange.com/questions/134282) It covers similar ground to J.M.'s answer (+1 by the way), but in somewhat more detail. – amoeba Jan 24 '15 at 23:28
  • [how-to-find-straight-line-minimizing-the-sum-of-squares-of-euclidean-distances-f](http://stats.stackexchange.com/questions/52807/how-to-find-straight-line-minimizing-the-sum-of-squares-of-euclidean-distances-f) on stats.stackexchange has some links on the relationship between orthogonal regression and PCA. – denis Aug 30 '15 at 12:44

4 Answers


(I assume for the purposes of this answer that the data has been preprocessed to have zero mean.)

Simply put, the PCA viewpoint requires that one compute the eigenvalues and eigenvectors of the covariance matrix, which is the product $\frac{1}{n-1}\mathbf X\mathbf X^\top$, where $\mathbf X$ is the data matrix whose columns are the $n$ mean-centered samples and whose rows are the features. Since the covariance matrix is symmetric, it is diagonalizable, and the eigenvectors can be normalized such that they are orthonormal:

$\frac{1}{n-1}\mathbf X\mathbf X^\top=\frac{1}{n-1}\mathbf W\mathbf D\mathbf W^\top,$

where the columns of $\mathbf W$ are the orthonormal eigenvectors (the principal directions) and $\mathbf D$ is the diagonal matrix of eigenvalues.

On the other hand, applying the SVD to the data matrix $\mathbf X$ gives

$\mathbf X=\mathbf U\mathbf \Sigma\mathbf V^\top$

and attempting to construct the covariance matrix from this decomposition gives $$ \frac{1}{n-1}\mathbf X\mathbf X^\top =\frac{1}{n-1}(\mathbf U\mathbf \Sigma\mathbf V^\top)(\mathbf U\mathbf \Sigma\mathbf V^\top)^\top = \frac{1}{n-1}(\mathbf U\mathbf \Sigma\mathbf V^\top)(\mathbf V\mathbf \Sigma\mathbf U^\top) $$

and since $\mathbf V$ is an orthogonal matrix ($\mathbf V^\top \mathbf V=\mathbf I$),

$\frac{1}{n-1}\mathbf X\mathbf X^\top=\frac{1}{n-1}\mathbf U\mathbf \Sigma^2 \mathbf U^\top$

and the correspondence is easily seen: the singular values of $\mathbf X$ are the square roots of the eigenvalues of $\mathbf X\mathbf X^\top$, the eigenvalues of the covariance matrix are the squared singular values scaled by $\frac{1}{n-1}$, and the matrix of left singular vectors $\mathbf U$ coincides with the eigenvector matrix $\mathbf W$, up to column ordering and a possible sign flip in each column.
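As a quick numerical sanity check, here is a minimal NumPy sketch of both routes (assuming, as above, that the columns of $\mathbf X$ are the mean-centered samples; the random data is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))            # 5 features (rows) x 200 samples (columns)
X = X - X.mean(axis=1, keepdims=True)    # center each feature
n = X.shape[1]

# PCA route: eigendecomposition of the covariance matrix (1/(n-1)) X X^T
evals, W = np.linalg.eigh(X @ X.T / (n - 1))
evals, W = evals[::-1], W[:, ::-1]       # reorder to decreasing eigenvalues

# SVD route: decompose X directly
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(evals, s**2 / (n - 1)))   # eigenvalues = squared singular values / (n - 1)
print(np.allclose(np.abs(W), np.abs(U)))    # same directions, up to a sign per column
```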

In fact, using the SVD to perform PCA makes much better sense numerically than forming the covariance matrix to begin with, since forming $\mathbf X\mathbf X^\top$ can cause loss of precision. This is detailed in books on numerical linear algebra (e.g., Golub and Van Loan), but I'll leave you with an example, the Läuchli matrix, which can be decomposed stably by the SVD but for which forming $\mathbf X\mathbf X^\top$ is numerically disastrous:

$\begin{pmatrix}1&1&1\\ \epsilon&0&0\\0&\epsilon&0\\0&0&\epsilon\end{pmatrix}^\top,$

where $\epsilon$ is a tiny number.
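Here is a minimal NumPy sketch of that example, with $\epsilon=10^{-8}$ chosen so that $\epsilon^2$ is lost when added to $1$ in double precision:

```python
import numpy as np

eps = 1e-8                                   # eps**2 = 1e-16 is lost when added to 1.0 in doubles
A = np.array([[1.0, eps, 0.0, 0.0],
              [1.0, 0.0, eps, 0.0],
              [1.0, 0.0, 0.0, eps]])         # the (transposed) Läuchli matrix above

# SVD of A directly: the two tiny singular values survive.
print(np.linalg.svd(A, compute_uv=False))    # roughly [1.73e+00, 1e-08, 1e-08]

# Going through the Gram matrix A A^T destroys them: it is numerically rank one.
gram_eigs = np.linalg.eigvalsh(A @ A.T)      # ascending order
print(np.sqrt(np.clip(gram_eigs, 0.0, None)))  # roughly [0, 0, 1.73e+00]
```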

amWhy
J. M. ain't a mathematician
  • To give a *Mathematica* example: `a = SparseArray[{{i_, 1} -> 1, {i_, j_} /; i + 1 == j :> $MachineEpsilon}, {3, 4}];` and then compare `Sqrt[Eigenvalues[a.Transpose[a]]]` and `SingularValueList[a, Tolerance -> 0]`. – J. M. ain't a mathematician Sep 02 '10 at 14:13
  • I see. So, in essence, SVD compares U to V, while PCA compares U to U itself. SVD gives you an eigenvector decomposition of the data, while PCA takes that decomposition and compares one side to itself to see which ones are more dominant. – wickedchicken Oct 29 '10 at 15:55
  • @J.M. could you elaborate on why the $AA^T$ calculation is disastrous for the matrix you've given? I calculated $AA^T$ for the numbers you've specified, and I get a 3x3 matrix of all 1's. Am I doing something wrong? – BBSysDyn Jan 21 '13 at 09:59
  • @user6786, and you noticed that the matrix you got is singular, no? But the singular values are tiny, but not zero... – J. M. ain't a mathematician Mar 23 '13 at 12:26
  • Note that in practice, the columns of $W$ and $U$ (the principal components via the eigendecomposition versus the singular value decomposition) may differ from each other by a factor of $-1$. – Ahmed Fasih Jan 16 '14 at 14:11
  • @J.M. - It is a bit unclear whether the data matrix consists of row vectors or column vectors; it might be good to mention this so there is no misunderstanding. – Reed Richards Jan 21 '14 at 20:17
  • @J. M., for completeness, what is the mathematical relation between the `W` matrix defined in your PCA explanation and the `U` matrix defined in your SVD explanation? – Zhubarb Sep 04 '14 at 10:39
  • This was a little confusing in that normally the data matrix has $n$ rows of samples with $d$ dimensions along the columns, like a least-squares design matrix. If that is the case, then the covariance is proportional to $X^TX$, and the SVD route gives $X^TX = V\Sigma^2 V^T$. I was also confused by the lack of normalization initially. But altogether a pretty clear explanation. – Robotbugs Mar 03 '15 at 22:43
  • Maybe we should add that the columns of $U$ are also the eigenvectors (of the covariance matrix). – SmallChess Nov 05 '15 at 10:49
  • @J.M. Please provide some good reference for this statement: "forming the covariance matrix to begin with, since the formation of $\mathbf X\mathbf X^\top$ can cause loss of precision". – sv_jan5 Mar 08 '16 at 11:59
  • @J.M. You guessed it right. I found its reference in Golub/Van Loan on pg. 239. Thanks for help! – sv_jan5 Mar 09 '16 at 08:35
  • @sera, "In this settings, you assume that X rows are the features and columns the samples. right?" - yes. Note that the Läuchli example is actually a "wide" matrix and not a "tall" one due to the transposition. – J. M. ain't a mathematician Mar 02 '18 at 23:59
  • @Yoni, please be careful when editing formulae, so that the sentences surrounding it are still sensible, instead of mindlessly replacing all instances of an expression with search and replace. – J. M. ain't a mathematician Feb 27 '20 at 13:14
  • Since there is a lot of discussion here in the comments I asked a question about this post separately [here](https://math.stackexchange.com/q/3561160/545914) and would like to draw your attention to it. – Ramanujan Feb 28 '20 at 10:41

*A Tutorial on Principal Component Analysis* by Jonathon Shlens is a good introduction to PCA and its relation to the SVD; see specifically Section VI, "A More General Solution Using SVD".

Seanny123
hellectronic

The question boils down to whether you want to subtract the means and divide by the standard deviation first. The same question arises in the context of linear and logistic regression, so I'll reason by analogy.

In many problems our features are positive values such as counts of words or pixel intensities. Typically a higher count or a higher pixel intensity means that a feature is more useful for classification/regression. If you subtract the means, then you force features whose original value was zero to take a negative value of large magnitude. This makes feature values that are unimportant to the classification problem (previously zero-valued) as influential as the most important feature values (the ones with high counts or pixel intensities).

The same reasoning holds for PCA. If your features are least sensitive (informative) towards the mean of the distribution, then it makes sense to subtract the mean. If the features are most sensitive towards the high values, then subtracting the mean does not make sense.

SVD does not subtract the means; applied to uncentered data, its first component typically points towards the mean of all the data points, so the SVD takes care of the global structure first.
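As a rough illustration of that last point, here is a minimal NumPy sketch on made-up data: without centering, the leading singular direction is dominated by the offset of the point cloud from the origin; after centering, it describes the spread around the mean instead.

```python
import numpy as np

rng = np.random.default_rng(1)
# 500 samples as rows: most spread along the 2nd axis, but a large offset along the 3rd.
X = rng.normal(size=(500, 3)) * np.array([1.0, 3.0, 1.0]) + np.array([0.0, 0.0, 50.0])

mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))

# Uncentered SVD: the first right singular vector points at the mean (global structure).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
print(abs(Vt[0] @ mean_dir))              # close to 1

# Centered SVD (i.e. PCA): the first component is the direction of largest spread instead.
Xc = X - X.mean(axis=0)
_, _, Vtc = np.linalg.svd(Xc, full_matrices=False)
print(abs(Vtc[0] @ mean_dir))             # close to 0
```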

Stefan Savev
  • I think this answer may be a bit misleading. The fact that zero-valued numbers will be mapped to negative numbers of large magnitude after subtracting means doesn't mean that their influence on a statistical model is increased. Deviation from the mean *is* the information used by many (perhaps most?) statistical models to fit curves, sort items, etc. If you are concerned about a feature with a long distribution tail (e.g. counts), then there are ways of transforming that data (e.g. add 1 and take the log) so it plays nice with models based on symmetric distributions. – turtlemonvh Jan 27 '16 at 04:50

There is a way to do an SVD on a sparse matrix that treats missing features as missing (using gradient search). I don't know any way to do PCA on a sparse matrix except by treating missing features as zero.
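As a rough sketch of the gradient-search idea (in the spirit of matrix-completion / Funk-style SVD, not any particular library; all names and parameters below are illustrative), a low-rank factorization can be fitted only on the observed entries, so missing values are genuinely treated as missing rather than as zeros:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 20, 15, 3
truth = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))   # underlying low-rank matrix
mask = rng.random((m, n)) < 0.5                             # True where an entry is observed

# Rank-k factors, fitted by plain gradient descent on the observed entries only;
# unobserved entries never enter the loss, so they are not assumed to be zero.
U = 0.1 * rng.normal(size=(m, k))
V = 0.1 * rng.normal(size=(n, k))
lr = 0.02
for _ in range(5000):
    err = mask * (truth - U @ V.T)                  # residual on observed entries only
    U, V = U + lr * err @ V, V + lr * err.T @ U     # simultaneous gradient steps

print(np.max(np.abs(mask * (truth - U @ V.T))))     # residual on observed entries shrinks toward 0
```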

Phil Goetz