I'm a software engineer trying to learn linear algebra and feel like I'm having a hard time following matrix computations.

For example, this is a part of the least squares method for a linear model:

$$\sum\limits_{i=1}^n \|\boldsymbol\theta^T\mathbf x_i-y_i\|^2=(X\boldsymbol\theta-\mathbf y)^T(X\boldsymbol\theta-\mathbf y).$$

How do we jump from the first line, where there's a lot going on (the sum over $i=1$ to $n$, the norm squared, $x_i$, $y_i$, and so on), to the second line, where everything is wrapped up neatly in that matrix expression with the transpose?

I know I can arrive at the second line if I carefully write things down and play with concrete matrices, but I'm very slow at this.

Is there any other way to reason about it, or visualize it? How do mathematicians tackle this kind of thing? Or does everyone kind of struggle with it privately too?

    For a vector $x=\begin{pmatrix}x_1\\\vdots\\x_n\end{pmatrix}$, you have that $\|x\|=\sqrt{x_1^2+...+x_n^2}$. Compute $x^Tx$, which is the sum of the products of the corresponding terms, i.e. $x_1x_1+...+x_nx_n$. –  Jul 21 '18 at 09:46
    One of the tricks with reading (and writing) mathematics is that not every step of a long calculation needs to be written down, as long as a mathematically educated reader can reconstruct the intermediate steps of the calculation. That doesn't mean the reconstruction should immediately **pop** into your mind, sometimes it will take some work. For a writer this can be a difficult balance: some readers will complain you've put in too many boring details; others will complain you've left out too many steps. – Lee Mosher Jul 21 '18 at 13:52

5 Answers


This is a good question which you have already answered for yourself.

Mathematicians do this a lot:

carefully write down, try playing with concrete matrices

After a while (a long while sometimes) you see how it works. Then you can parse similar expressions more quickly.

There really is no shortcut.

Ethan Bolker
  • Thank you. It's a huge relief knowing that there's nothing wrong with me being slow with this kind of thing. – aunnnn Jul 21 '18 at 09:56
    It really is an arduous process. You only really get used to it by doing it a lot. – Sambo Jul 21 '18 at 12:28

Let's visualize it.

We have the scalar expression: $$\boldsymbol\theta^T \mathbf x_i - y_i = \begin{bmatrix}&&\boldsymbol\theta^T&&\end{bmatrix}\begin{bmatrix} \\ \\ \mathbf x_i \\ \\ \\ \end{bmatrix} - y_i \tag 1 $$ Transpose this expression to get the same scalar: $$(\boldsymbol\theta^T \mathbf x_i - y_i)^T = \mathbf x_i^T \boldsymbol\theta - y_i = \begin{bmatrix} & & \mathbf x_i^T & & \end{bmatrix}\begin{bmatrix}\\\\\boldsymbol\theta\\\\\\\end{bmatrix} - y_i \tag 2 $$ Extend into matrices and vectors: $$\begin{bmatrix}\\X \boldsymbol\theta - \mathbf y \\\\\end{bmatrix} = \begin{bmatrix} & & \mathbf x_1^T & & \\ &&\vdots\\ & & \mathbf x_n^T & & \\\end{bmatrix} \begin{bmatrix}\\\\\boldsymbol\theta\\\\\\\end{bmatrix} - \begin{bmatrix}\\\mathbf y\\\\\end{bmatrix} \tag 3 $$ The definition of the norm says: $$\sum |a_i|^2 = \|\mathbf a\|^2 = \mathbf a^T \mathbf a \tag 4$$ Substitute our expression in the norm: $$\sum_{i=1}^n |\boldsymbol\theta^T \mathbf x_i - y_i|^2 = \sum_{i=1}^n |(X \boldsymbol\theta - \mathbf y)_i|^2 = \|X \boldsymbol\theta - \mathbf y\|^2 = (X \boldsymbol\theta - \mathbf y)^T (X \boldsymbol\theta - \mathbf y) \tag 5$$
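If it helps to see steps $(3)$ and $(5)$ numerically, here's a small numpy sketch (the sizes and data are arbitrary, chosen just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3                      # 5 samples, 3 features (illustrative sizes)
X = rng.standard_normal((n, k))  # rows of X are the x_i^T
theta = rng.standard_normal(k)
y = rng.standard_normal(n)

# Step (3): stacking the scalars theta^T x_i - y_i gives the vector X theta - y
residuals = np.array([theta @ X[i] - y[i] for i in range(n)])
assert np.allclose(residuals, X @ theta - y)

# Step (5): the sum of squares equals the quadratic form
lhs = sum((theta @ X[i] - y[i]) ** 2 for i in range(n))
rhs = (X @ theta - y) @ (X @ theta - y)
assert np.isclose(lhs, rhs)
```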

Klaas van Aarsen
  • 5,652
  • 1
  • 11
  • 23

The first sum $$\sum_{i=1}^n\|\theta^Tx_i-y_i\|^2$$ is nothing other than the expanded form of the norm squared of some vector (which in this case can be regarded as a distance, but I want to keep things simple). The norm squared can be expressed as the scalar product of that vector with itself. Let me explain: take a vector $\mathbf{v}\in\mathbb{R}^3$ defined as $$\mathbf{v} = (x,y,z).$$ From the Pythagorean theorem, we know that the length squared of this vector is $$\|\mathbf{v}\|^2 = x^2+y^2+z^2.$$ Mathematically speaking, the length and the norm are the same thing (the norm is more general). We can define the norm of a vector via the scalar product in this manner: $$\|\mathbf{v}\|^2=\mathbf{v}\cdot \mathbf{v} = \mathbf{v}^T\mathbf{v}= \left(\begin{matrix}x&y&z\end{matrix}\right)\left(\begin{matrix}x\\y\\z\end{matrix}\right) = x^2+y^2+z^2 = \sum_{i=1}^3 x_i^2,$$ which follows from basic matrix multiplication, and where I've written $(x,y,z)=(x_1,x_2,x_3)$.

So back to our example $$\sum_{i=1}^n\|\theta^Tx_i-y_i\|^2.$$ In this case the vector we're taking the norm of has $i$-th component $v_i = \theta^Tx_i-y_i$, which is the (component by component) difference of the two vectors $X\boldsymbol\theta$ and $\mathbf{y}$, where $X$ is the matrix whose $i$-th row is $x_i^T$. So the vector itself is $$\mathbf{v} = X\boldsymbol\theta-\mathbf{y}.$$ Now, knowing what I told you earlier, the norm squared of this vector is $$\mathbf{v}^T \mathbf{v}= (X\boldsymbol\theta-\mathbf{y})^T(X\boldsymbol\theta-\mathbf{y}).$$
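As a quick sanity check of $\|\mathbf v\|^2=\mathbf v^T\mathbf v$, here's a tiny numpy sketch (the numbers are arbitrary):

```python
import numpy as np

# A concrete 3-vector; ||v||^2 should equal v^T v = 3^2 + 4^2 + 12^2 = 169
v = np.array([3.0, 4.0, 12.0])
assert np.isclose(np.linalg.norm(v) ** 2, v @ v)
assert np.isclose(v @ v, 169.0)
```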

  • Thank you, this detailed explanation really helps. – aunnnn Jul 21 '18 at 10:02
    You're welcome! As Ethan Bolker said, don't be discouraged, as with all mathematics, grasping new topics is difficult for everyone! The only way you can master them is by trial and error! Play with it and you'll be very surprised when it all comes nicely together – Quiver Jul 21 '18 at 10:07
  • Is $\theta$ a scalar? If so it is confusing to write $\theta^T$ and $||.||^2$ in the sum! What is $X$ in the OP's question, a matrix? – Peter Melech Jul 21 '18 at 10:27
  • 1
    $\theta$ is a scalar; I'm thinking of some abuse of notation for the transpose, but I like Serena gave some explanation. As to why there's the norm squared in the sum, that's because in general $$\| \mathbf{v}\|^2 = \sum_{i=1}^n|v_i|^2$$ – Quiver Jul 21 '18 at 10:38
  • See I like Serena's answer: $\theta$ and $x_i$ are vectors in $\mathbb{R}^n$ and $y_i$ are scalars, and $X\in\mathbb{R}^{n\times n}$, a matrix – Peter Melech Jul 21 '18 at 11:05

Most mathematicians would make the jump from the first line to the second line instantly, not because they are smart, but because they have seen that sort of thing before. The first couple of times you encounter $\|\mathbf{v}\|^2 = \mathbf{v}^T\mathbf{v}$ you have to puzzle through it. Sooner rather than later you simply recognize it.

You can doubtless think of similar examples in your software background. When you are learning a new programming language it might all seem mysterious. You might need to spend a lot of time understanding what a single line of code does, so understanding a large program in that language seems out of reach. But, as you become fluent in that language you start to pick up on common idioms. Rather than puzzling over every single line you develop an ability to stand back and instantly see what an entire block of code does. Similarly, the more math you do the more you develop the ability to stand back and see the overall flow of a proof or computation. It never becomes as easy as reading a novel, but it ceases to be an endless stream of new enigmas.

You can also use your ability to program as a tool to help you assimilate the math. The equality of those two lines is a computational fact. You can write two functions, one that does the computation on the first line and one that does the computation on the second line, then verify that they give the same output (up to floating point error) for various inputs. That wouldn't constitute a proof, but it would give you more insight into what is going on. For linear algebra, something like R, which has transpose and matrix multiplication built into the language itself, would be a good choice. Python with numpy is another good option for quickly translating mathematical ideas into code.
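A minimal sketch of that experiment in Python with numpy (the function names and sizes here are my own invention, just for illustration):

```python
import numpy as np

def sum_form(theta, X, y):
    """First line: explicit sum over i of (theta^T x_i - y_i)^2."""
    return sum((theta @ x_i - y_i) ** 2 for x_i, y_i in zip(X, y))

def matrix_form(theta, X, y):
    """Second line: (X theta - y)^T (X theta - y)."""
    r = X @ theta - y
    return r @ r

# Compare the two forms on a batch of random inputs of random sizes
rng = np.random.default_rng(42)
for _ in range(100):
    n, k = rng.integers(1, 10, size=2)
    X = rng.standard_normal((n, k))
    theta = rng.standard_normal(k)
    y = rng.standard_normal(n)
    assert np.isclose(sum_form(theta, X, y), matrix_form(theta, X, y))
```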

John Coleman

"How to follow matrix operations in proofs" - trying to follow the proof through with a small, concrete example has been suggested elsewhere.

But one thing I've always found helpful if things are getting hairy is to annotate the proof by writing the size of each matrix or vector underneath it.

For example $2 \times 3$ or just $n \times m$.

This is particularly useful in treatments of least-squares regression, where different matrices are different sizes: e.g. the vector of $y$ data is $n \times 1$, the design matrix (aka "model matrix") is $n \times k$, the vector of coefficients and the vector of coefficient estimators are $k \times 1$, the variance-covariance matrix of the estimators is $k \times k$... if you are trying to follow a proof of e.g. the Gauss-Markov theorem, then keep an eye on the sizes of the matrices and check that they are conformable at each stage.
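If you like to have a machine check conformability for you, here is a small numpy sketch of the shapes mentioned above (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 4                       # n observations, k coefficients (arbitrary)
X = rng.standard_normal((n, k))     # design matrix:      n x k
y = rng.standard_normal((n, 1))     # response vector:    n x 1
beta = rng.standard_normal((k, 1))  # coefficient vector: k x 1

r = X @ beta - y                    # (n x k)(k x 1) - (n x 1)  ->  n x 1
S = r.T @ r                         # (1 x n)(n x 1)            ->  1 x 1
V = np.linalg.inv(X.T @ X)          # (k x n)(n x k), inverted  ->  k x k

assert r.shape == (n, 1)
assert S.shape == (1, 1)
assert V.shape == (k, k)
```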
