Consider running gradient descent (GD) on the following optimization problem:

$$\arg\min_{\mathbf x \in \mathbb R^n} \| A\mathbf x-\mathbf b \|_2^2$$

where $\mathbf b$ lies in the column space of $A$, and the columns of $A$ are not linearly independent. Is it true that GD finds a solution with minimum norm? I saw some articles (e.g., arXiv:1705.09280) that indicated so, but I couldn't find a proof despite searching the internet for a while.

Can someone confirm or refute it? And if it's true, a proof or a reference to the proof would be much appreciated!

**EDITS 2019/11/27:**

Thanks to littleO's answer, apparently the answer to this question is *no* in general. However, I'm still curious about the following:

**Follow-up Question:** Are there some constraints under which the answer is yes? Is it true that, as Clement C. suggested, if we initialize $\mathbf x$ in the range of $A^\top$, then GD finds the minimum-norm solution? Is this a sufficient condition or is it also necessary?

It appears to me that the answer is yes, *if and only if* we initialize $\mathbf x$ in the range of $A^\top$.

I'll list my arguments below and would appreciate it if someone would confirm it or point out where I'm mistaken.

**My arguments:** Let $f(\mathbf x)= \| A\mathbf x-\mathbf b \|_2^2$. Then $\nabla_{\mathbf x}f(\mathbf x) = 2A^\top(A\mathbf x - \mathbf b)$, and GD iterates as follows: $\mathbf x^{(t+1)}=\mathbf x^{(t)}-\eta \nabla_{\mathbf x}f(\mathbf x^{(t)})$. Note that every GD update $-\eta \nabla_{\mathbf x}f(\mathbf x^{(t)})$ lies in the range of $A^\top$, so by induction we may write $\mathbf x^{(t)}=\mathbf x^{(0)}+A^\top \mathbf u^{(t)}$ for some vector $\mathbf u^{(t)}$.
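As a quick numerical sanity check of this observation (using NumPy, with a small made-up $A$ whose columns are dependent), the gradient at any point is orthogonal to $\mathrm{null}(A)$, i.e. it lies in $\mathrm{range}(A^\top)$:

```python
import numpy as np

# Made-up example: 2x3 matrix of rank 2, so its columns are linearly dependent.
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])          # b is in col(A) since A has full row rank

n = np.array([-1.0, -1.0, 1.0])   # a basis vector for null(A): A @ n == 0

x = np.array([0.3, -0.7, 2.0])    # an arbitrary iterate
grad = 2 * A.T @ (A @ x - b)

# grad is in range(A^T) = null(A)^perp, hence orthogonal to n
print(abs(grad @ n))              # ~0 up to floating point
```

Since each update is a scalar multiple of such a gradient, the component of $\mathbf x^{(t)}$ along $\mathrm{null}(A)$ never changes.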

Sufficiency: Suppose $\mathbf x^{(0)}$ is also in the range of $A^\top$, i.e. $\mathbf x^{(0)}=A^\top \mathbf v$. Then $\mathbf x^{(t)}=A^\top (\mathbf v+\mathbf u^{(t)})$. Since $f(\mathbf x)$ is convex, GD converges to a global minimum (with value $0$, since $\mathbf b$ lies in the column space of $A$) if the step size is small enough. Denote the limit by $\mathbf x^{(t)} \to \mathbf x^* = A^\top \mathbf u^*$. Hence $A\mathbf x^*-\mathbf b=AA^\top \mathbf u^*-\mathbf b=\mathbf 0$, so $\mathbf u^*=(AA^\top)^{-1}\mathbf b$ (assuming $A$ has full row rank, so that $AA^\top$ is invertible), and $\mathbf x^*=A^\top (AA^\top)^{-1}\mathbf b$, which is the well-known minimum-norm solution. (If $A$ does not have full row rank, we can delete the redundant rows, which does not change the solution set.)
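The sufficiency claim is easy to check numerically. A minimal NumPy sketch (same made-up $A$ as the setup describes: full row rank, dependent columns; $\mathbf x^{(0)}=\mathbf 0$, which trivially lies in $\mathrm{range}(A^\top)$):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0]])   # full row rank, linearly dependent columns
b = np.array([1.0, 1.0])

x = np.zeros(3)                   # x0 = 0 is in range(A^T)
eta = 0.05                        # small enough: eta < 1/lambda_max(A^T A)
for _ in range(20000):
    x -= eta * 2 * A.T @ (A @ x - b)

# Minimum-norm solution A^T (A A^T)^{-1} b
x_min = A.T @ np.linalg.solve(A @ A.T, b)

print(np.allclose(x, x_min))      # True
print(np.linalg.norm(A @ x - b))  # residual ~0, since b is in col(A)
```

The step size and iteration count here are ad hoc choices for this particular $A$; any $\eta$ below $1/\lambda_{\max}(A^\top A)$ would do.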

Necessity: Now suppose $\mathbf x^{(0)} \notin \mathrm{range}(A^\top)$, and $\mathbf x^{(t)} \to \mathbf x^*$. As above, we necessarily have $\mathbf x^* = \mathbf x^{(0)} + A^\top \mathbf u^*$ for some $\mathbf u^*$. Decompose $\mathbf x^{(0)} = A^\top \mathbf v + \mathbf w$ with $\mathbf w \in \mathrm{null}(A) = \mathrm{range}(A^\top)^\perp$ and $\mathbf w \neq \mathbf 0$. Then $\mathbf x^* = A^\top(\mathbf v + \mathbf u^*) + \mathbf w \notin \mathrm{range}(A^\top)$, so it cannot possibly be the (unique) minimum-norm solution $A^\top (AA^\top)^{-1}\mathbf b$, which does lie in $\mathrm{range}(A^\top)$.
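The necessity argument can also be illustrated numerically: if $\mathbf x^{(0)}$ has a nonzero null-space component $\mathbf w$, GD carries $\mathbf w$ along unchanged, and the limit has strictly larger norm than the minimum-norm solution. A sketch with the same made-up $A$ as before:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])

w = 0.5 * np.array([-1.0, -1.0, 1.0])   # nonzero vector in null(A)
x = w.copy()                             # x0 = w is NOT in range(A^T)

eta = 0.05
for _ in range(20000):
    x -= eta * 2 * A.T @ (A @ x - b)

x_min = A.T @ np.linalg.solve(A @ A.T, b)  # minimum-norm solution

# The null-space component w of x0 survives in the limit unchanged:
print(np.allclose(x, x_min + w))                   # True
print(np.linalg.norm(x) > np.linalg.norm(x_min))   # True
```

By orthogonality, $\|\mathbf x^*\|_2^2 = \|\mathbf x_{\min}\|_2^2 + \|\mathbf w\|_2^2$, which is exactly why the limit's norm exceeds the minimum.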