$\newcommand{\prox}{\operatorname{prox}}$ $\newcommand{\argmin}{\operatorname{argmin}}$

Suppose that we want to solve the following convex optimization problem:

$\min_{x \in \mathbb{R}^n} g(x) + h(x)$

where we assumed that $g(x)$ is convex and differentiable, $h(x)$ is convex (here I am trying to be as non-specific as possible). Then recall that **the generalized gradient descent** can be formulated as follows:

Step $0$: choose initial $x^{0} \in \mathbb{R}^n$

Step $k: k \ge 1$: $x^{(k)} = \prox_h (x^{(k-1)} - t_k \nabla g(x^{(k-1)}), t_k)$

where $\prox$ is a proximal operator defined as $\prox_h(y,t) := \argmin \limits_{x \in \mathbb{R}^n} h(x) + \frac{1}{2t} \|y-x\|^2$

It is known that if $\nabla g(x)$ is Lipschitz continuous and proximal operator can be evaluated, the convergence rate of will be $O(1/k)$. This result can be **accelerated** to achieve $O(1/k^2)$.

First time proposed by Nesterov in 1983 for smooth functions, the idea of acceleration still remains an active topic of research (for non-smooth, composite functions, etc.). It is not easy to read Nesterov's works (very mathematical), but in order to get an understanding of the concept it is sufficient to look at ISTA (Iterative Tthresholding Algorithms) and FISTA (Fast Iterative Thresholding Algorithms). In particular, my questions below will be based on FISTA's example:

Roughly speaking, acceleration is achieved by introducing one more sequence of numbers $y_k$ constructed as a specific linear combination of $x_k$ and $x_{k-1}$; proximal function operates then on $y_{k}$ instead of $x_k$. In case of FISTA we have:

$t_{k+1} = \frac{1 + \sqrt{1 + 4t^2_k}}{2}$

$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$

Note, that the sequence $t_k$ satisfies $t^2_k = t^2_{k+1} - t_{k+1}$; this is justified in the proof of the convergence for this algorithm.

Is there any intuitive way to explain, interpret such an approach? Why such a specific combination works and brings such a perceptible improvement to the convergence rate? Can we find an intuitive way to interpret $t$? Probably somebody is actually familiar with Nesterov's works and have more knowledge that I do about some other reasons why $t_k$ is given in this form at first place?