For a convex function $h$, its proximal operator is defined as: $$\operatorname{prox}_h(x)=\arg\min_u \Big(h(u)+\frac{1}{2}\|u-x\|^2\Big)$$ Can anyone provide an intuitive explanation/motivation of the proximal mapping?

Not an answer, but... If you let $g(x) = \frac{1}{2} x^2$, then the function $(h \square g)(x) = \inf_{u} \big(h(u)+g(x-u)\big)$ is called the infimal convolution and corresponds to the function you get from the set $\operatorname{epi} h + \operatorname{epi} g$. So there is a nice geometrical interpretation as a 'smoothing' of sorts. Rockafellar's classic Convex Analysis is a good read & reference; it predates the current popularization of convex analysis. – copper.hat Jul 08 '12 at 10:01

Oops, I should have been more explicit: $\operatorname{epi}(h \square g) = \operatorname{epi} h + \operatorname{epi} g$. The epigraph defines the function. – copper.hat Jul 08 '12 at 16:10

Thank you very much. I have ordered one copy. – mining Jul 08 '12 at 23:59

I also don't have a definite answer, but one way to think of it is that you want a point that minimizes $h(u)$ but is also close to $x$ (the argument of prox). There is usually also a constant that dictates the tradeoff between these two components of the objective. The prox operator is very important in convex analysis and its properties are well studied. Take a look at [this monograph](http://www.stanford.edu/~boyd/papers/pdf/prox_algs.pdf) by Parikh and Boyd for more details. – rnegrinho Jun 12 '14 at 11:42
3 Answers
Here are two interpretations which can lend useful intuition in some circumstances:
Prox is a generalized projection
Consider the indicator of some nonempty convex set $X$: $$i_X(x) = \begin{cases}0 & \text{if } x \in X \\ \infty & \text{if } x \not \in X\end{cases}.$$ Then we find that the proximal mapping is the projection onto $X$: $$\begin{align}\text{prox}_{\alpha i_X}(y) &= \underset{x}{\text{argmin}}\Big(\alpha i_X(x)+\frac{1}{2}\|x - y\|^2\Big) \\ &= \underset{x \in X}{\text{argmin}}\ \frac{1}{2}\|x - y\|^2 \\ &= \text{proj}_X(y).\end{align}$$
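This can be made concrete with a small sketch (the box constraint and numbers are made up): for $X = [-1,1]^n$, the prox of $\alpha i_X$ is coordinatewise clipping, for any $\alpha > 0$.

```python
import numpy as np

# Toy check with a made-up box X = [-1, 1]^n: the prox of alpha * i_X
# is Euclidean projection onto X (coordinatewise clipping here),
# independent of the scaling alpha.
def proj_box(y, lo=-1.0, hi=1.0):
    return np.clip(y, lo, hi)

print(proj_box(np.array([2.0, -3.0, 0.5])))  # clips to [1, -1, 0.5]
```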
Prox is backward Euler
Consider the backward Euler update: $x^{k+1} = x^k - \alpha \nabla f(x^{k+1})$ for some differentiable $f$. This update can be viewed as a proximal point update $x^{k+1} = \text{prox}_{\alpha f}(x^k)$, since for differentiable convex functions the optimality condition is $0 = \nabla \big(\alpha f(x^\star) + \frac{1}{2}\|x^\star - x^k\|^2\big) = \alpha \nabla f(x^\star) + x^\star - x^k$.
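A quick numerical sanity check on a toy quadratic (numbers made up): for $f(x) = \frac{1}{2}x^2$ the implicit backward Euler step has the closed form $x/(1+\alpha)$, and it should match a brute-force evaluation of the prox.

```python
import numpy as np

# Toy check: for f(x) = 0.5*x**2, backward Euler x_next = x - a * f'(x_next)
# solves to x / (1 + a), which should equal prox_{a f}(x), found here by
# grid search over the prox objective.
a, x = 0.5, 2.0
backward_euler = x / (1 + a)            # closed-form implicit step

u = np.linspace(-3.0, 3.0, 60001)       # grid with spacing 1e-4
prox = u[np.argmin(a * 0.5 * u**2 + 0.5 * (u - x)**2)]

print(backward_euler, prox)             # both approximately 1.3333
```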
If $h$ is a closed convex function, then the proximal operator of $h$ (with parameter $t > 0$) is defined by $$ \text{prox}_{th}(\hat x) = \arg \min_x \, h(x) + \frac{1}{2t} \| x - \hat x \|_2^2. $$ When we evaluate the prox-operator of $h$, we are attempting to reduce the value of $h$, but we are penalized for straying too far from $\hat x$. (The parameter $t$ is like a "step size" that controls how much we are penalized for moving away from $\hat x$.)
You can see that evaluating a prox-operator might be a useful substep in an optimization algorithm. For example, suppose that we want to solve the optimization problem $$ \text{minimize} \quad g(x) + h(x) $$ where $g$ and $h$ are convex, $g$ is differentiable, and $h$ is "simple" in the sense that its prox-operator can be evaluated efficiently. (Importantly, $h$ is not required to be differentiable.) A natural strategy is to first reduce the value of $g$ by taking a step in the negative gradient direction, then reduce the value of $h$ by applying the prox-operator of $h$, and repeat. This strategy yields the following iteration: $$ x^{k+1} = \text{prox}_{th}(x^k - t \nabla g(x^k)). $$ (This iteration is known as the proximal gradient method. You could implement it right now in Python to solve famous optimization problems such as the Lasso problem from statistics.)
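A minimal sketch of that iteration on a made-up Lasso instance (problem data and all numbers are invented for illustration); here $h = \lambda \|x\|_1$, whose prox is the well-known soft-thresholding operator:

```python
import numpy as np

# Proximal gradient method (ISTA) for the Lasso:
# minimize g(x) + h(x), g(x) = 0.5*||A x - b||^2, h(x) = lam*||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [1.0, -2.0, 3.0]            # sparse ground truth (made up)
b = A @ x_true
lam = 0.1
t = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/L with L = ||A||_2^2

def soft_threshold(v, thr):              # prox of thr*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

x = np.zeros(20)
for _ in range(2000):
    grad = A.T @ (A @ x - b)             # gradient step on g ...
    x = soft_threshold(x - t * grad, t * lam)  # ... then prox step on h

print(np.round(x[:4], 2))                # close to [1., -2., 3., 0.]
```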
Here is an alternative way to derive the proximal gradient method and, in the process, discover the proximal operator. Suppose again that we want to minimize $g(x) + h(x)$, with $g$ and $h$ as given above. A natural strategy is to replace $g$ with a local linear approximation to $g$ (this is the fundamental strategy of calculus), and also to add a penalty term to the objective function which penalizes us from straying too far from the current iterate (this ensures that the local linear approximation to $g$ remains accurate). This strategy yields the following iteration: $$ x^{k+1} = \arg \min_x \quad h(x) + \underbrace{g(x^k) + \nabla g(x^k)^T(x - x^k)}_{\approx \, g(x)} + \frac{1}{2t} \| x - x^k \|_2^2. $$ Now if we combine the two terms on the right by completing the square, and then omit terms that do not depend on $x$, we find that \begin{align} x^{k+1} &= \arg \min_x \, h(x) + \frac{1}{2t} \| x - (x^k - t\nabla g(x^k)) \|_2^2 \\ &= \text{prox}_{th}(x^k - t\nabla g(x^k)). \end{align} This is the proximal gradient iteration. We see that the proximal operator of $h$ has appeared out of thin air. To me this seems like the most likely way that someone would discover the idea of a proximal operator.
Another motivation for the proximal operator comes from the viewpoint of monotone operator theory. Minimizing a closed convex function $h$ is equivalent to finding a point $x$ such that $$ \tag{1} 0 \in \partial h(x). $$ Here $\partial h(x)$ is the subdifferential of $h$ at $x$. The set-valued mapping $x \mapsto \partial h(x)$ is the best example of a "monotone operator", and problem (1) is called a "monotone inclusion problem". Multiplying both sides by $t$ and then adding $x$ to both sides, we find that \begin{align} & 0 \in \partial h(x) \\ \iff & x \in (I + t \partial h)(x) \\ \iff & x \in (I + t \partial h)^{-1}(x). \end{align} The operator $(I + t \partial h)^{-1}$ is called the "resolvent" of the operator $\partial h$, and it has nice properties such as the following: if $x \in \mathbb R^n$, then $(I + t \partial h)^{-1}(x)$ is a singleton. Thus, $(I + t \partial h)^{-1}$ can be viewed as a function (rather than a set-valued mapping), and solving $0 \in \partial h(x)$ is equivalent to solving
$$ x = (I + t \partial h)^{-1}(x). $$ A natural idea is to solve for $x$ using fixed-point iteration. The resulting iteration is $$ x^{k+1} = (I + t \partial h)^{-1}(x^k). $$
Punch line: The function $(I + t \partial h)^{-1}$ is none other than the proximal operator of $h$, and this iteration is known as the "proximal point algorithm".
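A concrete one-dimensional sketch (toy example, not part of the derivation): for $h(x) = |x|$ the resolvent works out to soft-thresholding, and iterating it reaches the minimizer in finitely many steps.

```python
# Sketch: for h(x) = |x|, the resolvent (I + t*dh)^(-1) is soft-thresholding,
# a single-valued map even though the subdifferential dh(0) = [-1, 1]
# is set-valued.
def resolvent_abs(x, t):
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0  # the whole interval [-t, t] is sent to the minimizer 0

# Proximal point algorithm: iterate the resolvent from a made-up start.
x, t = 2.3, 1.0
for _ in range(5):
    x = resolvent_abs(x, t)
print(x)  # 0.0 -- the minimizer of |x|, reached after three iterations
```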

In the second line after the second $0 \in \partial h(x)$: shouldn't it be $x = (I+t\partial h)^{-1}(x)$ instead of $x \in (I+t\partial h)^{-1}(x)$? – Broken_Window Feb 04 '19 at 21:07

@Broken_Window My thought process: at that point in my exposition, it wasn't clear that $(I + t \partial h)^{-1}$ can be viewed as a function rather than as a set-valued mapping (because usually the inverse of a set-valued mapping is a set-valued mapping). So I wrote $x \in (I + t \partial h)^{-1}(x)$. But then we note that one of the nice properties of this resolvent operator that we have stumbled upon is that $(I + t \partial h)^{-1}(x)$ is a singleton. Thus, at that point, we rewrite the previous equation as $x = (I + t \partial h)^{-1}(x)$. – littleO Feb 04 '19 at 22:18

Thanks for the explanation; I got confused because some books make the expression $(I+t\partial h)^{-1}x$ equal to the proximal map of $h(x)$. – Broken_Window Feb 06 '19 at 22:44
There are many good answers here already, but for the sake of completeness I will add one more intuition. The proximal operator of $g$ (with parameter $\lambda$) can also be seen as a gradient step (with stepsize $\lambda$) with respect to the Moreau envelope $g_\lambda$ of $g$. The Moreau envelope is a smooth under-approximation of the original function and is given by $$ g_\lambda(x) = \min_u \Big\{g(u) + \frac{1}{2 \lambda} \| u-x \|^2\Big\}. $$ Evidently its definition is closely related to the proximal operator. The above-mentioned identity then reads as $$ \operatorname{prox}_{\lambda g}(x) = x - \lambda \nabla g_\lambda(x). $$
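This identity is easy to check numerically in one dimension (a sketch for $g = |\cdot|$, whose Moreau envelope is the Huber function):

```python
import numpy as np

# Check prox_{lam g}(x) = x - lam * grad g_lam(x) for g = |.|:
# the Moreau envelope of |x| is the Huber function, with gradient
# x/lam for |x| <= lam and sign(x) otherwise; the prox of lam*|.|
# is soft-thresholding.
lam = 0.7
x = np.linspace(-3.0, 3.0, 13)

grad_env = np.where(np.abs(x) <= lam, x / lam, np.sign(x))
gradient_step = x - lam * grad_env                    # x - lam * grad g_lam(x)
prox = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)  # prox_{lam g}(x)

print(np.allclose(gradient_step, prox))  # True
```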