Following some textbooks, the Lagrange multiplier theorem can be stated as follows.

Let $U \subset \mathbb{R}^n$ be an open set and let $f:U\rightarrow \mathbb{R}, g:U\rightarrow \mathbb{R}$ be $C^1$ functions. Let $\mathbf{x}_{0} \in U, c=g(\mathbf{x}_0)$, and let $S$ be the level set of $g$ with value $c$. Suppose that $\nabla g(\mathbf{x}_0) \neq 0$.
If the function $f|_S$ has a local extremum at $\mathbf{x}_0$, then there is a $\lambda \in \mathbb{R}$ such that $\nabla f(\mathbf{x}_0)=\lambda \nabla g(\mathbf{x}_0)$.

Some textbooks give a rigorous proof using the implicit function theorem (see proof). The proof is very clear; however, I can't build much intuition from it. Can someone help me prove this theorem in a rigorous but more intuitive way? Thanks.

  • In what perspective are you going to use this theorem? – Mambo Apr 27 '16 at 09:26
  • I would like to understand Lagrange multiplier theorem and KKT condition in a better way. I think it will help me to understand the optimization algorithms. – seaplusplus Apr 27 '16 at 10:04

3 Answers


Here's some intuition.

First suppose that $x_0$ is a local minimizer for the problem \begin{align} \tag{$\spadesuit$} \text{minimize} & \quad f(x) \\ \text{subject to} &\quad Ax = b. \end{align} If $Au = 0$, then the directional derivative $D_u f(x_0) = \langle \nabla f(x_0),u \rangle$ is nonnegative. Otherwise, it would be possible to improve $x_0$ by perturbing it a bit in the direction $u$.

It follows (by applying this to both $u$ and $-u$, since $A(-u) = 0$ as well) that if $Au = 0$, then $\langle \nabla f(x_0),u \rangle = 0$.

So, $\nabla f(x_0)$ is in the orthogonal complement of the null space of $A$.

BUT, according to the "four subspaces theorem" (which is beloved by Gilbert Strang), the orthogonal complement of the null space of $A$ is the range of $A^T$. Thus, $\nabla f(x_0)$ is in the range of $A^T$. In other words, there exists a vector $\lambda$ such that \begin{equation} \nabla f(x_0) = A^T \lambda. \end{equation} This is our Lagrange multiplier optimality condition, in the case where we have linear equality constraints.
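As a sanity check, here is a minimal numerical sketch of this optimality condition for a hypothetical quadratic instance (the specific $A$, $b$, and the objective $f(x) = \|x\|^2$ are illustrative choices, not from the answer):

```python
import numpy as np

# Hypothetical instance: minimize f(x) = ||x||^2 subject to Ax = b.
# The minimizer is the minimum-norm solution x0 = A^T (A A^T)^{-1} b,
# and the theory above says grad f(x0) = 2 x0 must lie in the range of A^T.
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0]])
b = np.array([6.0, 2.0])

x0 = A.T @ np.linalg.solve(A @ A.T, b)   # minimum-norm feasible point
grad_f = 2 * x0                          # gradient of ||x||^2 at x0

# Solve A^T lam = grad_f by least squares; a tiny residual means
# grad_f is (numerically) in the range of A^T, i.e. grad f(x0) = A^T lam.
lam, *_ = np.linalg.lstsq(A.T, grad_f, rcond=None)
residual = np.linalg.norm(A.T @ lam - grad_f)
print(residual)
```

The residual comes out at machine precision, confirming that the gradient lies in the row space of $A$.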

Next, suppose that $x_0$ is a local minimizer for the problem \begin{align} \text{minimize} & \quad f(x) \\ \text{subject to} &\quad g(x) = 0, \end{align} where $g:\mathbb R^n \to \mathbb R^m$.

It seems plausible that $x_0$ is also a local minimizer for the problem we get by replacing $g$ with the local linear approximation to $g$ at $x_0$: \begin{align} \text{minimize} & \quad f(x) \\ \text{subject to} &\quad g(x_0) + g'(x_0)(x - x_0) = 0. \end{align} (Here $g'(x_0)$ is the Jacobian matrix of $g$ at $x_0$.) Assuming that $x_0$ is indeed a local minimizer for this modified problem, it follows that $x_0$ satisfies the optimality condition derived above: \begin{align} \nabla f(x_0) &= g'(x_0)^T \lambda \\ &= \nabla g_1(x_0) \lambda_1 + \cdots + \nabla g_m(x_0) \lambda_m \end{align} for some vector $\lambda = \begin{bmatrix} \lambda_1 & \ldots & \lambda_m \end{bmatrix}^T \in \mathbb R^m$. (Here $g_1,\ldots,g_m$ are the component functions of $g$, and we use the convention that the gradient is a column vector.) This is our Lagrange multiplier optimality condition in the case of nonlinear equality constraints.
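For a concrete toy instance (my own choice, not from the answer): minimizing $f(x,y) = x + y$ over the unit circle $g(x,y) = x^2 + y^2 - 1 = 0$, the multiplier condition can be verified directly:

```python
import numpy as np

# Toy instance of the nonlinear case: minimize f(x, y) = x + y
# subject to g(x, y) = x^2 + y^2 - 1 = 0.  The minimizer is
# x0 = (-1/sqrt(2), -1/sqrt(2)), and grad f = lambda * grad g there.
x0 = np.array([-1.0, -1.0]) / np.sqrt(2)

grad_f = np.array([1.0, 1.0])   # gradient of x + y
grad_g = 2 * x0                 # gradient of x^2 + y^2 - 1

lam = grad_f[0] / grad_g[0]     # the multiplier, = -1/sqrt(2)
print(np.allclose(grad_f, lam * grad_g))  # True
```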

I believe it's possible to view the proof using the implicit function theorem as a rigorous version of this intuition.

Edit: Now I'll show how a similar approach can handle inequality constraints, if we replace the "four subspaces theorem" with Farkas' lemma (which I call the four cones theorem).

Let $x^*$ be a local minimizer for the problem \begin{align} \tag{$\heartsuit$} \text{minimize} & \quad f(x) \\ \text{subject to} & \quad Ax \leq b. \end{align} Suppose that $u \in \mathbb R^n$ and $Au \leq 0$. Then the directional derivative $D_uf(x^*) = \langle \nabla f(x^*),u \rangle \geq 0$. Otherwise, we could improve $x^*$ by perturbing it a bit in the direction of $u$.

At this point, we would like to invoke something like the four subspaces theorem, to find some optimality condition that $\nabla f(x^*)$ must satisfy. While the four subspaces theorem is inapplicable here, there is a one-sided version of the four subspaces theorem that fits this situation perfectly.

The four cones theorem

We first recall a few facts from convex analysis. If $C \subset \mathbb R^n$ is a cone, then the polar of $C$ is defined by \begin{equation} C^\circ = \{z \mid \langle x,z \rangle \leq 0 \,\, \forall \,\, x \in C \}. \end{equation} Note that $C^\circ$ is a closed convex cone.
A convex cone is a "one-sided" version of a subspace, and the polar cone is a "one-sided" version of the orthogonal complement. Instead of $\langle x,z \rangle = 0$, we have $\langle x, z \rangle \leq 0$. Here's some support for this analogy:

  • For a subspace $W$ of $\mathbb R^n$, we have the familiar equation $W = W^{\perp \perp}$. The one-sided version of this is that if $C$ is a closed convex cone, then $C = C^{\circ \circ}$.
  • If $W$ is a subspace of $\mathbb R^n$, then any $x \in \mathbb R^n$ can be decomposed as $x = \Pi_W(x) + \Pi_{W^\perp}(x)$. The one-sided version of this is that if $C \subset \mathbb R^n$ is a closed convex cone, then any $x \in \mathbb R^n$ can be decomposed as $x = \Pi_C(x) + \Pi_{C^\circ}(x)$. Moreover, $\Pi_C(x)$ is orthogonal to $\Pi_{C^\circ}(x)$.
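This decomposition (Moreau's decomposition) can be checked in the simplest case, taking $C$ to be the nonnegative orthant, whose polar is the nonpositive orthant (an illustrative choice of cone):

```python
import numpy as np

# Illustration of the "one-sided" decomposition with C = the nonnegative
# orthant, whose polar C° is the nonpositive orthant.  The projections
# are componentwise: Pi_C(x) = max(x, 0) and Pi_{C°}(x) = min(x, 0).
x = np.array([3.0, -1.5, 0.0, -2.0, 4.0])

p_C     = np.maximum(x, 0.0)   # projection onto C
p_polar = np.minimum(x, 0.0)   # projection onto C°

print(np.allclose(x, p_C + p_polar))  # True: x = Pi_C(x) + Pi_{C°}(x)
print(np.dot(p_C, p_polar))           # 0.0: the two pieces are orthogonal
```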

Let $\mathcal R_+(A^T) = \{ A^T z \mid z \geq 0 \}$. So $\mathcal R_+(A^T)$ is a "one-sided version" of the column space of $A^T$. I call it the column cone of $A^T$. Let $ \mathcal N_-(A) = \{x \mid Ax \leq 0 \}$. So $\mathcal N_-(A)$ is a "one-sided version" of the null space of $A$. I call it the null cone of $A$. Note that $\mathcal N_-(A)$ and $\mathcal R_+(A^T)$ are both closed convex cones.

The four subspaces theorem can be stated as \begin{equation} \mathcal N(A)^\perp = \mathcal R(A^T). \end{equation} Here is our "one-sided" version of the four subspaces theorem: \begin{equation} \mathcal N_-(A)^\circ = \mathcal R_+(A^T). \end{equation} I call this the "four cones theorem".

Invoking the four cones theorem

We can use the "four cones theorem" to find an optimality condition for problem ($\heartsuit$), just as we used the four subspaces theorem to find an optimality condition for problem ($\spadesuit$). As noted previously, if $A u \leq 0$, then $\langle \nabla f(x^*), u \rangle \geq 0$. In other words, $-\nabla f(x^*) \in \mathcal N_-(A)^\circ$. By the four cones theorem, it follows that $-\nabla f(x^*) \in \mathcal R_+(A^T)$. Thus, there exists a vector $z \geq 0$ such that \begin{equation} \nabla f(x^*) + A^T z = 0. \end{equation} This, together with $Ax^* \leq b$, is our optimality condition (KKT condition) for the problem ($\heartsuit$) with linear inequality constraints.
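A minimal numerical sketch of this KKT condition, using a hypothetical instance with $A = I$ (the data below are illustrative choices): minimize $\|x - c\|^2$ subject to $x \leq b$ componentwise.

```python
import numpy as np

# Toy instance with A = I: minimize ||x - c||^2 subject to x <= b.
# The solution is x* = min(c, b) componentwise, and the multiplier
# z = 2*max(c - b, 0) >= 0 satisfies grad f(x*) + A^T z = 0.
c = np.array([2.0, -1.0, 5.0])
b = np.array([1.0,  0.0, 6.0])

x_star = np.minimum(c, b)
z      = 2 * np.maximum(c - b, 0.0)

grad_f = 2 * (x_star - c)                      # gradient of ||x - c||^2
print(np.allclose(grad_f + z, 0))              # True: stationarity
print(np.all(z >= 0) and np.all(x_star <= b))  # True: z >= 0, x* feasible
```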

Connection with Farkas's lemma

The four cones theorem can be stated as a "theorem of the alternative". If $g \in \mathbb R^n$, then $g$ either does or does not belong to $\mathcal R_+(A^T)$. In other words, either $g$ belongs to $\mathcal R_+(A^T)$, or else $g$ does not belong to the polar of $\mathcal N_-(A)$. This means that one and only one of the following two alternatives is true:

  1. There exists $z \geq 0$ such that $g = A^T z$.
  2. There exists $u$ such that $Au \leq 0$ and $\langle g, u \rangle > 0$.

This is equivalent to the statement of Farkas's lemma given in Boyd and Vandenberghe. The four cones theorem is Farkas's lemma.
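To make the two alternatives concrete, here is a small numerical illustration (with $A = I$, an illustrative choice, so the cone $\mathcal R_+(A^T)$ is just the nonnegative orthant in $\mathbb R^2$):

```python
import numpy as np

# Farkas alternative, illustrated with A = I, so R_+(A^T) is the
# nonnegative orthant in R^2.
A = np.eye(2)

# Alternative 1 holds for g1 = (2, 3): it is a nonnegative combination
# of the columns of A^T, i.e. g1 = A^T z with z >= 0.
g1 = np.array([2.0, 3.0])
z = g1                       # works here because A^T = I
print(np.allclose(A.T @ z, g1) and np.all(z >= 0))  # True

# Alternative 2 holds for g2 = (-1, 2): the hyperplane with normal
# u = (-1, 0) separates g2 from the cone.
g2 = np.array([-1.0, 2.0])
u = np.array([-1.0, 0.0])
print(np.all(A @ u <= 0) and np.dot(g2, u) > 0)     # True
```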

Visualizing the four cones theorem

The "alternative" statement of the four cones theorem suggests a way of visualizing the four cones theorem that makes it seem obvious. The first alternative is that $g$ belongs to the convex cone generated by the columns of $A^T$. The second alternative is that there is a hyperplane through the origin that separates $g$ from the columns of $A^T$. (The vector $u$ is a normal vector to this separating hyperplane.)

Problem with linear equality and inequality constraints

This case is similar to the previous cases, except that we will use a different version of Farkas's lemma that involves a combination of equality and inequality constraints.

  • Very helpful to me. Can this intuition be generalized to understand the proof of KKT condition? – seaplusplus Apr 27 '16 at 08:03
  • Yes, and it's a beautiful story that's rarely told. We replace the four subspaces theorem with Farkas' lemma (the four cones theorem). I added notes about this. – littleO Apr 27 '16 at 09:01
  • +1 very nice! I haven't fully understood the case with inequality constraints, but I understand your (classic) strategy: work with linear functions first, then with general functions (which are locally approximately linear). – SiXUlm Apr 27 '16 at 10:02
  • There is a tricky point in the intuition: we say (supposing we're talking about minimization, for concreteness) that if the gradient of $f$ were not parallel to the gradient of $g$, then we could find a vector which is orthogonal to the gradient of $g$ and which has a negative inner product with the gradient of $f$, and improve our candidate by moving in that direction. The tricky thing is that, strictly speaking, we can't move in that direction, because those points are not on the surface (in the nonlinear constraint case). – Ian Apr 27 '16 at 15:21
  • To make sense of this directly without explicitly invoking the implicit function theorem, you should estimate how far away you are from the surface when you move along a tangent direction, and use that to conclude that if you project from the tangent direction down to the surface, you still decrease the objective function, provided you moved a small enough distance. – Ian Apr 27 '16 at 15:21
  • In the case where the constraint function is *twice* differentiable I think this is quite intuitive. In this case the distance between the point along the tangent from the surface is $O(h^2)$ and so the objective function is reduced by $O(h)$ and then increased by at most $O(h^2)$, which will be a net negative change for sufficiently small $h$. – Ian Apr 27 '16 at 15:23
  • @littleO I am in desperate need of help on a problem where I need to find the convex cone of a set. Can you please help me with it? https://math.stackexchange.com/questions/2449707/finding-the-polar-cone-of-the-given-cone-that-consists-of-vectors-in-mathbbr –  Sep 28 '17 at 23:37

Look at the tangent space to $S = g^{-1}(c)$ at $x_0$, say $S_{x_0}$. Then $S_{x_0} = \left[\nabla g(x_0) \right]^{\perp}$. Therefore, $S_{x_0}^\perp$ is the one-dimensional subspace spanned by $\nabla g(x_0)$. Now, $$\nabla f(x_0) = \lambda \nabla g(x_0)\text{ for some } \lambda \in \Bbb{R} \iff \nabla f(x_0) \in S_{x_0}^\perp \iff \nabla f(x_0)\cdot v = 0 \text{ for all } v \in S_{x_0}.$$ But every $v \in S_{x_0}$ is of the form $v = \dot\alpha(t_0)$ for some parametrized curve $\alpha:I \to S$ and $t_0 \in I$ with $\alpha(t_0) =x_0$. Since $x_0$ is a local extremum of $f$ on $S$, $t_0$ is a local extremum of $f \circ \alpha$ on $I$. Hence $$0=(f \circ \alpha)'(t_0) = \nabla f(\alpha(t_0))\cdot\dot\alpha(t_0) = \nabla f(x_0)\cdot v$$ for all $v \in S_{x_0}$, and so the result follows.

Caveat: think about where the implicit function theorem enters this proof (it is what guarantees that every $v \in S_{x_0}$ is realized as the velocity of some curve lying in $S$).
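As a concrete check of the curve argument, here is a toy numerical instance (the choices of $f$, $g$, and the curve $\alpha$ are mine, for illustration):

```python
import numpy as np

# Numerical check of the curve argument, with g(x, y) = x^2 + y^2,
# c = 1, f(x, y) = x*y, and the curve alpha(t) = (cos t, sin t) on S.
# (f o alpha)(t) = sin(2t)/2 has a local maximum at t0 = pi/4.
t0 = np.pi / 4
x0 = np.array([np.cos(t0), np.sin(t0)])
v  = np.array([-np.sin(t0), np.cos(t0)])   # alpha'(t0), tangent to S at x0

grad_f = np.array([x0[1], x0[0]])  # gradient of x*y
grad_g = 2 * x0                    # gradient of x^2 + y^2

print(np.isclose(np.dot(grad_f, v), 0))   # True: (f o alpha)'(t0) = 0
lam = grad_f[0] / grad_g[0]                # here lam = 1/2
print(np.allclose(grad_f, lam * grad_g))  # True: grad f = lam * grad g
```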


Here's a different way to see that Lagrange multipliers are "intuitively" the correct way to go. To first order, the change in $f$ due to a change in $\vec r$ is simply:$$\delta f = f(\vec r + \delta\vec r) - f(\vec r) \approx \nabla f\cdot\delta\vec r.$$This has the obvious interpretation for $f$ above, "go in the direction of $\nabla f$ to increase $f$," but it also has an interpretation for constraints: to first order, the displacements allowed by a constraint $g(\vec r) = c$ must be the ones such that $\nabla g \cdot \delta \vec r = 0$; otherwise, following that tangent vector would change $g$ to first order and violate the constraint. That is, if you obey the constraint, we must have $\delta g = 0.$

Now we simply combine these two. At the constrained extremum of $f$ on $S,$ it must be the case that for all $\delta\vec r$ such that $\nabla g \cdot \delta \vec r = 0,$ we also have $\delta f = \nabla f\cdot\delta\vec r = 0,$ otherwise we can follow $\pm \delta\vec r$ to increase/decrease $f$, proving that this wasn't either sort of extremum on $S$ after all. So just to phrase the very condition for a constrained maximum, we would have to write it as $$\forall \vec a . (\nabla g \cdot \vec a = 0) \rightarrow (\nabla f \cdot \vec a = 0)$$

But that "for-all" is 100% sufficient to prove that they are proportional! Effectively, you can't have two linearly independent vectors $\vec a, \vec b$ such that the nullspace of $(\vec b \cdot)$ is contained in the nullspace of $(\vec a \cdot)$. Let's just prove this last aspect to complete the proof.

Consider $\vec p = \vec a - \alpha \vec b$ with $\alpha = \vec a \cdot \vec b / (\vec b \cdot \vec b)$. By construction this satisfies $\vec b \cdot \vec p = 0$, and therefore we must have $\vec a \cdot \vec p = 0$ by the nullspace condition. But when you chug through the algebra, this ultimately means that $(\vec a \cdot \vec a)(\vec b \cdot \vec b) = (\vec a \cdot \vec b)^2.$ That guarantees the $=$ sign in a triangle inequality looking like either of $$|\vec a \pm \vec b|^2 = |\vec a|^2 + |\vec b|^2 \pm 2 (\vec a \cdot \vec b) ~\le~ (|\vec a| + |\vec b|)^2 = |\vec a|^2 + |\vec b|^2 + 2 |\vec a||\vec b|.$$ We know that those equals signs only hold when the vectors are degenerate, that is, $\vec a = \lambda \vec b.$
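The construction of $\vec p$ can be checked numerically: for linearly independent $\vec a, \vec b$ (illustrative values below), $\vec p$ is orthogonal to $\vec b$ but not to $\vec a$, so the implication in the "for-all" condition must fail.

```python
import numpy as np

# Sketch of the final step: for linearly independent a and b, the vector
# p = a - (a.b / b.b) b is orthogonal to b but NOT to a, since
# a.p = |a|^2 - (a.b)^2 / |b|^2 > 0 by strict Cauchy-Schwarz.
a = np.array([1.0, 2.0, 0.0])
b = np.array([0.0, 1.0, 1.0])

alpha = np.dot(a, b) / np.dot(b, b)
p = a - alpha * b

print(np.isclose(np.dot(b, p), 0))  # True: p is in the nullspace of (b.)
print(np.dot(a, p) > 0)             # True: but p is not in the nullspace of (a.)
```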

CR Drost