Here's some intuition.

First suppose that $x_0$ is a local minimizer for the problem
\begin{align}
\tag{$\spadesuit$} \text{minimize} & \quad f(x) \\
\text{subject to} &\quad Ax = b.
\end{align}
If $Au = 0$, then $A(x_0 + tu) = b$ for every $t$, so $x_0 + tu$ stays feasible, and the directional derivative $D_u f(x_0) = \langle \nabla f(x_0),u \rangle$ is nonnegative. Otherwise, it would be possible to *improve* $x_0$ by perturbing it a bit in the direction $u$.

Since $A(-u) = 0$ as well, the same argument applied to $-u$ gives $\langle \nabla f(x_0), -u \rangle \geq 0$. It follows that if $Au = 0$, then $\langle \nabla f(x_0),u \rangle = 0$.

So, $\nabla f(x_0)$ is in the orthogonal complement of the null space of $A$.

BUT, according to the "**four subspaces theorem**" (which is beloved by Gilbert Strang), the orthogonal complement of the null space of $A$ is the range of $A^T$. Thus, $\nabla f(x_0)$ is in the range of $A^T$. In other words, there exists a vector $\lambda$ such that
\begin{equation}
\nabla f(x_0) = A^T \lambda.
\end{equation}
This is our Lagrange multiplier optimality condition, in the case where we have linear equality constraints.
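As a quick numerical sanity check (a sketch, not part of the argument), take the hypothetical quadratic $f(x) = \tfrac12\|x - c\|^2$, whose minimizer over $Ax = b$ is the projection of $c$ onto that affine set; the multiplier $\lambda$ can then be computed in closed form and the condition $\nabla f(x_0) = A^T\lambda$ verified:

```python
import numpy as np

# Hypothetical instance: minimize f(x) = 0.5*||x - c||^2 subject to Ax = b.
# Then grad f(x) = x - c, and the optimality condition says grad f(x0) = A^T lam.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 4))   # 2 equality constraints in R^4
b = rng.standard_normal(2)
c = rng.standard_normal(4)

# The minimizer is the projection of c onto {x : Ax = b}; substituting
# x0 = c + A^T lam into A x0 = b and solving for lam gives the multiplier.
lam = np.linalg.solve(A @ A.T, b - A @ c)
x0 = c + A.T @ lam

grad = x0 - c                         # gradient of f at x0
assert np.allclose(A @ x0, b)         # x0 is feasible
assert np.allclose(grad, A.T @ lam)   # grad f(x0) lies in the range of A^T
```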

Next, suppose that $x_0$ is a local minimizer for the problem
\begin{align}
\text{minimize} & \quad f(x) \\
\text{subject to} &\quad g(x) = 0,
\end{align}
where $g:\mathbb R^n \to \mathbb R^m$.

It seems plausible that $x_0$ is also a local minimizer for the problem we get by replacing $g$ with the local linear approximation to $g$ at $x_0$:
\begin{align}
\text{minimize} & \quad f(x) \\
\text{subject to} &\quad g(x_0) + g'(x_0)(x - x_0) = 0.
\end{align}
(Here $g'(x_0)$ is the Jacobian matrix of $g$ at $x_0$.)
Assuming that $x_0$ is indeed a local minimizer for this modified problem,
it follows that $x_0$ satisfies the optimality condition derived above:
\begin{align}
\nabla f(x_0) &= g'(x_0)^T \lambda \\
&= \nabla g_1(x_0) \lambda_1 + \cdots + \nabla g_m(x_0) \lambda_m
\end{align}
for some vector $\lambda = \begin{bmatrix} \lambda_1 & \ldots & \lambda_m \end{bmatrix}^T \in \mathbb R^m$.
(Here $g_1,\ldots,g_m$ are the component functions of $g$, and we use the convention that the gradient is a column vector.)
This is our Lagrange multiplier optimality condition in the case of nonlinear equality constraints.
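For a hypothetical concrete instance, minimize $f(x,y) = x + y$ over the unit circle $g(x,y) = x^2 + y^2 - 1 = 0$; the minimizer is $(-1/\sqrt 2, -1/\sqrt 2)$, and the condition $\nabla f(x_0) = \lambda \, \nabla g(x_0)$ can be checked numerically:

```python
import numpy as np

# Minimize f(x, y) = x + y subject to g(x, y) = x^2 + y^2 - 1 = 0.
# The minimizer is x0 = (-1/sqrt(2), -1/sqrt(2)).
x0 = np.array([-1.0, -1.0]) / np.sqrt(2)

grad_f = np.array([1.0, 1.0])   # gradient of f (constant)
grad_g = 2 * x0                 # gradient of g at x0

# Solve grad_f = lam * grad_g for the scalar multiplier lam.
lam = grad_f[0] / grad_g[0]
assert np.isclose(x0 @ x0, 1.0)           # feasibility: x0 is on the circle
assert np.allclose(grad_f, lam * grad_g)  # Lagrange condition holds
```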

I believe it's possible to view the proof using the implicit function theorem as a rigorous version of this intuition.

**Edit:** Now I'll show how a similar approach can handle inequality constraints, if we replace the "four subspaces theorem" with Farkas's lemma (which I call the **four cones theorem**).

Let $x^*$ be a local minimizer for the problem
\begin{align}
\tag{$\heartsuit$} \text{minimize} & \quad f(x) \\
\text{subject to} & \quad Ax \leq b.
\end{align}
Suppose that $u \in \mathbb R^n$ and $Au \leq 0$.
Then $A(x^* + tu) = Ax^* + tAu \leq b$ for all $t \geq 0$, so $x^* + tu$ remains feasible,
and the directional derivative
$D_uf(x^*) = \langle \nabla f(x^*),u \rangle$ is nonnegative.
Otherwise, we could *improve* $x^*$ by perturbing it
a bit in the direction of $u$.

At this point, we would like to invoke something like the four subspaces theorem,
to find some optimality condition that $\nabla f(x^*)$ must satisfy.
While the four subspaces theorem is inapplicable here, there is a one-sided version
of the four subspaces theorem that fits this situation perfectly.

## The four cones theorem

We first recall a few facts from convex analysis.
If $C \subset \mathbb R^n$ is a cone, then the *polar*
of $C$ is defined by
\begin{equation} C^\circ = \{z \mid \langle x,z \rangle \leq 0 \,\, \forall \,\, x \in C \}.
\end{equation}
Note that $C^\circ$ is a closed convex cone.

A convex cone is a "one-sided" version of a subspace, and the polar cone
is a "one-sided" version of the orthogonal complement.
Instead of $\langle x,z \rangle = 0$, we have $\langle x, z \rangle \leq 0$.
Here's some support for this analogy:

- For a subspace $W$ of $\mathbb R^n$, we have the familiar equation
$W = W^{\perp \perp}$.
The one-sided version of this is that if $C$ is a closed convex cone, then $C = C^{\circ \circ}$.
- If $W$ is a subspace of $\mathbb R^n$, then any $x \in \mathbb R^n$ can be decomposed as
$x = \Pi_W(x) + \Pi_{W^\perp}(x)$. The one-sided version of this is that
if $C \subset \mathbb R^n$ is a closed convex cone, then any $x \in \mathbb R^n$ can be decomposed
as $x = \Pi_C(x) + \Pi_{C^\circ}(x)$. Moreover, $\Pi_C(x)$ is orthogonal to $\Pi_{C^\circ}(x)$.
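The orthant case makes this decomposition easy to check numerically: for $C = \{x \mid x \geq 0\}$ the polar is $C^\circ = \{z \mid z \leq 0\}$, and both projections act coordinatewise (a sketch, assuming NumPy):

```python
import numpy as np

# Moreau decomposition for the nonnegative orthant C = {x : x >= 0},
# whose polar cone is {z : z <= 0}.  Both projections act coordinatewise.
rng = np.random.default_rng(1)
x = rng.standard_normal(5)

proj_C = np.maximum(x, 0)        # projection onto C
proj_Cpolar = np.minimum(x, 0)   # projection onto the polar cone

assert np.allclose(x, proj_C + proj_Cpolar)  # x decomposes into the two parts
assert np.isclose(proj_C @ proj_Cpolar, 0)   # the two parts are orthogonal
```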

Let $\mathcal R_+(A^T) = \{ A^T z \mid z \geq 0 \}$.
So $\mathcal R_+(A^T)$ is a "one-sided version" of the column space of $A^T$. I call it the **column cone** of $A^T$.
Let $ \mathcal N_-(A) = \{x \mid Ax \leq 0 \}$.
So $\mathcal N_-(A)$ is a one-sided version of the null space of $A$.
I call it the **null cone** of $A$.
Note that $\mathcal N_-(A)$ and $\mathcal R_+(A^T)$ are both closed convex cones.

The four subspaces theorem can be stated as
\begin{equation}
\mathcal N(A)^\perp = \mathcal R(A^T).
\end{equation}
Here is our "one-sided" version of the four subspaces theorem:
\begin{equation}
\mathcal N_-(A)^\circ = \mathcal R_+(A^T).
\end{equation}
I call this the "four cones theorem".
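The easy inclusion $\mathcal R_+(A^T) \subseteq \mathcal N_-(A)^\circ$ follows from $\langle A^T z, u \rangle = \langle z, Au \rangle \leq 0$ whenever $z \geq 0$ and $Au \leq 0$; here is a randomized spot check of that inequality (a sketch, not a proof of the hard inclusion):

```python
import numpy as np

# Spot check: if g = A^T z with z >= 0, then <g, u> <= 0 for every u
# in the null cone of A.  We sample null-cone directions by rejection.
rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3))
z = rng.random(2)                 # z >= 0
g = A.T @ z                       # g in the column cone of A^T

hits = 0
while hits < 100:
    u = rng.standard_normal(3)
    if np.all(A @ u <= 0):        # u lies in the null cone of A
        assert g @ u <= 1e-12     # the polar inequality holds
        hits += 1
```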

## Invoking the four cones theorem

We can use the "four cones theorem" to find an optimality condition
for problem ($\heartsuit$), just as we used the four subspaces theorem
to find an optimality condition for problem ($\spadesuit$).
As noted previously, if $A u \leq 0$, then $\langle \nabla f(x^*), u \rangle \geq 0$.
In other words, $-\nabla f(x^*) \in \mathcal N_-(A)^\circ$.
By the four cones theorem, it follows that
$-\nabla f(x^*) \in \mathcal R_+(A^T)$.
Thus, there exists a vector $z \geq 0$ such that
\begin{equation}
\nabla f(x^*) + A^T z = 0.
\end{equation}
This, together with $Ax^* \leq b$, is our optimality condition
(KKT condition) for the problem ($\heartsuit$) with linear inequality constraints.
(To recover the complementary slackness part of the KKT conditions, run the same argument using only the rows of $A$ whose constraints are active at $x^*$; directions $u$ satisfying $a_i^T u \leq 0$ for just the active rows $i$ are still locally feasible, so the multiplier $z$ can be taken to vanish on the inactive constraints.)
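As a sanity check on a hypothetical instance of ($\heartsuit$): minimize $\|x - c\|^2$ with $c = (2,2)$ subject to $x \leq (1,1)$ (so $A = I$, $b = (1,1)$); the minimizer is $x^* = (1,1)$ and the multiplier $z = (2,2) \geq 0$ satisfies the stationarity condition:

```python
import numpy as np

# Hypothetical instance of the inequality-constrained problem:
# minimize ||x - c||^2 subject to x <= 1, i.e. A = I, b = (1, 1).
A = np.eye(2)
b = np.ones(2)
c = np.array([2.0, 2.0])
x_star = np.ones(2)               # the minimizer, by inspection

grad = 2 * (x_star - c)           # gradient of f at x_star, equals (-2, -2)
z = np.array([2.0, 2.0])          # candidate multiplier, z >= 0

assert np.all(A @ x_star <= b)        # primal feasibility
assert np.all(z >= 0)                 # dual feasibility
assert np.allclose(grad + A.T @ z, 0) # stationarity: grad f + A^T z = 0
```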

## Connection with Farkas's lemma

The four cones theorem can be stated as a "theorem of the alternative".
If $g \in \mathbb R^n$, then $g$ either does or does not belong to $\mathcal R_+(A^T)$.
By the four cones theorem, $g$ fails to belong to $\mathcal R_+(A^T)$ exactly when $g$ fails to belong
to the polar of $\mathcal N_-(A)$, which happens exactly when there is some $u \in \mathcal N_-(A)$ with $\langle g, u \rangle > 0$. This means that one and only one of the following two alternatives is true:

- There exists $z \geq 0$ such that $g = A^T z$.
- There exists $u$ such that $Au \leq 0$ and $\langle g, u \rangle > 0$.

This is equivalent to the statement of Farkas's lemma given in Boyd and Vandenberghe.
The four cones theorem *is* Farkas's lemma.
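A tiny illustration of the two alternatives (with a hand-picked $A$ and $g$, chosen so each case is easy to verify): take $A = I$, so that $\mathcal R_+(A^T)$ is the nonnegative orthant; then $g_1 = (1,2)$ satisfies the first alternative, while $g_2 = (-1,1)$ satisfies the second, certified by $u = (-1,0)$:

```python
import numpy as np

# With A = I, the column cone R_+(A^T) is the nonnegative orthant.
A = np.eye(2)

# First alternative: g1 = A^T z for some z >= 0.
g1 = np.array([1.0, 2.0])
z = g1                            # here z = g1 works since A = I
assert np.all(z >= 0) and np.allclose(A.T @ z, g1)

# Second alternative: g2 is outside the cone, and u separates it.
g2 = np.array([-1.0, 1.0])        # not in the nonnegative orthant
u = np.array([-1.0, 0.0])         # Au <= 0 and <g2, u> = 1 > 0
assert np.all(A @ u <= 0) and g2 @ u > 0
```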

## Visualizing the four cones theorem

The "alternative" statement of the four cones theorem suggests a way of visualizing
the four cones theorem that makes it seem obvious.
The first alternative is that $g$ belongs to the convex cone generated by the
columns of $A^T$. The second alternative is that there is a hyperplane through the origin
that separates $g$ from the columns of $A^T$. (The vector $u$ is a normal
vector to this separating hyperplane.)

## Problem with linear equality and inequality constraints

This case is similar to the previous cases, except that we will use
a different version of Farkas's lemma that involves a combination of
equality and inequality constraints.