Optimization and Lagrange Multipliers

Theorem (Lagrange Multipliers)
Suppose \(U \subset {\mathbb R}^n\) is open. Let \(F : U \rightarrow {\mathbb R}^m\) (where \(m < n\)) be \(C^1\) and suppose that \(DF\) has full rank \(m\) at every point of \(U\). Consider the set
\[ \Sigma := \left\{x \in U \ : \ F(x) = c \right\}\]
for some fixed \(c \in {\mathbb R}^m\). If \(\Sigma\) is nonempty and if \(f\) is a \(C^1\) function on \(U\) which attains a maximum on \(\Sigma\), i.e., such that there exists \(x \in \Sigma\) such that \(f(x) = \sup_{y \in \Sigma} f(y)\), then there exists \(\lambda \in {\mathbb R}^m\) such that
\[ \frac{\partial f}{\partial x_i}(x) + \sum_{j=1}^m \lambda_j \frac{\partial F_j}{\partial x_i} (x) = 0 \tag{1}\]
for all \(i=1,\ldots,n\), where \(F_1,\ldots,F_m\) are the coordinates of \(F\), i.e., \(F(x) = (F_1(x),\ldots,F_m(x))\).
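Example (Sanity Check)
To make the statement concrete, consider a standard worked example: maximize \(f(x,y) = x + y\) over the unit circle, i.e., take \(n = 2\), \(m = 1\), \(F(x,y) = x^2 + y^2\), and \(c = 1\). Here \(DF = (2x \ \ 2y)\) has full rank \(1\) on any open \(U\) avoiding the origin, and condition (1) reads
\[ 1 + 2 \lambda x = 0, \qquad 1 + 2 \lambda y = 0, \]
so \(x = y = -1/(2\lambda)\). Substituting into the constraint gives \(x = y = \pm 1/\sqrt{2}\); the maximum is attained at \(x = y = 1/\sqrt{2}\), where \(f = \sqrt{2}\) and \(\lambda = -1/\sqrt{2}\). We will return to this example to illustrate the steps of the proof.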
Meta (Idea of Proof)
The condition (1) can be interpreted as saying that the gradients \(\nabla f\) and \(\nabla F_1,\ldots,\nabla F_m\) are linearly dependent. (Since the gradients \(\nabla F_1,\ldots,\nabla F_m\) are already known to be linearly independent, the coefficient in front of \(\nabla f\) in any nontrivial linear relation cannot be zero, and thus we can scale it to be equal to \(1\).) We start by invoking the Implicit Function Theorem.
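In vector form, condition (1) says
\[ \nabla f(x) + \sum_{j=1}^m \lambda_j \nabla F_j(x) = 0, \]
i.e., \(\nabla f(x)\) lies in \(\text{span}\{\nabla F_1(x),\ldots,\nabla F_m(x)\}\); this is the formulation the proof below actually establishes.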
Proof (Part 1)
Let \(x \in \Sigma\) be a point at which the maximum on \(\Sigma\) is attained. By the Implicit Function Theorem (which applies because \(DF\) has full rank \(m\) at \(x\)), there is a \(C^1\) function \(\varphi\) on a ball \(B \subset {\mathbb R}^{n-m}\) such that \(\varphi\) parametrizes \(\Sigma\) near the point \(x\) (i.e., \(D \varphi\) has full rank, \(\varphi\) is injective, and \(\varphi(B)\) is exactly the set of points in \(\Sigma\) which lie in some neighborhood of \(x\)). We may assume without loss of generality that \(B\) is a neighborhood of the origin in \({\mathbb R}^{n-m}\) and \(\varphi(0) = x\). It follows that \(f(\varphi(t))\) has a local maximum at \(t = 0\), and hence that the \(t\)-variable gradient of \(f \circ \varphi\) equals the zero vector at \(t = 0\).
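Example (Continued)
In the circle example, near the maximizer \(x = (1/\sqrt{2}, 1/\sqrt{2})\) one admissible choice of parametrization (among many) is
\[ \varphi(t) = \left( t + \tfrac{1}{\sqrt{2}}, \ \sqrt{1 - \left(t + \tfrac{1}{\sqrt{2}}\right)^2} \right) \]
for \(t\) in a small ball \(B \subset {\mathbb R}\) around \(0\), so that \(\varphi(0) = x\). A direct computation gives
\[ \left. \frac{d}{dt} f(\varphi(t)) \right|_{t=0} = 1 - \left. \frac{t + 1/\sqrt{2}}{\sqrt{1 - \left(t + 1/\sqrt{2}\right)^2}} \right|_{t=0} = 1 - 1 = 0, \]
confirming that \(f \circ \varphi\) has a critical point at \(t = 0\).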
Meta (Idea Continued)
Now we use the fact that \(F_i \circ \varphi(t)\) also has a critical point at \(t=0\) for each \(i=1,\ldots,m\) (indeed, each \(F_i \circ \varphi\) is constant). Each \(F_i\) has a nonzero gradient at \(x\), and together the gradients \(\nabla F_1(x),\ldots,\nabla F_m(x)\) point in as many linearly independent directions orthogonal to \(\Sigma\) as possible (namely \(m\)), so \(\nabla f(x)\), which is also orthogonal to \(\Sigma\) at \(x\), must be linearly dependent on them.
Proof (Part 2)
By the Chain Rule, we have that
\[ 0 = \left. \frac{\partial (f \circ \varphi)}{\partial t_i} \right|_{t=0} = \sum_{j=1}^n \frac{\partial f}{\partial x_j}(x) \, \frac{\partial \varphi_j}{\partial t_i}(0)\]
for all \(i = 1,\ldots,n-m\). Let \(\nabla f(x) = (\partial_{x_1} f(x),\ldots,\partial_{x_n} f(x))\) and let \(v_i = ((\partial_{t_i} \varphi_1)(0),\ldots,(\partial_{t_i} \varphi_n)(0))\). Our calculation gives that
\[ \nabla f(x) \cdot v_i = 0 \qquad \forall i = 1,\ldots,n-m. \]
The vectors \(v_1,\ldots,v_{n-m}\) are exactly the columns of \(D \varphi\) at \(t=0\) (written out as row vectors). Since \(D \varphi(0)\) has full rank \(n-m\), the vectors \(v_1,\ldots,v_{n-m}\) are linearly independent. So \(\nabla f(x)\) must belong to the orthogonal complement of \(\text{span}\{v_1,\ldots,v_{n-m}\}\).
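Example (Continued)
In the circle example, \(n - m = 1\) and the single column of \(D \varphi(0)\) is \(v_1 = \varphi'(0) = (1, -1)\); indeed \(\nabla f(x) \cdot v_1 = (1,1) \cdot (1,-1) = 0\), as the calculation above requires.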

We also know that \(F_1,\ldots,F_{m}\) are constant functions on \(\Sigma\). This means that \(F_i \circ \varphi(t)\) is constant in \(t\) for each \(i=1,\ldots,m\). Just as above, this means that \(\nabla F_i(x)\) also belongs to the orthogonal complement of \(\text{span}\{v_1,\ldots,v_{n-m}\}\) for each \(i=1,\ldots,m\).
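Example (Continued)
In the circle example, \(F(\varphi(t)) = \left(t + \tfrac{1}{\sqrt{2}}\right)^2 + \left(1 - \left(t + \tfrac{1}{\sqrt{2}}\right)^2\right) = 1\) identically in \(t\), and correspondingly \(\nabla F(x) = (\sqrt{2}, \sqrt{2})\) satisfies \(\nabla F(x) \cdot v_1 = (\sqrt{2}, \sqrt{2}) \cdot (1, -1) = 0\).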

As row vectors, each \(\nabla F_i(x)\) is a row of \(DF(x)\); since \(DF(x)\) has full rank \(m\), the gradients \(\nabla F_1(x),\ldots, \nabla F_m(x)\) are also linearly independent. The orthogonal complement of \(\text{span}\{v_1,\ldots,v_{n-m}\}\) has dimension \(n - (n-m) = m\), so \(\nabla F_1(x),\ldots, \nabla F_m(x)\) form a basis of it. This means that \(\nabla f(x)\) is a linear combination of \(\nabla F_1(x),\ldots, \nabla F_m(x)\); writing \(\nabla f(x) = -\sum_{j=1}^m \lambda_j \nabla F_j(x)\) for suitable \(\lambda_1,\ldots,\lambda_m\) is exactly the conclusion (1) of the theorem.
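Example (Conclusion)
In the circle example, the orthogonal complement of \(\text{span}\{v_1\} = \text{span}\{(1,-1)\}\) is the one-dimensional space \(\text{span}\{(1,1)\}\), of which \(\nabla F(x) = (\sqrt{2}, \sqrt{2})\) is a basis. Writing \(\nabla f(x) = (1,1) = -\lambda \nabla F(x)\) recovers \(\lambda = -1/\sqrt{2}\), the same multiplier found directly from (1) above.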