The Chain Rule

Theorem (Chain Rule)
Suppose \(U \subset {\mathbb R}^n\) is open and \(f : U \rightarrow {\mathbb R}^m\) is differentiable at \(p \in U\). Suppose also that \(g\) is a function defined on a neighborhood of \(q := f(p)\) with values in \({\mathbb R}^\ell\) which is differentiable at \(q\). Then \(g \circ f\) is differentiable at \(p\) and \(D_p (g \circ f) = (D_{f(p)} g )(D_p f)\).
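As a quick sanity check of the formula (the functions here are chosen only for illustration and are not part of the theorem), let \(f(x,y) := (xy, x+y)\) and \(g(u,v) := u^2 + v\), so that \(n = m = 2\) and \(\ell = 1\). At a point \(p = (x,y)\) with \(q = f(p) = (xy, x+y)\),
\[ D_p f = \begin{pmatrix} y & x \\ 1 & 1 \end{pmatrix}, \qquad D_q g = \begin{pmatrix} 2xy & 1 \end{pmatrix}, \]
so
\[ (D_q g)(D_p f) = \begin{pmatrix} 2xy^2 + 1 & 2x^2 y + 1 \end{pmatrix}. \]
Differentiating \(g(f(x,y)) = x^2 y^2 + x + y\) directly gives the same row vector, as the theorem predicts.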
Proof of the Chain Rule
Let
\[ R_p(x-p) := \begin{cases} \frac{f(x) - f(p) - (D_p f) (x-p)}{||x-p||} & x \neq p, \\ 0 & x = p. \end{cases}\]
\[ \tilde R_q(y-q) := \begin{cases} \frac{g(y) - g(q) - (D_q g) (y-q)}{||y-q||} & y \neq q, \\ 0 & y = q. \end{cases}\]
The key property of \(R_p\) is this: For all \(\epsilon > 0\), there exists \(\delta > 0\) such that \(||x-p|| < \delta\) implies \(||R_p(x-p)|| < \epsilon\). The analogous property holds for \(\tilde R_q\) by differentiability of \(g\) at \(q\).
Meta (First Main Idea)
We think of differentiability as corresponding to the property that the function equals a linear function plus something which decays faster than linearly near the point.
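For instance (a one-variable illustration chosen here, not used in the proof), take \(f(x) = x^2\) on \({\mathbb R}\). Expanding around \(p\) gives
\[ f(x) = p^2 + 2p (x-p) + (x-p)^2, \]
so \(D_p f\) is multiplication by \(2p\), and for \(x \neq p\) the remainder is
\[ R_p(x-p) = \frac{(x-p)^2}{|x-p|} = |x-p|, \]
which tends to \(0\) as \(x \rightarrow p\). The leftover term \(R_p(x-p)\,|x-p| = (x-p)^2\) is exactly the part that decays faster than linearly.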

By differentiability of \(f\),
\[ \begin{aligned} f(x) & = f(p) + D_p f (x-p) + R_p(x-p) ||x-p|| \\ & = q + D_p f (x-p) + R_p(x-p) ||x-p|| \end{aligned} \]
for all \(x\) near \(p\) and by differentiability of \(g\),
\[ g(y) = g(q) + D_q g (y-q) + \tilde R_q(y-q) ||y-q|| \]
for all \(y\) near \(q\). Substituting the former into the latter gives that \(g(f(x))\) equals
\[ \begin{aligned} & g(q) + (D_q g D_p f) (x-p) + D_q g R_p(x-p) ||x-p|| + \tilde R_q(f(x) - f(p)) \Big|\Big| f(x) - f(p) \Big|\Big| \\ & \qquad = g(q) + (D_q g D_p f)(x-p) + (\operatorname{I}(x-p) + \operatorname{II}(x-p)) ||x-p|| \end{aligned} \]
with
\[ \begin{aligned} \operatorname{I}(x-p) & := D_q g (R_p(x-p)), \\ \operatorname{II}(x-p) & := \frac{\tilde R_q(f(x) - f(p))}{||x-p||} \cdot \Big|\Big| f(x)-f(p) \Big|\Big|. \end{aligned} \]
We can also take the convention that both \(\operatorname{I}\) and \(\operatorname{II}\) vanish when \(x=p\). (Note that continuity of \(f\) implies that \(f(x)\) is near \(q\) when \(x\) is sufficiently near \(p\), so the composition is well-defined.)
Meta (Second Main Idea)
We have already written \(g(f(x))\) as a constant plus a linear function of \(x-p\) plus something times \(||x-p||\), so it suffices to show that the something tends to zero as \(x \rightarrow p\). In this case, this means showing that \(||\operatorname{I}(x-p)||\) and \(||\operatorname{II}(x-p)||\) can be made smaller than any fixed \(\epsilon > 0\) by taking \(x\) sufficiently close to \(p\).
  • Because \(R_p(x-p) \rightarrow 0\) as \(x \rightarrow p\), there exists \(\delta_1\) such that \(||x-p|| < \delta_1\) implies
    \[ ||R_p (x-p)|| < \frac{\epsilon}{2 (||D_q g|| + 1)}\]
    For any such \(x\), it follows that \(||\operatorname{I}(x-p)|| < \epsilon/2\).
  • Just by boundedness of \(D_p f\) and the fact that \(||R_p (x-p)|| \leq 1\) for \(x\) sufficiently near \(p\), there exists \(\delta_2\) such that \(||x-p|| < \delta_2\) implies that
    \[ || f(x) - f(p) || = \Big|\Big| D_p f (x-p) + R_p(x-p) ||x-p|| \Big|\Big| \leq ( ||D_p f|| + 1) ||x-p||. \]
  • Because \(\tilde R_q(y-q) \rightarrow 0\) as \(y \rightarrow q\), there exists \(\eta\) such that \(||y-q|| < \eta\) implies
    \[ ||\tilde R_q (y-q)|| < \frac{\epsilon}{2(1 + ||D_p f||)}. \]
So if
\[ ||x-p|| < \min \left\{ \delta_1,\delta_2, \frac{\eta}{1 + ||D_p f||} \right\}, \]
then
\[ ||f(x) - q|| = \Big|\Big| D_p f (x-p) + R_p(x-p) ||x-p|| \Big|\Big| \]
is less than \(\eta\) (because it is at most \((||D_p f||+1) ||x-p||\)), so in particular \(||f(x) - q||\) is small enough that \(||\tilde R_q(f(x) - f(p))|| < \epsilon / (2 (1 + ||D_p f||))\). Combining this with the bound \(||f(x) - f(p)|| \leq (||D_p f|| + 1) ||x-p||\) gives that \(||\operatorname{II}(x-p)|| \leq \epsilon/2\). Consequently, for all such \(x\neq p\), it follows that
\[ \frac{\left| \left| g(f(x)) - g(q) - (D_q g D_p f)(x-p)\right|\right|}{||x-p||} < \epsilon. \]
This is exactly what is asserted by the Chain Rule (because the total derivative is unique when it exists).