The Chain Rule

Theorem (Chain Rule)

Suppose \(U \subset {\mathbb R}^n\) is open and \(f : U \rightarrow {\mathbb R}^m\) is differentiable at \(p \in U\). Suppose also that \(g\) is a function defined on a neighborhood of \(q := f(p)\) with values in \({\mathbb R}^\ell\) which is differentiable at \(q\). Then \(g \circ f\) is differentiable at \(p\) and \(D_p (g \circ f) = (D_{f(p)} g )(D_p f)\).
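
As a quick sanity check of the formula, consider the following example; the particular functions are chosen purely for illustration and are not part of the original text. Take \(f(x,y) := (xy, x+y)\) from \({\mathbb R}^2\) to \({\mathbb R}^2\) and \(g(u,v) := u^2 + v\) from \({\mathbb R}^2\) to \({\mathbb R}\), with \(p = (x,y)\) and \(q = f(p) = (xy, x+y)\). Then

\[ D_p f = \begin{pmatrix} y & x \\ 1 & 1 \end{pmatrix}, \qquad D_q g = \begin{pmatrix} 2xy & 1 \end{pmatrix}, \qquad (D_q g)(D_p f) = \begin{pmatrix} 2xy^2 + 1 & 2x^2 y + 1 \end{pmatrix}. \]

Differentiating \(g(f(x,y)) = x^2 y^2 + x + y\) directly gives the same row vector, as the theorem predicts.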

Proof of the Chain Rule

Define the normalized remainders of \(f\) at \(p\) and of \(g\) at \(q\) by

\[ R_p(x-p) := \begin{cases} \frac{f(x) - f(p) - (D_p f) (x-p)}{||x-p||} & x \neq p, \\ 0 & x = p. \end{cases}\]

\[ \tilde R_q(y-q) := \begin{cases} \frac{g(y) - g(q) - (D_q g) (y-q)}{||y-q||} & y \neq q, \\ 0 & y = q. \end{cases}\]

The key property of \(R_p\) is this: For all \(\epsilon > 0\), there exists \(\delta > 0\) such that \(||x-p|| < \delta\) implies \(||R_p(x-p)|| < \epsilon\). The analogous property holds for \(\tilde R_q\) by differentiability of \(g\) at \(q\).

Meta (First Main Idea)

We think of differentiability as the property that, near the point, the function equals an affine function (a constant plus a linear function of the displacement) plus a remainder which decays faster than linearly.
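
For instance, in an illustrative one-variable case (not taken from the original argument), let \(f(x) := x^2\) on \({\mathbb R}\). Expanding around \(p\) gives

\[ f(x) = p^2 + 2p (x-p) + (x-p)^2, \]

so \(D_p f\) is multiplication by \(2p\), and for \(x \neq p\) the normalized remainder is \(R_p(x-p) = (x-p)^2 / |x-p| = |x-p|\), which tends to \(0\) as \(x \rightarrow p\).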

By differentiability of \(f\),

\[ \begin{aligned}
f(x) &= f(p) + D_p f (x-p) + R_p(x-p) ||x-p|| \\
&= q + D_p f (x-p) + R_p(x-p) ||x-p||
\end{aligned} \]

for all \(x\) near \(p\) and by differentiability of \(g\),

\[ g(y) = g(q) + D_q g (y-q) + \tilde R_q(y-q) ||y-q|| \]

for all \(y\) near \(q\). Substituting the former into the latter gives that \(g(f(x))\) equals

\[ \begin{aligned}
& g(q) + (D_q g D_p f) (x-p) + D_q g R_p(x-p) ||x-p|| + \tilde R_q(f(x) - f(p)) \Big|\Big| f(x) - f(p) \Big|\Big| \\
&= g(q) + (D_q g D_p f)(x-p) + (\operatorname{I}(x-p) + \operatorname{II}(x-p)) ||x-p||
\end{aligned} \]

with

\[ \begin{aligned}
\operatorname{I}(x-p) &:= D_q g (R_p(x-p)), \\
\operatorname{II}(x-p) &:= \frac{\tilde R_q(f(x) - f(p))}{||x-p||} \cdot \Big|\Big| f(x)-f(p) \Big|\Big|.
\end{aligned} \]

We can also take the convention that both \(\operatorname{I}\) and \(\operatorname{II}\) vanish when \(x=p\). (Note that continuity of \(f\) implies that \(f(x)\) is near \(q\) when \(x\) is sufficiently near \(p\), so the composition is well-defined.)
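
To see the decomposition in a concrete case, here is a small one-variable illustration; the functions are assumptions made for this example only. Take \(f(x) := x^2\) and \(g(y) := y^2\) with \(p = q = 1\), and write \(h := x - 1\). Then \(D_p f = D_q g = 2\), \(R_p(h) = |h|\), \(\tilde R_q(k) = |k|\), and \(f(x) - f(p) = 2h + h^2\), so

\[ \operatorname{I}(h) = 2|h|, \qquad \operatorname{II}(h) = \frac{|2h + h^2|}{|h|} \cdot |2h + h^2| = |h| (2+h)^2 \]

for \(h \neq 0\). Substituting into the decomposition gives

\[ g(q) + (D_q g D_p f) h + (\operatorname{I}(h) + \operatorname{II}(h)) |h| = 1 + 4h + h^2 \left( 2 + (2+h)^2 \right) = 1 + 4h + 6h^2 + 4h^3 + h^4, \]

which is \((1+h)^4 = g(f(x))\), as expected. Both \(\operatorname{I}(h)\) and \(\operatorname{II}(h)\) tend to \(0\) as \(h \rightarrow 0\).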

Meta (Second Main Idea)

We have already written \(g(f(x))\) as the constant \(g(q)\), plus a linear function of \(x-p\), plus \(||x-p||\) times a remainder term, so it suffices to show that the remainder term tends to zero as \(x \rightarrow p\). In this case, this means showing that \(||\operatorname{I}||\) and \(||\operatorname{II}||\) can each be made smaller than \(\epsilon/2\), for any fixed \(\epsilon > 0\), by taking \(x\) sufficiently close to \(p\).

Fix \(\epsilon > 0\). Because \(R_p(x-p) \rightarrow 0\) as \(x \rightarrow p\), there exists \(\delta_1 > 0\) such that \(||x-p|| < \delta_1\) implies

\[ ||R_p (x-p)|| < \frac{\epsilon}{2 (||D_q g|| + 1)}. \]

For any such \(x\), it follows that \(||\operatorname{I}(x-p)|| \leq ||D_q g|| \, ||R_p(x-p)|| < \epsilon/2\).

By boundedness of \(D_p f\) and the fact that \(||R_p (x-p)|| \leq 1\) for \(x\) sufficiently near \(p\), there exists \(\delta_2 > 0\) such that \(||x-p|| < \delta_2\) implies that

\[ \begin{aligned}
|| f(x) - f(p) || &= \Big|\Big| D_p f (x-p) + R_p(x-p) ||x-p|| \Big|\Big| \\
&\leq ( ||D_p f|| + 1) ||x-p||.
\end{aligned} \]

Because \(\tilde R_q(y-q) \rightarrow 0\) as \(y \rightarrow q\), there exists \(\eta > 0\) such that \(||y-q|| < \eta\) implies

\[ ||\tilde R_q (y-q)|| < \frac{\epsilon}{2(1 + ||D_p f||)}. \]

So if

\[ ||x-p|| < \min \left\{ \delta_1,\delta_2, \frac{\eta}{1 + ||D_p f||} \right\}, \]

then

\[ ||f(x) - q|| = \Big|\Big| D_p f (x-p) + R_p(x-p) ||x-p|| \Big|\Big| \]

is less than \(\eta\) (because it is at most \((||D_p f||+1) ||x-p||\)), so in particular \(||f(x) - q||\) is small enough that \(||\tilde R_q(f(x) - f(p))|| < \epsilon / (2 (1 + ||D_p f||))\). Combining this with the bound \(||f(x) - f(p)|| \leq (||D_p f|| + 1) ||x-p||\) gives that \(||\operatorname{II}(x-p)|| \leq \epsilon/2\). Consequently, for all such \(x\neq p\), it follows that

\[ \frac{\left| \left| g(f(x)) - g(q) - (D_q g D_p f)(x-p)\right|\right|}{||x-p||} < \epsilon. \]

This is exactly what is asserted by the Chain Rule (because the total derivative is unique when it exists).