Automatic differentiation (AD), also known as “autodiff”, is a set of methods for computing the numerical values of partial derivatives. The two modes of AD are:

  • Forward mode auto differentiation
  • Reverse mode auto differentiation

Composite functions

Recall the definition of the composite function (function of a function) in mathematics:

Composite function
Given two functions $f(x)$ and $g(x)$, we can compose a new function $h(x)$ by applying $g(x)$ to the output of $f(x)$. $$\begin{equation}h(x) = g(f(x))\end{equation}$$

If $f(x,y)$ is defined to be an addition operation $x+y$, and $g(x,y)$ a multiplication operation $x * y$, then applying $g|_{x=f}$ over $f$ describes the sequence of operations for obtaining $h(x,y) = (x + y) * y$.

Similarly, $g|_{y=f}$ describes the sequence of operations for obtaining $h(x,y) = x * (x + y)$.
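As a small illustration in Python (the names `f`, `g`, and `h` are just the ones used in this section written out as code; `h2` is an extra name introduced here for the second composition):

```python
def f(x, y):
    return x + y          # addition operation

def g(x, y):
    return x * y          # multiplication operation

def h(x, y):
    return g(f(x, y), y)  # g|_{x=f}: (x + y) * y

def h2(x, y):
    return g(x, f(x, y))  # g|_{y=f}: x * (x + y)

print(h(6, 7), h2(6, 7))  # 91 78
```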

Forward mode

The forward mode auto differentiation (AD) evaluates the gradients in the same forward pass as the full expression evaluation, hence the name “forward”. The gradient of each independent variable is propagated forward with the gradients of the other variables set to 0. For instance, given a set of $n$ independent variables $x_1,x_2,\dots,x_n$, the first forward pass is initialized with $\frac{\partial{x_1}}{\partial{x_1}}=1$ and $\frac{\partial{x_2}}{\partial{x_1}}=\dots=\frac{\partial{x_n}}{\partial{x_1}}=0$. The second forward pass is initialized with $\frac{\partial{x_2}}{\partial{x_2}}=1$ and $\frac{\partial{x_1}}{\partial{x_2}}=\dots=\frac{\partial{x_n}}{\partial{x_2}}=0$. So with $n$ independent variables, there are $n$ forward passes. This process is a bit like signal decomposition in signals and systems, whereby a signal can be reconstructed by summing up sines and cosines. The analogy is that if we vary one sine component while keeping all other sines/cosines constant, the resultant waveform varies in response to that sine component alone. Similarly, when we compute the change with respect to a particular variable ($\frac{\partial}{\partial{variable}}$), we treat the other variables as constants.
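For the two-variable case used in the examples below ($x$ and $y$), the two seeded passes are:

$$\begin{aligned} \text{pass 1: } \frac{\partial{x}}{\partial{x}}=1,\ \frac{\partial{y}}{\partial{x}}=0 &\;\Rightarrow\; \text{the pass yields } \frac{\partial{z}}{\partial{x}}\\ \text{pass 2: } \frac{\partial{y}}{\partial{y}}=1,\ \frac{\partial{x}}{\partial{y}}=0 &\;\Rightarrow\; \text{the pass yields } \frac{\partial{z}}{\partial{y}} \end{aligned}$$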

[Figure: expression tree with layers $n=1$, $n=2$, $n=3$; the evaluation direction runs from the leaves up to the root.]

We can think of a sequence of operations in terms of a tree structure. We start at the leaves (inputs), and with each operation step we move up a layer of the tree, until we reach the root (output). Let $n$ be the depth (number of steps) of the tree structure, so layer $1$ holds the inputs (leaf nodes) and layer $n$ holds the output (root).

The gradient evaluation sequence using the chain rule looks like the following: \(\begin{aligned} \frac{\partial{z}}{\partial{x}} = (\frac{\partial{z}}{\partial{h_{n-1}}}(\frac{\partial{h_{n-1}}}{\partial{h_{n-2}}}\cdots(\frac{\partial{h_2}}{\partial{h_1}}(\frac{\partial{h_1}}{\partial{x}}\frac{\partial{x}}{\partial{x}})))\cdots) \end{aligned}\) where the $h_i$ are intermediate expressions and $z$ is the output expression. Essentially, we drill the expression down to its most elementary components - a variable, a sum of 2 variables, a product of 2 variables. After evaluating those, the resulting values are used to evaluate progressively more complicated compound expressions, all the way up to the final expression. This is why forward mode is also called the “bottom-up” approach.

Example

$z = x^2 + yxy$

With symbolic differentiation, $\frac{\partial{z}}{\partial{x}} = 2x + y^2$ and $\frac{\partial{z}}{\partial{y}} = 2yx$.

With auto differentiation, the derivatives of the most elementary operations such as addition and multiplication are stored. The derivative values of composite functions are then reconstructed from these elementary “base case” results by applying the chain rule.

For this example, define the 3 elementary derivative results for variables, addition, and multiplication (a code sketch follows the list).

1. Variable
$$\begin{aligned} &f = x\\ &\frac{\partial{f}}{\partial{x}} = 1 \end{aligned}$$
2. Addition
$$\begin{aligned} &f = a + b\\ &\frac{\partial{f}}{\partial{x}} = \frac{\partial{a}}{\partial{x}} + \frac{\partial{b}}{\partial{x}} \\ \\ &f = a + b + c\\ &\frac{\partial{f}}{\partial{x}} = \frac{\partial{a}}{\partial{x}} + \frac{\partial{b}}{\partial{x}} + \frac{\partial{c}}{\partial{x}} \end{aligned}$$
3. Multiplication
$$\begin{aligned} &f = ab\\ &\frac{\partial{f}}{\partial{x}} = \frac{\partial{a}}{\partial{x}}b + a\frac{\partial{b}}{\partial{x}}\\ \\ &f = abc\\ &\frac{\partial{f}}{\partial{x}} = bc\frac{\partial{a}}{\partial{x}} + ac\frac{\partial{b}}{\partial{x}} + ab\frac{\partial{c}}{\partial{x}} \end{aligned}$$
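These base cases translate almost directly into code. Below is a minimal Python sketch, assuming a hypothetical `Dual` class that carries a value together with its derivative with respect to the seeded variable; it mirrors the two-argument rules above rather than any particular library:

```python
class Dual:
    """Carries (value, derivative w.r.t. the seeded variable) through the expression."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Addition rule: d(a + b)/dx = da/dx + db/dx
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Multiplication (product) rule: d(ab)/dx = (da/dx) b + a (db/dx)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
```

A variable by itself is a `Dual` seeded with derivative 1 (if it is the variable we are differentiating with respect to) or 0 (otherwise), which is exactly rule 1.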

Break the expression down into the following subexpressions: \(\begin{aligned} h_1 = x * x\\ h_2 = y * x\\ h_3 = h_2 * y\\ h_4 = h_1 + h_3\\ \end{aligned}\)

I will slightly abuse notation below: $\dot{h}$ refers to the partial derivative of $h$ with respect to the variable currently being differentiated against. The variables $x$ and $y$ are assigned the numerical values 6 and 7.

Partial $x$
$$\begin{aligned} \dot{x}=\frac{\partial{x}}{\partial{x}} &= 1, \dot{y}=\frac{\partial{y}}{\partial{x}} = 0\\ x &= 6,y = 7 \\ \dot{h_1} &= \dot{x}x +x\dot{x} = 12\\ \dot{h_2} &= y\dot{x} = 7(1) = 7\\ \dot{h_3} &= \dot{h_2}y = 7(7) = 49\\ \dot{h_4} &= \dot{h_1}+\dot{h_3} = 12+49 = 61\\ \end{aligned}$$
[Figure: computational graph annotated with (value, $\frac{\partial}{\partial{x}}$) pairs at each node: $x=(6,1)$, $y=(7,0)$, $x*x=(36,12)$, $y*x=(42,7)$, $(y*x)*y=(294,49)$, $z=(330,61)$.]
Partial $y$
$$\begin{aligned} \dot{x}=\frac{\partial{x}}{\partial{y}} &= 0, \dot{y}=\frac{\partial{y}}{\partial{y}} = 1\\ x &= 6,y = 7 \\ \dot{h_1} &= 0\\ \dot{h_2} &= \dot{y}x = 6\\ \dot{h_3} &= \dot{h_2}y+h_2\dot{y} = 6(7)+42(1) = 84\\ \dot{h_4} &= \dot{h_1}+\dot{h_3} = 84\\ \end{aligned}$$
[Figure: computational graph annotated with (value, $\frac{\partial}{\partial{y}}$) pairs at each node: $x=(6,0)$, $y=(7,1)$, $x*x=(36,0)$, $y*x=(42,6)$, $(y*x)*y=(294,84)$, $z=(330,84)$.]

We now verify that the results match what we get by plugging the numbers into the symbolic derivatives: $\frac{\partial{z}}{\partial{x}} = 2x + y^2 = 2(6) + 7^2 = 61$ and $\frac{\partial{z}}{\partial{y}} = 2yx = 2(7)(6) = 84$, which agrees with the forward mode autodiff results.

Code

The Python and Julia implementations for the above example are given below. Note that the Python version is based on evaluating 2 arguments, whereas the Julia version handles multiple arguments. The Julia implementation is a bit unique in that it uses metaprogramming. For instance, `Expr(:call, :+, a, b, c)` constructs the expression `a + b + c`, and calling `eval` on that expression evaluates the sum.
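As a rough stand-in for the two-argument Python version (reusing the illustrative `Dual` sketch from earlier, not the original implementation), the two seeded forward passes for $z = x^2 + yxy$ look like this:

```python
def z(x, y):
    return x * x + y * x * y   # z = x^2 + yxy, built from the elementary ops

# Pass for dz/dx: seed x with derivative 1, y with 0
dz_dx = z(Dual(6, 1.0), Dual(7, 0.0))
# Pass for dz/dy: seed y with derivative 1, x with 0
dz_dy = z(Dual(6, 0.0), Dual(7, 1.0))

print(dz_dx.value, dz_dx.deriv)  # 330 61.0
print(dz_dy.value, dz_dy.deriv)  # 330 84.0
```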

Recall the definition of the Jacobian

Jacobian definition
$$\begin{equation} \boldsymbol{J}=\frac{\partial{\boldsymbol{f}}}{\partial{\boldsymbol{x}}}=\begin{bmatrix} \frac{\partial{f_1}}{\partial{x_1}} & \cdots & \frac{\partial{f_1}}{\partial{x_n}}\\ \vdots & \ddots & \vdots\\ \frac{\partial{f_n}}{\partial{x_1}} & \cdots & \frac{\partial{f_n}}{\partial{x_n}}\\ \end{bmatrix} \end{equation}$$ where $\frac{\partial{f_i}}{\partial{x_j}}$ corresponds to the $i^{th}$ row, $j^{th}$ column entry of the Jacobian.

Given $n$ independent input variables $x_1,x_2 \cdots x_n$ and $m$ dependent output variables $z_1,z_2 \cdots z_m$, we can construct the Jacobian as \(\begin{equation} \boldsymbol{J}=\frac{\partial{\boldsymbol{z}}}{\partial{\boldsymbol{x}}}=\begin{bmatrix} \frac{\partial{z_1}}{\partial{x_1}} & \cdots & \frac{\partial{z_1}}{\partial{x_n}}\\ \vdots & \ddots & \vdots\\ \frac{\partial{z_m}}{\partial{x_1}} & \cdots & \frac{\partial{z_m}}{\partial{x_n}}\\ \end{bmatrix}\\ \end{equation}\)

One forward pass for a variable computes one column of the Jacobian. Hence, the full Jacobian can be computed in $n$ forward passes.
The forward mode AD is more efficient when the number of input variables $n$ is smaller than the number of outputs $m$, i.e. $f:\mathbb{R}^n \to \mathbb{R}^m$ with $n \ll m$. The extreme case is a one-to-many mapping, $f:\mathbb{R} \to \mathbb{R}^m$, where only 1 forward pass is required to compute the derivatives of all the outputs. \(\begin{equation} x \to \fbox{Forward AD(f(x))} \to \begin{bmatrix} \frac{\partial{z_1}}{\partial{x}} \\ \vdots \\ \frac{\partial{z_m}}{\partial{x}}\\ \end{bmatrix}\\ \end{equation}\)
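To make the “one pass per column” point concrete, here is a small Python sketch that again reuses the illustrative `Dual` class; the helper name `jacobian_forward` and the toy $\mathbb{R}^2 \to \mathbb{R}^3$ map are assumptions made here purely for illustration:

```python
def jacobian_forward(f, inputs):
    """Jacobian via forward mode: one seeded pass per input variable (one column each)."""
    columns = []
    for i in range(len(inputs)):
        # Seed the i-th input with derivative 1, all others with 0
        duals = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(inputs)]
        outputs = f(*duals)
        columns.append([out.deriv for out in outputs])
    # Transpose the columns into rows: J[i][j] = dz_i / dx_j
    return [list(row) for row in zip(*columns)]

def f(x, y):
    # A toy R^2 -> R^3 map built only from the supported ops (+, *)
    return [x * y, x + y, x * x + y]

print(jacobian_forward(f, [6.0, 7.0]))
# [[7.0, 6.0], [1.0, 1.0], [12.0, 1.0]]
```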

When $n \gg m $, the reverse mode AD is preferable.

Reverse mode

The reverse mode auto differentiation (AD) first evaluates the full expression in the forward pass, then computes gradients in the backwards pass.

[Figure: expression tree with layers $n=1$, $n=2$, $n=3$. The forward step evaluates the numerical values of the subexpressions from the leaves to the root; the backwards step evaluates gradients at the root and propagates them back down to the leaves.]

Thinking in terms of a tree, we start at the leaves (inputs) and progress all the way up to the root (output) in the forward pass. Then, in the backward pass, we evaluate gradients at the root and propagate them backwards down towards each of the leaves (inputs). If this reminds you of backpropagation, it is because the backpropagation algorithm is a kind of reverse mode AD.

The gradient evaluation sequence using the chain rule during the backwards step looks like the following: \(\begin{aligned} \frac{\partial{z}}{\partial{x}} = (\cdots((\frac{\partial{z}}{\partial{z}}\frac{\partial{z}}{\partial{h_{n-1}}})\frac{\partial{h_{n-1}}}{\partial{h_{n-2}}})\cdots\frac{\partial{h_1}}{\partial{x}}) \end{aligned}\) where the $h_i$ are intermediate expressions and $z$ is the output expression. The gradients are computed from the output expression (root) backwards towards the variables. This is why the reverse mode is also called the “top-down” approach.

Example

$z = x(x + y) + yxy$

With symbolic differentiation, $\frac{\partial{z}}{\partial{x}} = 2x + y + y^2$ and $\frac{\partial{z}}{\partial{y}} = x + 2yx$.

For autodiff, we first break the expression down into the following subexpressions: \(\begin{aligned} h_1 = x + y\\ h_2 = x * h_1\\ h_3 = x * y\\ h_4 = y * h_3\\ h_5 = h_2 + h_4\\ \end{aligned}\)

Again, I will slightly abuse notation below: $\bar{h_i}$ refers to the partial derivative of the output $z$ with respect to the intermediate variable $h_i$, i.e. $\bar{h_i}=\frac{\partial{z}}{\partial{h_i}}$. The variables $x$ and $y$ are assigned the numerical values 6 and 7.

Forward step
$$\begin{aligned} x &= 6,y = 7 \\ h_1 &= 6 + 7 = 13\\ h_2 &= 6(13) = 78\\ h_3 &= 6(7) = 42\\ h_4 &= 7(42) = 294\\ h_5 &= 78 + 294 = 372\\ \end{aligned}$$
[Figure: computational graph after the forward step, annotated with node values: $x=6$, $y=7$, $h_1=13$, $h_2=78$, $h_3=42$, $h_4=294$, $z=372$.]
Backward step
$$\begin{aligned} x &= 6,y = 7 \\ \bar{h_5} &= \frac{\partial{z}}{\partial{z}} = 1\\ \bar{h_4} &= \bar{h_5}\frac{\partial{h_5}}{\partial{h_4}} = 1(1) = 1\\ \bar{h_2} &= \bar{h_5}\frac{\partial{h_5}}{\partial{h_2}} = 1(1) = 1\\ \bar{h_3} &= \bar{h_4}\frac{\partial{h_4}}{\partial{h_3}} = 1(7) = 7\\ \frac{\partial{z}}{\partial{y}} &= \bar{h_4}\frac{\partial{h_4}}{\partial{y}} = 1(42) = 42\\ \frac{\partial{z}}{\partial{x}} &= \bar{h_3}\frac{\partial{h_3}}{\partial{x}} = 7(7) = 49\\ \frac{\partial{z}}{\partial{y}} &= \bar{h_3}\frac{\partial{h_3}}{\partial{y}} = 7(6) = 42\\ \bar{h_1} &= \bar{h_2}\frac{\partial{h_2}}{\partial{h_1}} = 1(6) = 6\\ \frac{\partial{z}}{\partial{x}} &= \bar{h_2}\frac{\partial{h_2}}{\partial{x}} = 1(13) = 13\\ \frac{\partial{z}}{\partial{x}} &= \bar{h_1}\frac{\partial{h_1}}{\partial{x}} = 6(1) = 6\\ \frac{\partial{z}}{\partial{y}} &= \bar{h_1}\frac{\partial{h_1}}{\partial{y}} = 6(1) = 6\\ \text{Total } \frac{\partial{z}}{\partial{x}} &= 49 + 13 + 6 = \fbox{68}\\ \text{Total } \frac{\partial{z}}{\partial{y}} &= 42 + 42 + 6 = \fbox{90}\\ \end{aligned}$$
[Figure: computational graph after the backward step, annotated with the gradient seed $\bar{z}=1$, the per-edge contributions listed above, and the accumulated totals $\frac{\partial{z}}{\partial{x}}=68$, $\frac{\partial{z}}{\partial{y}}=90$ at the leaves.]

We now verify that the results match what we get by plugging the numbers into the symbolic derivatives: $\frac{\partial{z}}{\partial{x}} = 2x + y + y^2 = 2(6) + 7 + 7^2 = 68$ and $\frac{\partial{z}}{\partial{y}} = x + 2yx = 6 + 2(7)(6) = 90$, which agrees with the reverse mode autodiff results.

Code

The Python and Julia implementations for the above example are given below.
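As a rough stand-in for the Python version (an illustrative sketch rather than the original implementation; the class name `Var` and its recursive `backward` helper are assumptions made here), a minimal reverse-mode evaluator for $z = x(x+y) + yxy$ could look like this:

```python
class Var:
    """Node in the computational graph: stores a value, its parents, and their local gradients."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, local_gradient) pairs
        self.grad = 0.0          # accumulated dz/d(this node)

    def __add__(self, other):
        # d(a + b)/da = 1, d(a + b)/db = 1
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(a * b)/da = b, d(a * b)/db = a
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Accumulate dz/d(this node), then push the gradient down to the parents.
        # (The naive recursion revisits shared subgraphs, which is fine for this small example.)
        self.grad += seed
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)

x, y = Var(6.0), Var(7.0)
z = x * (x + y) + y * x * y   # forward step: builds the graph and evaluates the values
z.backward()                  # backward step: propagates gradients down to the leaves
print(z.value, x.grad, y.grad)  # 372.0 68.0 90.0
```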

The reverse mode AD is more efficient when the number of input variables $n$ is greater than the number of outputs $m$, i.e. $f:\mathbb{R}^n \to \mathbb{R}^m$ with $n \gg m$. The extreme case is a many-to-one mapping, $f:\mathbb{R}^n \to \mathbb{R}$. \(\begin{equation} x_1,x_2 \cdots x_n \to \fbox{Reverse AD(f(x))} \to \begin{bmatrix} \frac{\partial{z}}{\partial{x_1}} \ \cdots \ \frac{\partial{z}}{\partial{x_n}}\\ \end{bmatrix}\\ \end{equation}\) This is the case for neural networks, where it is very common to have many input variables but only a single output, e.g. a loss or a classification score. This is why backpropagation is the favoured method for computing gradients in the deep learning world.
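To illustrate the many-to-one case with the `Var` sketch above (the product function below is just a toy example chosen here), a single backward pass returns all $n$ partial derivatives at once:

```python
xs = [Var(float(v)) for v in (1, 2, 3, 4, 5)]   # n = 5 inputs

z = xs[0]
for v in xs[1:]:
    z = z * v            # z = x1 * x2 * x3 * x4 * x5 (one scalar output)

z.backward()             # one reverse pass
print([v.grad for v in xs])  # [120.0, 60.0, 40.0, 30.0, 24.0]
```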
