3-layer neural network
In this example, we have two inputs, \(x_1\) and \(x_2\), that go through three layers before producing the final output, consisting of \(y_1\) and \(y_2\). Each layer \(l\) of neurons computes a weighted sum of the previous layer's activations, then applies an activation function to produce its own output.
This type of neural network, where information flows in only one direction - from the input layer to the output layer - is called a feedforward neural network.
The “training process”, then, consists of finding the optimal weights and biases in the network such that some cost function is minimized. The cost function can be seen as how poorly the network performs, so lower is better. The hope is that given enough training data and iterations, we can reduce the cost function to a minimum, which by extension means the network will do a pretty good job of predicting an output.
Below is some notation to formalize this idea.
But how do we adjust the weights and biases? A common method is gradient descent. Since we want to minimize the cost function, gradient descent essentially finds the gradient of the function at some point, then moves the weights and biases in the opposite direction of the gradient. Keep in mind that gradient descent only finds a local minimum, not necessarily a global one.
Finding local minima via gradient descent
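To make this concrete, here is a minimal sketch of gradient descent on a toy one-dimensional cost \(f(w) = (w - 3)^2\). The cost function, learning rate, and step count here are illustrative choices, not part of the derivation:

```python
def gradient_descent(w0, lr=0.1, steps=100):
    """Minimize f(w) = (w - 3)^2 by repeatedly stepping opposite the gradient."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)  # gradient of the cost at the current point
        w -= lr * grad      # move in the opposite direction of the gradient
    return w

w_final = gradient_descent(w0=0.0)  # converges toward the minimum at w = 3
```

In a neural network, `w` becomes the full set of weights and biases, and computing `grad` efficiently is exactly the problem backpropagation solves.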
How to efficiently find the gradient of the cost function with respect to each of the weights and biases is where backpropagation comes into play.
For each layer \(l\), we want to find the gradient of the cost function with respect to its weights and biases, or \(\begin{align}\frac{\partial C}{\partial W_l}\end{align}\) and \(\begin{align}\frac{\partial C}{\partial b_l}\end{align}\) respectively. The key intuition is that the derivatives of each layer can be calculated using the derivatives of the next. So by iterating backwards starting from the output layer, we can cache calculated values and reuse them on the next iteration.
The key ingredient of backpropagation is the chain rule of calculus. That is, if \(z = f(y)\) and \(y = g(x)\), then: \(\begin{align} \frac{\partial z}{\partial x} &= \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}.\\ \end{align}\)
Recall the relationship between \(W_l\), \(z_l\), \(a_l\), and \(C\). \(W_l\) is left-multiplied with the output activation from the previous layer \(a_{l-1}\), and the bias \(b_l\) is added, to form \(z_l\). \(z_l\) is passed into our activation function \(\sigma\) to form \(a_l\). This process is repeated until the final activation \(a_L\) is passed to the cost function \(C\).
\[\begin{align} z_l &= W_la_{l-1} + b_l\\ a_l &= \sigma(z_l)\\ C(a_L) &= \frac{1}{2}(a_{L} - y)^2\\ \end{align}\]Deriving \(\begin{align}\frac{\partial C}{\partial W_L}\end{align}\) for the output layer is thus: \(\begin{align} \frac{\partial C}{\partial W_L} &= \frac{\partial C}{\partial a_L}\frac{\partial a_L}{\partial z_L}\frac{\partial z_L}{\partial W_L}\\ &= \frac{\partial}{\partial a_{L}}\frac{1}{2}(a_{L} - y)^2 \odot \sigma'(z_L) \frac{\partial z_L}{\partial W_L}\\ \tag{1.1} &= (a_L - y) \odot \sigma'(z_L) \frac{\partial z_L}{\partial W_L}\\ \end{align}\)
where \(\odot\) is the element-wise product, or Hadamard product, as opposed to the regular matrix product. The derivative of the sigmoid function \(\sigma'(x)\) is conveniently:
\[\begin{align} \sigma'(x) &= \sigma(x)(1 - \sigma(x))\\ \end{align}\]Here we can simplify our notation by defining a new value, \(\delta_l\), as the partial derivative of the cost with respect to our weighted input \(z_l\):
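This identity is easy to sanity-check numerically. A small sketch (the test point `0.3` is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare against a centered finite difference at an arbitrary point.
h = 1e-6
numerical = (sigmoid(0.3 + h) - sigmoid(0.3 - h)) / (2 * h)
```

Note that if the *activation* \(a = \sigma(z)\) has already been computed during the forward pass, the derivative can be obtained for free as `a * (1 - a)`.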
\[\begin{align} \delta_l = \frac{\partial C}{\partial z_l} \end{align}\]So for our output layer \(L\):
\[\begin{align} \tag{2.1} \delta_L = \frac{\partial C}{\partial z_L} = (a_L - y) \odot \sigma'(z_L)\\ \end{align}\]We can now substitute \(\delta_L\) into Equation 1.1:
\[\begin{align} \frac{\partial C}{\partial W_L} &= \delta_L\frac{\partial z_L}{\partial W_L}\\ &= \delta_L\frac{\partial (W_La_{L-1} + b_L)}{\partial W_L}\\ \tag{1.2} &= \delta_L(a_{L-1})^T \end{align}\]Observe that \(\delta_L\) is a column vector and \((a_{L-1})^T\) a row vector, so their product is an outer product whose dimensions match those of \(W_L\).
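The shape bookkeeping of Equations 2.1 and 1.2 can be checked with a quick NumPy sketch. The layer sizes and random values below are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_prev, n_out = 4, 3                       # illustrative layer sizes
a_prev = rng.normal(size=(n_prev, 1))      # a_{L-1}, cached from the forward pass
W = rng.normal(size=(n_out, n_prev))       # W_L
y = rng.normal(size=(n_out, 1))            # target output

z = W @ a_prev                             # z_L (bias omitted for brevity)
a = sigmoid(z)                             # a_L
delta = (a - y) * sigmoid(z) * (1 - sigmoid(z))  # Eq. 2.1, element-wise product
grad_W = delta @ a_prev.T                  # Eq. 1.2: outer product, shape of W_L
```

`grad_W` has the same shape as `W`, as required for the gradient descent update `W -= lr * grad_W`.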
Intuitively, \(\delta_l\) represents the “error,” or how sensitive the output of the cost function is in terms of the current layer’s weighted input \(z_l\). If this value is large, that means the cost can be significantly reduced given a small change in the weighted input, so the weighted input \(z_l\) is pretty far off from the desired value. Conversely, a small \(\delta\) means we can’t reduce the cost much more, and thus it is close to the desired value.
For the hidden layers, the relationship between the weights and the final cost is a bit more implicit. But essentially it is a repeated application of the chain rule from some \(W_l\), through the network, and ending at the cost function.
\[\begin{align} \tag{3.1} \frac{\partial C}{\partial W_l} &= \frac{\partial C}{\partial a_L}\frac{\partial a_L}{\partial z_L}\frac{\partial z_L}{\partial a_{L-1}}\frac{\partial a_{L-1}}{\partial z_{L-1}}\frac{\partial z_{L-1}}{\partial a_{L-2}} \cdots \frac{\partial a_{l+1}}{\partial z_{l+1}}\frac{\partial z_{l+1}}{\partial a_l}\frac{\partial a_l}{\partial z_l}\frac{\partial z_l}{\partial W_l}\\ \end{align}\]Given our definition of the error term \(\delta_l\) above, we can substitute \(\begin{align}\delta_{l+1} = \frac{\partial C}{\partial z_{l+1}}\end{align}\):
\[\begin{align} \tag{3.2} \frac{\partial C}{\partial W_l} &= \delta_{l+1}\frac{\partial z_{l+1}}{\partial a_l}\frac{\partial a_l}{\partial z_l}\frac{\partial z_l}{\partial W_l}\\ \tag{3.3} &= W_{l+1}^T\delta_{l+1}\frac{\partial a_l}{\partial z_l}\frac{\partial z_l}{\partial W_l}\\ \tag{3.4} &= W_{l+1}^T\delta_{l+1}\frac{\partial \sigma(z_l)}{\partial z_l}\frac{\partial z_l}{\partial W_l}\\ \tag{3.5} &= W_{l+1}^T\delta_{l+1}\odot \sigma'(z_l) \frac{\partial z_l}{\partial W_l}\\ \end{align}\]Alas, we arrive at the key insight of backpropagation. If you compare Equations 3.5 and 3.1, you’ll notice that the entire term \(\begin{align}W_{l+1}^T\delta_{l+1}\odot \sigma'(z_l)\end{align}\) is simply \(\begin{align}\frac{\partial C}{\partial z_l}\end{align}\), which is exactly the definition of the error term \(\delta_l\)!
\[\tag{4.1}\begin{align}\delta_l = W_{l+1}^T\delta_{l+1}\odot \sigma'(z_l)\end{align}\]What this means is that each error term \(\delta_l\) is defined in terms of its successor \(\delta_{l+1}\). So instead of calculating the entire gradient directly, we can start at the output layer and move backward, saving the error each time so it can be used by the previous layer on the next iteration. In algorithm design, this technique is known as dynamic programming, and it allows us to avoid unnecessary computations.
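The backward recursion of Equation 4.1 can be sketched directly. The layer sizes and the helper name `backward_deltas` below are illustrative, not from the original post:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backward_deltas(Ws_above, zs, delta_L):
    """Given the output-layer error delta_L, the cached weighted inputs zs[l] of
    each hidden layer, and Ws_above[l] = W_{l+1} (the weights just above layer l),
    return every layer's error term via Eq. 4.1, iterating backward."""
    deltas = [delta_L]
    for W_next, z in zip(reversed(Ws_above), reversed(zs)):
        # delta_l = W_{l+1}^T delta_{l+1}  (element-wise *)  sigma'(z_l)
        deltas.insert(0, (W_next.T @ deltas[0]) * sigmoid_prime(z))
    return deltas

# Illustrative shapes: one hidden layer of 3 neurons below an output layer of 2.
rng = np.random.default_rng(1)
delta_L = rng.normal(size=(2, 1))
Ws_above = [rng.normal(size=(2, 3))]   # W connecting hidden(3) -> output(2)
zs = [rng.normal(size=(3, 1))]         # cached hidden-layer weighted input
deltas = backward_deltas(Ws_above, zs, delta_L)
```

Each hidden layer's error is computed from the already-saved error above it, which is the dynamic-programming reuse described in the text.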
Finishing up our derivation and substituting in \(\delta_l\):
\[\begin{align} \tag{3.6} \frac{\partial C}{\partial W_l}&= \delta_l \frac{\partial z_l}{\partial W_l}\\ \tag{3.7} &= \delta_l \frac{\partial (W_la_{l-1} + b_l)}{\partial W_l}\\ \tag{3.8} &= \delta_l (a_{l-1})^T\\ \end{align}\]The last thing to note is that the use of the activations \(a_{l-1}\) means these values must be calculated and cached during the forward pass.
Using the above derivation, finding the bias update is much simpler.
\[\begin{align} \frac{\partial C}{\partial b_l} &= \frac{\partial C}{\partial z_l}\frac{\partial z_l}{\partial b_l}\\ &= \delta_l \frac{\partial(W_la_{l-1} + b_l)}{\partial b_l}\\ \tag{5.1} &= \delta_l \end{align}\]Now that we have all the pieces, we have an algorithm for training a neural network using backpropagation! To summarize: perform a forward pass, caching each \(z_l\) and \(a_l\); compute the output-layer error \(\delta_L\) with Equation 2.1; propagate the error backward with Equation 4.1; form the gradients with Equations 3.8 and 5.1; and finally move each weight and bias opposite its gradient.
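Putting the pieces together, here is a minimal sketch of one full training step for a small fully-connected network. The function name `train_step`, the layer sizes, and the learning rate are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(Ws, bs, x, y, lr=0.5):
    """One iteration of backpropagation + gradient descent on a single example."""
    # Forward pass: cache every z_l and a_l for use in the backward pass.
    a = x
    activations, zs = [x], []
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)

    # Backward pass: output-layer error (Eq. 2.1), then recurse with Eq. 4.1.
    delta = (activations[-1] - y) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))
    grads_W, grads_b = [None] * len(Ws), [None] * len(bs)
    for l in range(len(Ws) - 1, -1, -1):
        grads_W[l] = delta @ activations[l].T   # Eq. 3.8
        grads_b[l] = delta                      # Eq. 5.1
        if l > 0:
            delta = (Ws[l].T @ delta) * sigmoid(zs[l - 1]) * (1 - sigmoid(zs[l - 1]))

    # Gradient descent update: move opposite the gradient.
    for l in range(len(Ws)):
        Ws[l] -= lr * grads_W[l]
        bs[l] -= lr * grads_b[l]
    return Ws, bs
```

Repeating `train_step` over the training data drives the cost \(\frac{1}{2}(a_L - y)^2\) toward a local minimum, exactly as the gradient-descent picture earlier describes.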
And that is it! Hopefully that gives us some insight into a key piece of machinery that powers modern deep learning. In the next part of this series, we’ll implement a simple neural network in Python using the backpropagation algorithm we just derived.