# Neural Networks and Backpropagation: Part Two

This article continues from Neural Networks and Backpropagation: Part One, finishing the worked example of building a neural network to compute an OR gate.  If you have not yet read that article and its accompanying Jupyter notebook, take the time to do so now, as it is important for understanding the rest of the working.

Don’t forget to grab the Jupyter notebook for this article from GitHub here or by cloning:

```shell
git clone git@github.com:tinhatben/or_gate.git
```

## Where we left off

We have just completed the first pass of forward propagation through the network, computing the activations of each of the nodes.  Now it is time to execute backpropagation through the network.

## What is Backpropagation?

The process of backpropagation computes the amount that each weight in the network contributes to the final error produced by the network.  This allows us to determine how much we need to adjust the weights to produce an improvement in the network's performance.  What we are aiming to do is determine the ideal value for each of the weights in the network, such that we get the best performance, or lowest error score, for the network.  The backpropagation process involves determining the error at the output of the network and projecting it back through each of the layers and nodes in the network.

So the first thing we need to do is select a function that determines the error for the network.

## Network Cost Function

For the OR gate we are going to select the least squared error cost function:

$E = \frac{1}{2}\sum{(target-output)^2}$

$E = \frac{1}{2}\sum{(y_{train} - a_o)^2}$

Looking at this function in more detail, the reason for squaring the difference between the output and the target is simply to remove any effect of the sign of the error from the computation.  It doesn’t matter whether we are +0.5 or -0.5 away from the target; the size of the error is the same: 0.5.  The $\frac{1}{2}$ is a constant multiplier that makes differentiating the cost function easier.  For future reference, the derivative of the cost function with respect to the network output $a_o$ is:

$\frac{\partial{E}}{\partial{a_o}} = (a_o - y_{train})$

In our example $y_{train}$ is the target output of the OR gate.  As $i_1 = 1$ and $i_2 = 0$, then $y_{train} = 1$.

Referring back to Part One, the output of the network was:

$a_o = 0.512$

So,

$E = \frac{1}{2}\sum{(y_{train} - a_o)^2} =\frac{1}{2}(1 - 0.512)^2 = 0.119072$

The change in error with respect to the output of the network:

$\frac{\partial{E}}{\partial{a_o}} = (a_o - y_{train}) = (0.512 - 1) = -0.488$
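These two numbers are easy to verify in Python; a quick sketch using the values from the working above:

```python
# Least-squares cost and its derivative for a single training example,
# using the values from the working above (a_o = 0.512, y_train = 1).
a_o, y_train = 0.512, 1.0

E = 0.5 * (y_train - a_o) ** 2   # squared error, halved for easier differentiation
dE_da_o = a_o - y_train          # derivative of E with respect to a_o
```

Running this gives $E \approx 0.119072$ and $\frac{\partial{E}}{\partial{a_o}} \approx -0.488$, matching the working above.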

## Output Layer Backpropagation

Now that we have the error at the output, we need to project it back to determine the error component at $h_o$.  In order to do this we need the chain rule of differentiation:

$\frac{\partial{x}}{\partial{z}}=\frac{\partial{x}}{\partial{y}}\frac{\partial{y}}{\partial{z}}$

With this simple rule we can calculate how much every weight and node in the network contributes to the error.  So, calculating the error with respect to $h_o$:

$\frac{\partial{E}}{\partial{h_o}}=\frac{\partial{E}}{\partial{a_o}}\frac{\partial{a_o}}{\partial{h_o}}$

Now we already have the first half of the above equation in $\frac{\partial{E}}{\partial{a_o}}$; what we need is the rate of change of the non-linearity of the output node with respect to its input.  Recall from Part One that we are using the Sigmoid function as the non-linearity.  We will not cover the proof of the derivative of the sigmoid function in this article as it is reasonably straightforward to determine; but this derivative is exactly what we need: the rate of change of the non-linearity. So:

$g(z) = \frac{1}{1 + e^{-z}}$

$\frac{\partial{a_o}}{\partial{h_o}} = \frac{\partial{g}}{\partial{z}} = \sigma'(z)= g(z)(1 - g(z))$
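In code, the sigmoid and its derivative are one-liners; a small sketch:

```python
import math

def sigmoid(z):
    """The sigmoid non-linearity g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)
```

Note that if the activation $a = g(z)$ has already been computed during forward propagation, the derivative is simply `a * (1 - a)`, so the sigmoid never needs to be re-evaluated during backpropagation.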

We can now combine the two halves of the chain rule to determine the contribution of $h_o$ to the error:

$\frac{\partial{E}}{\partial{h_o}}=(a_o - y_{train})g(h_o)(1 - g(h_o))$

Recalling from Part One that $a_o = 0.512$ for $i_1 = 1, i_2 = 0$ we can calculate the output layer contribution to error:

$\frac{\partial{E}}{\partial{h_o}} = (0.512 - 1)*0.512(1 - 0.512) = -0.122$

Using the same chain rule of differentiation we can compute the error attributed to each of the weights:

$\frac{\partial{E}}{\partial{w_5}}=\frac{\partial{E}}{\partial{a_o}}\frac{\partial{a_o}}{\partial{h_o}}\frac{\partial{h_o}}{\partial{w_5}}$

The only part of this equation we are missing is $\frac{\partial{h_o}}{\partial{w_5}}$.  Again from Part One:

$h_o = a_1w_5 + a_2w_6 + b_3$

Thus:

$\frac{\partial{h_o}}{\partial{w_5}} = a_1$ and

$\frac{\partial{E}}{\partial{w_5}}=(a_o - y_{train})g(h_o)(1 - g(h_o)) * a_1$

$\frac{\partial{E}}{\partial{w_5}}=(0.512 - 1)*0.512(1 - 0.512) * 0.69 = -0.084$

Following the same pattern, let’s repeat the calculation for the other weight and the bias term in the output layer:

$\frac{\partial{E}}{\partial{w_6}}=\frac{\partial{E}}{\partial{a_o}}\frac{\partial{a_o}}{\partial{h_o}}\frac{\partial{h_o}}{\partial{w_6}}$

$h_o = a_1w_5 + a_2w_6 + b_3$

Thus:

$\frac{\partial{h_o}}{\partial{w_6}} = a_2$ and

$\frac{\partial{E}}{\partial{w_6}}=(a_o - y_{train})g(h_o)(1 - g(h_o)) * a_2$

$\frac{\partial{E}}{\partial{w_6}}=(0.512 - 1)*0.512(1 - 0.512) * 0.71 = -0.087$

Finally the error with respect to the bias:

$\frac{\partial{E}}{\partial{b_3}}=\frac{\partial{E}}{\partial{a_o}}\frac{\partial{a_o}}{\partial{h_o}}\frac{\partial{h_o}}{\partial{b_3}}$

Thus:

$\frac{\partial{h_o}}{\partial{b_3}} = 1$ and

$\frac{\partial{E}}{\partial{b_3}}=(a_o - y_{train})g(h_o)(1 - g(h_o)) * 1$

$\frac{\partial{E}}{\partial{b_3}}=(0.512 - 1)*0.512(1 - 0.512) = -0.122$
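These output-layer gradients can be checked numerically; a sketch using the values carried over from Part One ($a_1 = 0.69$, $a_2 = 0.71$, $a_o = 0.512$):

```python
# Output-layer gradients for the OR gate example (values from Part One).
a1, a2, a_o, y_train = 0.69, 0.71, 0.512, 1.0

delta_o = (a_o - y_train) * a_o * (1 - a_o)  # dE/dh_o, approximately -0.122
grad_w5 = delta_o * a1                       # approximately -0.084
grad_w6 = delta_o * a2                       # approximately -0.087
grad_b3 = delta_o * 1.0                      # approximately -0.122
```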

Now we have completed the backpropagation process for the output layer and its weights.  Moving on to the hidden layer, we essentially repeat the same process.

## Hidden Layer Backpropagation

Repeating the chain rule differentiation we can determine the amount of error attributed to $w_1, w_2, w_3, w_4, b_1$ and $b_2$.

$\frac{\partial{E}}{\partial{w_1}}=\frac{\partial{E}}{\partial{a_1}}\frac{\partial{a_1}}{\partial{h_1}}\frac{\partial{h_1}}{\partial{w_1}}$

Expanding on $\frac{\partial{E}}{\partial{a_1}}$

$\frac{\partial{E}}{\partial{a_1}} = \frac{\partial{E}}{\partial{h_o}}\frac{\partial{h_o}}{\partial{a_1}}$

From the output layer backpropagation:

$\frac{\partial{E}}{\partial{h_o}} = (a_o - y_{train})g(h_o)(1 - g(h_o))$

From this point on we will refer to $\frac{\partial{E}}{\partial{h_o}}$ as $\delta_o$.  Try to keep in mind that $\delta_o$ is the product of the change in error with respect to the output activation $a_o$ and the derivative of the Sigmoid function $\sigma'$.

Continuing with determining the components of the chain rule for $\frac{\partial{E}}{\partial{a_1}}$:

$\frac{\partial{h_o}}{\partial{a_1}} = \frac{\partial{}}{\partial{a_1}}(b_3 + a_1w_5 + a_2w_6) = w_5$

So,

$\frac{\partial{E}}{\partial{a_1}} = (a_o - y_{train})g(h_o)(1 - g(h_o))w_5$

As with all derivatives of sigmoid activations with respect to their inputs:

$\frac{\partial{a}}{\partial{h}} = g(h)(1 - g(h))$

$\frac{\partial{a_1}}{\partial{h_1}} = g(h_1)(1 - g(h_1))$

Now:

$\frac{\partial{h_1}}{\partial{w_1}}= \frac{\partial{}}{\partial{w_1}} (b_2 + i_1w_1 + i_2w_3) = i_1$

So putting it all together:

$\frac{\partial{E}}{\partial{w_1}}=\frac{\partial{E}}{\partial{a_1}}\frac{\partial{a_1}}{\partial{h_1}}\frac{\partial{h_1}}{\partial{w_1}}$

$\frac{\partial{E}}{\partial{w_1}}= \delta_ow_5\sigma'(h_1)i_1$

The process described above forms the pattern for deriving all of the error components for the hidden layer weights.  The next obvious weight is $w_3$ which only differs from $\frac{\partial{E}}{\partial{w_1}}$ in that the input is different, so:

$\frac{\partial{E}}{\partial{w_3}}= \delta_ow_5\sigma'(h_1)i_2$

Looking at the bias weight $b_2$, which is connected to $h_1$:

$\frac{\partial{h_1}}{\partial{b_2}} = 1$

$\therefore \frac{\partial{E}}{\partial{b_2}}= \delta_ow_5\sigma'(h_1)$

Now let’s repeat the process to compute $\frac{\partial{E}}{\partial{w_2}}$ , $\frac{\partial{E}}{\partial{w_4}}$ and $\frac{\partial{E}}{\partial{b_1}}$:

$\frac{\partial{E}}{\partial{w_2}}=\frac{\partial{E}}{\partial{a_2}}\frac{\partial{a_2}}{\partial{h_2}}\frac{\partial{h_2}}{\partial{w_2}}$

Again, from the output layer and as per above:

$\frac{\partial{E}}{\partial{a_2}} = \frac{\partial{E}}{\partial{h_o}}\frac{\partial{h_o}}{\partial{a_2}} = (a_o - y_{train})g(h_o)(1 - g(h_o))w_6$

So:

$\frac{\partial{E}}{\partial{w_2}} = (a_o - y_{train})g(h_o)(1 - g(h_o))w_6g(h_2)(1 - g(h_2))i_1$

We can repeat this process for $\frac{\partial{E}}{\partial{w_4}}$

$\frac{\partial{E}}{\partial{w_4}} = (a_o - y_{train})g(h_o)(1 - g(h_o))w_6g(h_2)(1 - g(h_2))i_2$

and:

$\frac{\partial{E}}{\partial{b_1}} = (a_o - y_{train})g(h_o)(1 - g(h_o))w_6g(h_2)(1 - g(h_2))$

Calculating the corresponding values:

$\frac{\partial{E}}{\partial{h_o}} = \delta_o = (a_o - y_{train})g(h_o)(1 - g(h_o)) = -0.122$

The output layer values as above:

$\frac{\partial{E}}{\partial{b_3}} = -0.122$

$\frac{\partial{E}}{\partial{w_5}} =-0.084$

$\frac{\partial{E}}{\partial{w_6}} = -0.087$

$\frac{\partial{E}}{\partial{w_1}} = \delta_ow_5\sigma'(h_1)i_1 = -0.122(0.01)(1) = -0.00122$

$\frac{\partial{E}}{\partial{w_2}} = \delta_ow_6\sigma'(h_2)i_1 = -0.122(0.02)(1) = -0.00244$

$\frac{\partial{E}}{\partial{w_3}} = \delta_ow_5\sigma'(h_1)i_2 = 0$

$\frac{\partial{E}}{\partial{w_4}} = \delta_ow_6\sigma'(h_2)i_2 = 0$

$\frac{\partial{E}}{\partial{b_2}} = \delta_ow_5\sigma'(h_1) = -0.00122$

$\frac{\partial{E}}{\partial{b_1}} = \delta_ow_6\sigma'(h_2) = -0.00244$
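The hidden-layer gradients follow the same pattern in code.  Note that $\sigma'(h_1)$ can be computed directly from the stored activation as $a_1(1 - a_1)$.  The values of $w_5$ and $w_6$ below are placeholders for illustration; substitute the actual weights from Part One's notebook:

```python
# Hidden-layer gradients for i1 = 1, i2 = 0, using a1 = 0.69 and a2 = 0.71
# carried over from Part One.  w5 and w6 are hypothetical placeholder values.
i1, i2 = 1.0, 0.0
a1, a2, a_o, y_train = 0.69, 0.71, 0.512, 1.0
w5, w6 = 0.05, 0.10  # placeholders; substitute the real weights

delta_o = (a_o - y_train) * a_o * (1 - a_o)  # output-layer delta
sig_h1 = a1 * (1 - a1)                       # sigma'(h1) = a1(1 - a1)
sig_h2 = a2 * (1 - a2)                       # sigma'(h2) = a2(1 - a2)

grad_w1 = delta_o * w5 * sig_h1 * i1
grad_w2 = delta_o * w6 * sig_h2 * i1
grad_w3 = delta_o * w5 * sig_h1 * i2         # zero, since i2 = 0
grad_w4 = delta_o * w6 * sig_h2 * i2         # zero, since i2 = 0
grad_b2 = delta_o * w5 * sig_h1
grad_b1 = delta_o * w6 * sig_h2
```

With $i_2 = 0$, the gradients for $w_3$ and $w_4$ are zero for this training example: a weight attached to a zero input cannot have contributed to the error.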

## Generalised Form

Reviewing the details of the backpropagation process above it can be observed that there is a general pattern to the process.  Let’s say that we are completing the process for layer $l$ and $l + 1$ is the next layer in the network closer to the output layer.  We can calculate:

$\frac{\partial{E}}{\partial{h_l}} = \delta_l = \delta_{l+1}w_{l+1}\sigma'(h_l)$

$\frac{\partial{E}}{\partial{w_l}} =\delta_la_{l-1}$

where $a_{l-1}$ are the activations of the layer before the current layer $l$.

When coding this process for a network with a number of layers, we can now simply loop through the layers, applying the generalised form at each one.
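A minimal NumPy sketch of this loop (the function layout and variable names are my own, not from the notebook; `weights[l]` is assumed to hold the matrix mapping layer $l$'s inputs to its pre-activations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(weights, biases, x, y):
    """Backpropagate the least-squares error through a sigmoid network.

    weights[l] has shape (n_out, n_in); biases[l] has shape (n_out,).
    Returns the gradients of E = 0.5 * sum((y - a)^2) for every layer.
    """
    # Forward pass, storing the activation of every layer (a[0] is the input).
    a = [x]
    for W, b in zip(weights, biases):
        a.append(sigmoid(W @ a[-1] + b))

    # Output layer: delta = (a_o - y) * sigma'(h_o) = (a_o - y) * a_o * (1 - a_o).
    delta = (a[-1] - y) * a[-1] * (1 - a[-1])

    grads_W, grads_b = [None] * len(weights), [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads_W[l] = np.outer(delta, a[l])  # dE/dW_l = delta_l * a_{l-1}
        grads_b[l] = delta                  # dE/db_l = delta_l
        if l > 0:
            # delta_{l-1} = (W_l^T delta_l) * sigma'(h_{l-1})
            delta = (weights[l].T @ delta) * a[l] * (1 - a[l])
    return grads_W, grads_b
```

Each layer's gradient only needs the delta from the layer above it, which is why a single backward loop is sufficient.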

## Update Weights

Now that we have computed each of the derivatives of the error with respect to the weights, we can update the values of the weights, hopefully reducing the error produced by the network.  To do this we need to introduce another term, the learning rate $\eta$: how much of the derivative is applied during the update to the weights.  This is an important term in neural networks: a learning rate that is too small will lead to a network that is very slow to train, while one that is too large can cause instability, and the network may never find a solution.  The weights are updated as follows:

$W \leftarrow W - \eta\frac{\partial{E}}{\partial{W}}$

where $\leftarrow$ indicates a simultaneous update of all weights in the network.
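As a single-weight sketch of this update, using $\frac{\partial{E}}{\partial{w_5}} = -0.084$ from the working above ($w_5 = 0.05$ and $\eta = 0.5$ are hypothetical values chosen for illustration):

```python
# One gradient-descent step on a single weight.
eta = 0.5                   # learning rate (hypothetical value)
w5, grad_w5 = 0.05, -0.084  # w5 is a placeholder; gradient from the working above

w5 = w5 - eta * grad_w5     # subtracting a negative gradient pushes the weight up
```

Because the gradient is negative, the update increases $w_5$, pushing $a_o$ up towards the target of 1.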

## Training Process

The training process for a network is essentially a repeated loop through the forward propagation, backpropagation and weight update processes, adjusting the weights until (hopefully) a minimum error is achieved.  This is a simplistic view: there are a number of technical aspects that must be handled correctly for training to succeed, and others that can speed the process up.  The details of training are left for a later post as they lie outside the exact workings of backpropagation.
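To make the loop concrete, here is a compact, self-contained sketch that trains a 2-2-1 sigmoid network on the full OR truth table (the random seed, initial weight scale and learning rate are hypothetical choices, not the notebook's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# OR gate truth table: four training examples.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [1]], dtype=float)

# Hypothetical initialisation: small random weights, zero biases.
rng = np.random.default_rng(42)
W1, b1 = rng.normal(scale=0.5, size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(scale=0.5, size=(1, 2)), np.zeros(1)

eta = 0.5  # learning rate
for epoch in range(10000):
    for x, y in zip(X, Y):
        # Forward propagation.
        a1 = sigmoid(W1 @ x + b1)
        a2 = sigmoid(W2 @ a1 + b2)
        # Backpropagation.
        delta2 = (a2 - y) * a2 * (1 - a2)
        delta1 = (W2.T @ delta2) * a1 * (1 - a1)
        # Weight update.
        W2 -= eta * np.outer(delta2, a1)
        b2 -= eta * delta2
        W1 -= eta * np.outer(delta1, x)
        b1 -= eta * delta1
```

After training, the network's output should round to the OR truth table for all four inputs.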

## In Closing

To reiterate, this process may take a few reads to fully grasp, but the effort is well worth it, particularly when debugging issues with training neural networks.  I also recommend reading the Jupyter notebook that accompanies this article, as it demonstrates the practical application of this working.

Don’t forget to grab the Jupyter notebook for this article from GitHub here or by cloning:

```shell
git clone git@github.com:tinhatben/or_gate.git
```