Neural Networks and Backpropagation: Part Two

This article continues from Neural Networks and Backpropagation: Part One and finishes working through the example of creating a neural network to compute an OR gate.  If you have not yet read Part One and its corresponding Jupyter notebook, take the time to do so now, as the rest of the working depends on it.

Don’t forget to grab the Jupyter notebook for this article from GitHub here or by cloning:
git clone

Where we left off

We have just completed the first pass of forward propagation through the network, computing the activations of each of the nodes.  Now it is time to execute backpropagation through the network.


What is Backpropagation?

The process of backpropagation computes the amount that each weight in the network contributes to the final error produced by the network.  This process allows us to determine how much we need to adjust the weights to produce an improvement in the network performance.  What we are aiming to do is determine the ideal value for each of the weights in the network, such that we get the best performance, or lowest error score, for the network.  The backpropagation process involves determining the error at the output of the network and projecting it back through each of the layers and nodes in the network.

So the first thing we need to do is select a function that determines the error for the network.

Network Cost Function

For the OR gate we are going to select the least squared error cost function:

E = \frac{1}{2}\sum{(target-output)^2}

E = \frac{1}{2}\sum{(y_{train} - a_o)^2}

Looking at this function in more detail, the reason for squaring the difference between the output and the target is simply to remove any effect of the sign of the error from the computation.  It doesn’t matter if we are +0.5 or -0.5 away from the target, the squared error is the same.  The \frac{1}{2} is a constant multiplier that makes differentiating the cost function easier, as it cancels the 2 produced by differentiating the square.  For future reference, the derivative of the cost function with respect to the network output a_o is:

\frac{\partial{E}}{\partial{a_o}} = (a_o - y_{train})

In our example y_{train} is the target output of the OR gate.  As i_1 = 1 and i_2 = 0, y_{train} = 1.

Referring back to Part One, the output of the network was:

a_o = 0.512


E = \frac{1}{2}\sum{(y_{train} - a_o)^2} =\frac{1}{2}(1 - 0.512)^2 = 0.119072

The change in error with respect to the output of the network:

\frac{\partial{E}}{\partial{a_o}} = (a_o - y_{train}) = (0.512 - 1) = -0.488
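These two numbers are easy to check in a few lines of Python.  This is only a minimal sketch: the function names `squared_error` and `squared_error_grad` are made up for illustration and are not taken from the accompanying notebook.

```python
def squared_error(y_train, a_o):
    """Half squared error for a single training example."""
    return 0.5 * (y_train - a_o) ** 2


def squared_error_grad(y_train, a_o):
    """Derivative of the half squared error with respect to the output a_o."""
    return a_o - y_train


y_train = 1.0  # OR gate target for i_1 = 1, i_2 = 0
a_o = 0.512    # network output from Part One

E = squared_error(y_train, a_o)
dE_da_o = squared_error_grad(y_train, a_o)
print(round(E, 6), round(dE_da_o, 3))
```

Note that the gradient is negative: the output is below the target, so increasing a_o reduces the error.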

Output Layer Backpropagation

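Now that we have the error at the output, we need to project it back to determine the error component at h_o.  In order to do this we need the chain rule of differentiation:

\frac{\partial{f}}{\partial{x}} = \frac{\partial{f}}{\partial{u}}\frac{\partial{u}}{\partial{x}}

With this simple rule we can calculate how much every weight and node in the network contributes to the error.  So, calculating the error with respect to h_o:

\frac{\partial{E}}{\partial{h_o}} = \frac{\partial{E}}{\partial{a_o}}\frac{\partial{a_o}}{\partial{h_o}}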

Now we already have the first half of the above equation in \frac{\partial{E}}{\partial{a_o}}, and what we need is the rate of change of the non-linearity (the activation function) of the output node with respect to its input.  Recalling from Part One, we are using the Sigmoid function as the non-linearity.  We will not cover the proof of the derivative of the sigmoid function in this article, as it is reasonably straightforward to determine, but this derivative is exactly what we need: the rate of change of the non-linearity.  So:

g(z) = \frac{1}{1 + e^{-z}}

\frac{\partial{a_o}}{\partial{h_o}} = \frac{\partial{g}}{\partial{z}} = \sigma'(z)= g(z)(1 - g(z))
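Although we are skipping the proof, we can at least sanity-check the derivative numerically.  The sketch below compares g(z)(1 - g(z)) against a central finite-difference estimate; the helper names and the test point z = 0.3 are arbitrary choices for illustration, not values from the article or the notebook.

```python
import math


def sigmoid(z):
    """Sigmoid activation g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid: g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)


def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference estimate of df/dz (for checking only)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


z = 0.3  # arbitrary test point
analytic = sigmoid_prime(z)
numeric = numerical_derivative(sigmoid, z)
print(analytic, numeric)
```

The two printed values should agree to many decimal places.  A handy consequence of this form is that once the activation a = g(h) is known from forward propagation, the derivative is simply a(1 - a), with no need to evaluate the exponential again.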

We can now combine the two halves of the chain rule to determine the contribution of h_o to the error:

\frac{\partial{E}}{\partial{h_o}}=(a_o - y_{train})g(h_o)(1 - g(h_o))

Recalling from Part One that a_o = 0.512 for i_1 = 1, i_2 = 0 we can calculate the output layer contribution to error:

\frac{\partial{E}}{\partial{h_o}} = (0.512 - 1)*0.512(1 - 0.512) = -0.122

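Using the same chain rule of differentiation we can compute the error attributable to each of the weights:

\frac{\partial{E}}{\partial{w_5}} = \frac{\partial{E}}{\partial{a_o}}\frac{\partial{a_o}}{\partial{h_o}}\frac{\partial{h_o}}{\partial{w_5}}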

The only part of this equation we are missing is \frac{\partial{h_o}}{\partial{w_5}}.  Again from Part One:

h_o = a_1w_5 + a_2w_6 + b_3


\frac{\partial{h_o}}{\partial{w_5}} = a_1 and

\frac{\partial{E}}{\partial{w_5}}=(a_o - y_{train})g(h_o)(1 - g(h_o)) * a_1

\frac{\partial{E}}{\partial{w_5}}=(0.512 - 1)*0.512(1 - 0.512) * 0.69 = -0.084

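Following the same pattern, let’s repeat the calculation for the other weight and the bias term in the output layer:

\frac{\partial{E}}{\partial{w_6}} = \frac{\partial{E}}{\partial{a_o}}\frac{\partial{a_o}}{\partial{h_o}}\frac{\partial{h_o}}{\partial{w_6}}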

Again from Neural Networks and Backpropagation: Part One:

h_o = a_1w_5 + a_2w_6 + b_3


\frac{\partial{h_o}}{\partial{w_6}} = a_2 and

\frac{\partial{E}}{\partial{w_6}}=(a_o - y_{train})g(h_o)(1 - g(h_o)) * a_2

\frac{\partial{E}}{\partial{w_6}}=(0.512 - 1)*0.512(1 - 0.512) * 0.71 = -0.087

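Finally the error with respect to the bias:

\frac{\partial{E}}{\partial{b_3}} = \frac{\partial{E}}{\partial{a_o}}\frac{\partial{a_o}}{\partial{h_o}}\frac{\partial{h_o}}{\partial{b_3}}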

\frac{\partial{h_o}}{\partial{b_3}} = 1 and

\frac{\partial{E}}{\partial{b_3}}=(a_o - y_{train})g(h_o)(1 - g(h_o)) * 1

\frac{\partial{E}}{\partial{b_3}}=(0.512 - 1)*0.512(1 - 0.512) = -0.122
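The three output-layer gradients can be reproduced in a short Python sketch.  It uses only values quoted in this article (a_o, a_1, a_2 and y_{train}); the variable names are chosen here for illustration and may differ from those in the notebook.

```python
a_o = 0.512            # output activation from Part One
a_1, a_2 = 0.69, 0.71  # hidden-layer activations from Part One
y_train = 1.0          # OR gate target for i_1 = 1, i_2 = 0

# Because a_o = g(h_o), the sigmoid derivative g(h_o)(1 - g(h_o))
# can be written directly in terms of a_o.
delta_o = (a_o - y_train) * a_o * (1.0 - a_o)

dE_dw5 = delta_o * a_1  # dh_o/dw_5 = a_1
dE_dw6 = delta_o * a_2  # dh_o/dw_6 = a_2
dE_db3 = delta_o * 1.0  # dh_o/db_3 = 1

print(round(dE_dw5, 3), round(dE_dw6, 3), round(dE_db3, 3))
```

Notice that the common factor (a_o - y_{train})g(h_o)(1 - g(h_o)) is computed once and reused for every weight in the layer, which is exactly what makes backpropagation cheap.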

Now we have completed the backpropagation process for the output layer weights and bias.  Moving on to the hidden layer, we essentially repeat the same process.

Hidden Layer Backpropagation


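Repeating chain rule differentiation we can determine the amount of error attributable to w_1, w_2, w_3, w_4, b_1 and b_2.  Starting with w_1:

\frac{\partial{E}}{\partial{w_1}} = \frac{\partial{E}}{\partial{a_1}}\frac{\partial{a_1}}{\partial{h_1}}\frac{\partial{h_1}}{\partial{w_1}}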

Expanding on \frac{\partial{E}}{\partial{a_1}}

\frac{\partial{E}}{\partial{a_1}} = \frac{\partial{E}}{\partial{h_o}}\frac{\partial{h_o}}{\partial{a_1}}

From the output layer backpropagation:

\frac{\partial{E}}{\partial{h_o}} = (a_o - y_{train})g(h_o)(1 - g(h_o))

From this point on we will refer to \frac{\partial{E}}{\partial{h_o}} as \delta_o.  Try to keep in mind that \delta_o is the product of the change in error with respect to the output activation a_o and the derivative of the Sigmoid function \sigma'.

Continuing with determining the components of the chain rule for \frac{\partial{E}}{\partial{a_1}}:

\frac{\partial{h_o}}{\partial{a_1}} = \frac{\partial{}}{\partial{a_1}}(b_3 + a_1w_5 + a_2w_6) = w_5


\frac{\partial{E}}{\partial{a_1}} = (a_o - y_{train})g(h_o)(1 - g(h_o))w_5

As with all derivatives of sigmoid activations with respect to their input:

\frac{\partial{a}}{\partial{h}} = g(h)(1 - g(h))

\frac{\partial{a_1}}{\partial{h_1}} = g(h_1)(1 - g(h_1))


\frac{\partial{h_1}}{\partial{w_1}}= \frac{\partial{}}{\partial{w_1}} (b_2 + i_1w_1 + i_2w_3) = i_1

So putting it all together:


\frac{\partial{E}}{\partial{w_1}}= \delta_ow_5\sigma'(h_1)i_1

The process described above forms the pattern for deriving all of the error components for the hidden layer weights.  The next obvious weight is w_3 which only differs from \frac{\partial{E}}{\partial{w_1}} in that the input is different, so:

\frac{\partial{E}}{\partial{w_3}}= \delta_ow_5\sigma'(h_1)i_2

Looking at the bias weight b_2 which is connected to h_1:

\frac{\partial{h_1}}{\partial{b_2}} = 1

\therefore \frac{\partial{E}}{\partial{b_2}}= \delta_ow_5\sigma'(h_1)

Now let’s repeat the process to compute \frac{\partial{E}}{\partial{w_2}} , \frac{\partial{E}}{\partial{w_4}} and \frac{\partial{E}}{\partial{b_1}}:


Again, from the output layer and as per above:

\frac{\partial{E}}{\partial{a_2}} = \frac{\partial{E}}{\partial{h_o}}\frac{\partial{h_o}}{\partial{a_2}} = (a_o - y_{train})g(h_o)(1 - g(h_o))w_6


\frac{\partial{E}}{\partial{w_2}} = (a_o - y_{train})g(h_o)(1 - g(h_o))w_6g(h_2)(1 - g(h_2))i_1

We can repeat this process for \frac{\partial{E}}{\partial{w_4}}:

\frac{\partial{E}}{\partial{w_4}} = (a_o - y_{train})g(h_o)(1 - g(h_o))w_6g(h_2)(1 - g(h_2))i_2

and:

\frac{\partial{E}}{\partial{b_1}} = (a_o - y_{train})g(h_o)(1 - g(h_o))w_6g(h_2)(1 - g(h_2))

Calculating the corresponding values:

\frac{\partial{E}}{\partial{h_o}} = \delta_o = (a_o - y_{train})g(h_o)(1 - g(h_o)) = -0.122

The output layer values as above:

\frac{\partial{E}}{\partial{b_3}} = -0.122

\frac{\partial{E}}{\partial{w_5}} =-0.084

\frac{\partial{E}}{\partial{w_6}} = -0.087

\frac{\partial{E}}{\partial{w_1}} = \delta_ow_5\sigma'(h_1)i_1 = -0.122(0.01)(1) = -0.00122

\frac{\partial{E}}{\partial{w_2}} = \delta_ow_6\sigma'(h_2)i_1 = -0.122(0.02)(1) = -0.00244

\frac{\partial{E}}{\partial{w_3}} = \delta_ow_5\sigma'(h_1)i_2 = 0

\frac{\partial{E}}{\partial{w_4}} = \delta_ow_6\sigma'(h_2)i_2 = 0

\frac{\partial{E}}{\partial{b_2}} = \delta_ow_5\sigma'(h_1) = -0.00122

\frac{\partial{E}}{\partial{b_1}} = \delta_ow_6\sigma'(h_2) = -0.00244
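The hidden-layer formulas derived in this section can be collected into one small helper.  This is a sketch under stated assumptions: `hidden_layer_gradients` is a hypothetical name, and because w_5 and w_6 are defined in Part One and not restated here, the weights below are placeholders.  Substitute the real values from Part One to reproduce the article's figures; with the placeholders the printed numbers will differ.

```python
def hidden_layer_gradients(delta_o, w_out, a_hidden, inputs):
    """Gradients for one hidden node that feeds the output node.

    delta_o  : dE/dh_o from the output layer
    w_out    : weight from this hidden node to the output (w_5 or w_6)
    a_hidden : activation of this hidden node (a_1 or a_2)
    inputs   : network inputs (i_1, i_2)
    """
    # Sigmoid derivative expressed through the activation: g(h)(1 - g(h)).
    sigma_prime = a_hidden * (1.0 - a_hidden)
    delta_h = delta_o * w_out * sigma_prime       # dE/dh for this hidden node
    weight_grads = [delta_h * i for i in inputs]  # dh/dw is the input on that weight
    bias_grad = delta_h                           # dh/db = 1
    return weight_grads, bias_grad


delta_o = -0.122       # from the output layer
i_1, i_2 = 1.0, 0.0    # inputs used throughout this example
a_1, a_2 = 0.69, 0.71  # hidden activations from Part One
w_5, w_6 = 0.5, 0.5    # PLACEHOLDER values, not the weights from Part One

# h_1 is fed by w_1, w_3 and b_2; h_2 is fed by w_2, w_4 and b_1.
(dE_dw1, dE_dw3), dE_db2 = hidden_layer_gradients(delta_o, w_5, a_1, (i_1, i_2))
(dE_dw2, dE_dw4), dE_db1 = hidden_layer_gradients(delta_o, w_6, a_2, (i_1, i_2))
print(dE_dw1, dE_dw2, dE_dw3, dE_dw4, dE_db1, dE_db2)
```

Whatever the weights are, the gradients for w_3 and w_4 come out as zero for this training example, because i_2 = 0 means those weights played no part in producing the output.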

Generalised Form

Reviewing the details of the backpropagation process above it can be observed that there is a general pattern to the process.  Let’s say that we are completing the process for layer l and l + 1 is the next layer in the network closer to the output layer.  We can calculate:

\frac{\partial{E}}{\partial{h_l}} = \delta_l = \delta_{l+1}w_{l+1}\sigma'(h_l)

\frac{\partial{E}}{\partial{w_l}} =\delta_la_{l-1}

where a_{l-1} is the activation of the layer before the current layer l.  For the first hidden layer, a_{l-1} is simply the network input.

When coding this process for a number of layers, we can simply loop backwards over the layers, applying the generalised form at each one.
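As a rough illustration, such a loop might look like the sketch below.  Two things here are assumptions rather than part of the article: the nested-list layout of `weights` and `activations`, and the sum over the nodes of the next layer, which is the usual extension of the generalised form to layers with more than one node and reduces to the expression above when the next layer has a single node.  The hidden-layer and output-layer weights in the usage example are placeholders.

```python
def backpropagate(weights, activations, delta_out):
    """Walk backwards through the layers applying the generalised form.

    weights[l][k][j] : weight from node j of layer l to node k of layer l + 1
    activations[l]   : activations of layer l (activations[0] is the input)
    delta_out        : list of dE/dh values for the output layer
    Returns (weight_grads, bias_grads), one entry per weight layer.
    """
    n_layers = len(weights)
    weight_grads = [None] * n_layers
    bias_grads = [None] * n_layers
    delta = delta_out
    for l in reversed(range(n_layers)):
        a_prev = activations[l]
        # dE/dw = delta of the receiving node * activation feeding the weight
        weight_grads[l] = [[d * a for a in a_prev] for d in delta]
        bias_grads[l] = list(delta)
        if l > 0:
            # delta_l = (sum over next-layer nodes of delta * w) * sigma'(h_l)
            delta = [
                sum(d * weights[l][k][j] for k, d in enumerate(delta))
                * a_prev[j] * (1.0 - a_prev[j])
                for j in range(len(a_prev))
            ]
    return weight_grads, bias_grads


# The 2-2-1 OR network from this article, with PLACEHOLDER weights.
weights = [
    [[0.1, 0.1], [0.1, 0.1]],  # [w_1, w_3] into h_1 and [w_2, w_4] into h_2
    [[0.5, 0.5]],              # [w_5, w_6] into h_o
]
activations = [[1.0, 0.0], [0.69, 0.71]]  # inputs, then hidden activations
weight_grads, bias_grads = backpropagate(weights, activations, [-0.122])
print(weight_grads, bias_grads)
```

The function never needs the pre-activation values h_l, because the sigmoid derivative can be recovered from the stored activations.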

Update Weights

Now that we have computed each of the derivatives of error with respect to the weights, we can update the values for the weights, hopefully to reduce the error produced by the network.  To do this we need to introduce another term, the learning rate \eta; this is how much of the derivative is to be applied during the update to the weights.  This is an important term in neural networks: a learning rate that is too small will lead to a network that is very slow to train, while a large value can cause instability in the network which may never find a solution.  The weights are updated as follows:

W \leftarrow W - \eta\frac{\partial{E}}{\partial{W}}

where \leftarrow indicates a simultaneous update of all weights in the network.
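A minimal sketch of the update step is shown below.  The learning rate of 0.5 and the starting weights are illustrative placeholders only (Part One defines the real weights); the gradients are the output-layer values calculated earlier in this article.

```python
def update_weights(weights, grads, eta):
    """Return the weights after one gradient descent step.

    Every new value is computed from the old weights and gradients,
    so the update is simultaneous.
    """
    return [w - eta * g for w, g in zip(weights, grads)]


eta = 0.5  # illustrative learning rate, not a value from the article

# Output-layer gradients from the worked example: dE/dw_5, dE/dw_6, dE/db_3
grads = [-0.084, -0.087, -0.122]
# PLACEHOLDER starting values for w_5, w_6 and b_3
weights = [0.1, 0.2, 0.3]

new_weights = update_weights(weights, grads, eta)
print(new_weights)
```

Because all three gradients are negative, subtracting them increases each weight, which pushes the output up towards the target of 1.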

Training Process

The training process for a network is essentially a repetitious loop through the forward propagation, backpropagation and weight update processes; adjusting the weights until hopefully a minimum error is achieved.  This is a very simplistic view and there are a number of technical aspects that must be correct for training to be successful, or can be executed to speed up the process.  The details of training are being left for a later post as they lie outside the exact workings of backpropagation.
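To make the loop concrete, here is a bare-bones sketch for the 2-2-1 OR network using only the Python standard library.  It is not the notebook's implementation: the dictionary of parameters, the random initialisation, the learning rate and the number of epochs are all assumptions made for illustration, and it updates the weights after every training example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(p, i_1, i_2):
    """Forward propagation; returns the hidden and output activations."""
    a_1 = sigmoid(i_1 * p["w1"] + i_2 * p["w3"] + p["b2"])
    a_2 = sigmoid(i_1 * p["w2"] + i_2 * p["w4"] + p["b1"])
    a_o = sigmoid(a_1 * p["w5"] + a_2 * p["w6"] + p["b3"])
    return a_1, a_2, a_o


def gradients(p, i_1, i_2, y_train):
    """Backpropagation for a single training example."""
    a_1, a_2, a_o = forward(p, i_1, i_2)
    delta_o = (a_o - y_train) * a_o * (1.0 - a_o)
    delta_1 = delta_o * p["w5"] * a_1 * (1.0 - a_1)
    delta_2 = delta_o * p["w6"] * a_2 * (1.0 - a_2)
    return {
        "w5": delta_o * a_1, "w6": delta_o * a_2, "b3": delta_o,
        "w1": delta_1 * i_1, "w3": delta_1 * i_2, "b2": delta_1,
        "w2": delta_2 * i_1, "w4": delta_2 * i_2, "b1": delta_2,
    }


def total_error(p, data):
    """Sum of the half squared errors over the whole data set."""
    return sum(0.5 * (y - forward(p, i_1, i_2)[2]) ** 2 for i_1, i_2, y in data)


def train(p, data, eta, epochs):
    for _ in range(epochs):
        for i_1, i_2, y in data:
            g = gradients(p, i_1, i_2, y)
            # All gradients are computed before any weight changes.
            p = {name: p[name] - eta * g[name] for name in p}
    return p


or_gate = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]

random.seed(0)  # fixed seed so the run is repeatable
names = ["w1", "w2", "w3", "w4", "w5", "w6", "b1", "b2", "b3"]
params = {name: random.uniform(-0.5, 0.5) for name in names}

error_before = total_error(params, or_gate)
params = train(params, or_gate, eta=0.5, epochs=2000)
error_after = total_error(params, or_gate)
print(error_before, error_after)
```

The error after training should be noticeably lower than the error before.  Comparing the analytic gradients against finite-difference estimates, as is common practice, is a useful check that a hand-written backpropagation like this one is correct.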

In Closing

To reiterate, this process may take a few reads to understand and get a good grasp on.  The effort is well worth it, particularly if you are trying to debug issues with training neural networks.  I also recommend that you read the Jupyter notebook that partners this article, as it demonstrates the practical application of this working.
