# Neural Networks and Backpropagation: Part One

Lately I’ve been implementing my own neural network code base.  I’ve had a play around with some of the common libraries such as Lasagne and TensorFlow, and while these libraries are great, it is important to understand the content itself.  There are two reasons for this.  The first is a thorough understanding of what exactly the network is doing, which in my opinion is certainly worth the effort!  At times using pre-existing libraries is a lot like using a black box: you are not sure exactly what it is doing, but you are pretty confident it is working correctly.  For your standard use cases the black box can be OK, but you really should have an understanding of what the library is doing.  The second, and one of the most important reasons, is that you have complete control over your implementation and can add, remove or modify features and network architectures as you see fit.  This is not as easily done in pre-written libraries, as you may be constrained by a particular framework or need a thorough understanding of a potentially large and broad code base.

Don’t forget to grab the Jupyter notebook for this article from GitHub here or by cloning:
git clone git@github.com:tinhatben/or_gate.git

## Approach

For the purposes of this article we are going to examine a simple neural network that computes an OR gate.  In computer science / engineering an OR gate is one of the most common logic structures.  An OR gate takes two binary values (1 or 0) as inputs and returns a 1 if at least one of the inputs is 1.  The following truth table shows the possible combinations of values:

| Input 1 | Input 2 | Output |
|---------|---------|--------|
| 0       | 0       | 0      |
| 0       | 1       | 1      |
| 1       | 0       | 1      |
| 1       | 1       | 1      |
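As a small sketch, the truth table can be written down as NumPy arrays, which is a convenient form for feeding a network later on.  The names `X` and `y` are my own choice here and are not taken from the notebook.

```python
import numpy as np

# Each row is one (Input 1, Input 2) pair from the truth table.
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

# Expected OR output for each row: 1 if at least one input is 1.
y = np.logical_or(X[:, 0], X[:, 1]).astype(int)

for inputs, target in zip(X, y):
    print(inputs, "->", target)
```

Computing `y` with `np.logical_or` rather than typing it out by hand gives a quick check that the table really is the OR function.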

We are going to construct a neural network that, when fed with the inputs above, returns the correct results, and we will do this from two perspectives.  Firstly we will implement the math by hand, calculating the various values along the way, then we will recreate these calculations in Python.  Using this combined method we should get a good understanding of how to implement the network and complete backpropagation.

It should be noted that this post does not cover the intuitions behind neural networks as a concept.  It does not cover the typical analogy with biology regarding myelin sheaths, activation potentials etc.

One of the most common statements written in various backpropagation tutorials is: don’t expect to understand the concept or process immediately.  The basics are reasonably intuitive, but it takes a few reads / re-reads of the content to get a really thorough understanding and be confident in your implementation.  One of the reasons I am writing this article is to solidify my own understanding after reading the content over the last week.

# Part One: Forward Propagation

## Getting Started

The first step is to decide on an initial network architecture.  I say initial because your network performance will vary depending on the choices made and the architecture may require tweaking.  Having implemented a network, there may be specific reasons for changing the number of layers, nodes, bias units etc.  In deciding the architecture for the OR gate we know two things:

1. As per the truth table above we provide the network with 2 input values i.e. 0, 0 or 0, 1 or 1, 0 or 1, 1
2. The network returns a single value 1 or 0

These two facts will help in the design of the network, as they need to be accommodated.  Quite often you can get very good network performance using a single hidden layer, so it is a good choice here as well.  One of the most commonly varied aspects of the network architecture is the number of hidden units in the hidden layer.  This will vary depending on the application, and to be confident you have made the right selection you will just have to try a few values.  For now we are going to choose 2 hidden units; this is the same as the number of input units, which is often an indicator that the network is capable of learning the training set.  We can visualise our network right now as the following:

The two inputs, e.g. 1, 0, are denoted as $i_1$ and $i_2$, the hidden units are denoted as $h_1$ and $h_2$, while the output unit is denoted as $h_o$.  Other applications will probably have differing numbers of input and output units, while as stated before the number of hidden units should be actively investigated and varies with the application at hand.  Between each of the units are weights ($w$); these are the values that the neural network needs to determine in order to make accurate predictions.  The whole point of training a neural network is to determine the values for the weights that produce the best predictions.  There are two other units missing from our network right now: the bias units.  Remembering back to the post on Linear Regression, the bias units act like the y-intercepts in our linear regression model.  These units adjust the “firing threshold” of the neurons by applying another set of weights to a constant input of 1.  Note that the value of 1 is not unique to the OR gate problem; all bias units have a fixed input of 1, as this is how the offset is applied.

So now we have the additional weights for the bias units; for clarity they are labelled $b_1$, $b_2$ and $b_3$.

## Initialising Weights

Normally, when we initialise the weights of a network, we want to use random values close to zero, say -0.5 to 0.5.  There are 2 reasons why we wish to do this:

1. Initialising with random values prevents any unintentional trends from occurring during training.  Weights with the same values may increase or decrease together in a pattern, which would lead to errors.
2. Having values close to zero allows for smaller changes at the start of the training process.  Large changes in weight values can lead to an unstable network and may prevent convergence.
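A minimal sketch of this kind of initialisation, assuming NumPy and the 2-input, 2-hidden-unit, 1-output architecture chosen above.  The variable names, the seed and the choice to store each bias weight as a third column are my own assumptions for illustration.

```python
import numpy as np

# The seed is arbitrary; it only makes the example repeatable.
rng = np.random.default_rng(seed=0)

# Hidden layer: 2 hidden units, each with 2 input weights plus 1 bias weight.
hidden_weights = rng.uniform(-0.5, 0.5, size=(2, 3))

# Output layer: 1 output unit with 2 hidden-unit weights plus 1 bias weight.
output_weights = rng.uniform(-0.5, 0.5, size=(1, 3))

print(hidden_weights)
print(output_weights)
```

Because every value is drawn independently, no two weights are likely to start out equal, which is what point 1 asks for.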

Initialising weights is a critical part of the network design process.  If we choose weights that are too large, then it may take a really long time during the training process to approach the optimal values (if they are reached at all).  We will examine this in more detail when we look at activation functions.

For the purposes of this example we do not want to use random values for the weights but rather known values.  This will help in understanding the training process and backpropagation.  We are going to start with the following values:

• $w_1 = 0.1$
• $w_2 = 0.2$
• $w_3 = 0.3$
• $w_4 = 0.4$
• $b_1 = 0.5$
• $b_2 = 0.5$ (Note this does differ from point 1 above.  In real applications $b_1$ and $b_2$ should have different random values)
• $w_5 = 0.01$
• $w_6 = 0.02$
• $b_3 = 0.03$
• $i_1 = 1$
• $i_2 = 0$
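The starting values can be collected into NumPy arrays, as in the sketch below.  The array names and the row/column layout are my own assumptions; the layout follows the forward propagation equations later in the article, where $h_1$ uses $w_1$, $w_3$ and $b_2$, $h_2$ uses $w_2$, $w_4$ and $b_1$, the weight of 0.02 is $w_6$ and the second input is $i_2$.

```python
import numpy as np

# Hidden layer weights, one row per hidden unit.
# Columns: weight on i_1, weight on i_2, bias weight.
W_hidden = np.array([[0.1, 0.3, 0.5],   # h_1: w_1, w_3, b_2
                     [0.2, 0.4, 0.5]])  # h_2: w_2, w_4, b_1

# Output layer weights: w_5, w_6, b_3.
W_output = np.array([[0.01, 0.02, 0.03]])

# Inputs i_1 = 1 and i_2 = 0, followed by the constant bias input of 1.
inputs = np.array([1.0, 0.0, 1.0])

print(W_hidden.shape, W_output.shape, inputs.shape)
```

Appending the constant 1 to the inputs lets the bias weights sit in the same matrix as the other weights.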

## Non-linearities

Up to now, we have selected the network topology and initialised the weights.  We now need to select a non-linearity, or activation function, for each of the layers.  The activation function is applied to the sum of the products of the weights and the previous activation values; in the case of the hidden layer the previous activation is the input to the network.  This function will determine how the neuron “fires” at each stage.  For this example we are going to select the Sigmoid function, however we could select other functions such as tanh, square or linear.

### The Sigmoid Function

$g(z) = \frac{1}{1 + e^{-z}}$

The Sigmoid function effectively caps the output of the neurons between 0 and 1, as can be seen in the graph below; as $z$ approaches $+\infty$ the sigmoid approaches 1, and as $z$ approaches $-\infty$ it approaches 0.  This is exactly what we want for a neural network that computes the OR gate, where we want the output to be either 0 or 1.  Using the Sigmoid function in the hidden layer also provides stability by preventing the activations from becoming too large or too small, capping them at 0 and 1.  The graph also ties in with what we said previously about initialising weights around 0.  At $z = 0$ the sigmoid gives $g(0) = 0.5$, which is half way between the capped values and is also the point of greatest gradient within the function.  By initialising the weights close to 0, we are giving the neurons a roughly equal chance of approaching 0 or 1, with the high gradient allowing them to get there faster.
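A short sketch of the Sigmoid function in Python, assuming NumPy.  The function names are my own, and the gradient uses the standard identity $g'(z) = g(z)(1 - g(z))$, which is not derived in this article.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    """Slope of the sigmoid at z, using g'(z) = g(z) * (1 - g(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The output is capped between 0 and 1 and is exactly 0.5 at z = 0.
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(z, sigmoid(z), sigmoid_gradient(z))
```

Printing a few values shows the behaviour described above: the output sits at 0.5 for $z = 0$, flattens out towards 0 and 1 at the extremes, and the gradient is largest at $z = 0$.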

## Forward Propagation

Finally we can combine all of the work above to determine what the initial outputs of the network will be.  Firstly, let’s calculate the activations of the hidden layer:

Summing the previous activations (input values):

$h_1 = i_1w_1 + i_2w_3 + b_2$

$h_2 = i_1w_2 + i_2w_4 + b_1$

We can also represent these equations in matrix form:

$H = \begin{bmatrix} w_1 & w_3 & b_2 \\ w_2 & w_4 & b_1 \\ \end{bmatrix} \begin{bmatrix} i_1 \\ i_2 \\ 1 \\ \end{bmatrix}$

Substituting the values:

$h_1 = 1(0.1) + 0(0.3) + 0.5 = 0.6$

$h_2 = 1(0.2) + 0(0.4) + 0.5 = 0.7$

$H = \begin{bmatrix} 0.6 \\ 0.7 \\ \end{bmatrix}$

Now the activations $a_x$:

$a_1 = g(h_1) = 0.646$

$a_2 = g(h_2) = 0.668$

$A = \begin{bmatrix} 0.646 \\ 0.668 \\ \end{bmatrix}$
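The hidden layer calculation can be reproduced with a matrix-vector product, as a sketch assuming NumPy.  The names `W_hidden` and `x` are my own; the matrix rows follow the matrix form given above, with the bias weight in the last column and a constant 1 appended to the inputs.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Rows follow the matrix form: [w_1, w_3, b_2] and [w_2, w_4, b_1].
W_hidden = np.array([[0.1, 0.3, 0.5],
                     [0.2, 0.4, 0.5]])

# Inputs i_1 = 1, i_2 = 0, plus the constant bias input of 1.
x = np.array([1.0, 0.0, 1.0])

H = W_hidden @ x   # weighted sums h_1, h_2
A = sigmoid(H)     # activations a_1, a_2

print(np.round(H, 3))
print(np.round(A, 3))
```

Rounded to three decimal places, the printed values agree with the hand calculation: 0.6 and 0.7 for $H$, and 0.646 and 0.668 for $A$.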

Now we can continue this process to calculate the activation value at the output neuron:

$h_o = a_1w_5 + a_2w_6 + b_3$

Substituting the values:

$h_o = 0.646(0.01) + 0.668(0.02) + 0.03 = 0.04982$

Again calculating the activation:

$a_o = g(0.04982) = 0.512$
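Putting the steps together, the whole forward pass fits in one small function.  This is a sketch assuming NumPy; the function name `forward` and the array layout are my own and are not necessarily how the notebook organises the code.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(i_1, i_2, W_hidden, W_output):
    """Forward propagation through the 2-2-1 network; returns a_o."""
    x = np.array([i_1, i_2, 1.0])        # inputs plus constant bias input
    A = sigmoid(W_hidden @ x)            # hidden activations a_1, a_2
    a_with_bias = np.append(A, 1.0)      # hidden activations plus bias input
    h_o = W_output @ a_with_bias         # weighted sum at the output unit
    return float(sigmoid(h_o)[0])        # output activation a_o

# Starting weights from the article; the bias weight is the last column.
W_hidden = np.array([[0.1, 0.3, 0.5],
                     [0.2, 0.4, 0.5]])
W_output = np.array([[0.01, 0.02, 0.03]])

print(round(forward(1, 0, W_hidden, W_output), 3))

# The same untrained weights applied to every row of the truth table.
for i_1, i_2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(i_1, i_2, round(forward(i_1, i_2, W_hidden, W_output), 3))
```

For the inputs 1, 0 this gives 0.512, matching the hand calculation.  With these untrained weights the output stays just above 0.5 for all four input pairs, so the network does not yet behave like an OR gate.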