Linear Regression

Aside from being useful in its own right, understanding and being able to implement linear regression is an important first stepping stone to understanding and executing more complicated algorithms such as perceptrons and neural networks.  So let’s take that first step to bigger things!

Don’t forget to grab the Jupyter notebook for this blog post from GitHub here or by cloning:
git clone


Machine learning problems in general fall into one of two different camps:

  • regression problems involve the use of continuous data, e.g. real numbers (1, 2, 3, 8, 1.2, 34.5 etc.), and aim to build a model of the observed system such that predictions can be made which are also continuous in nature.
  • classification problems involve the use of discrete data and attempt to group the information.  In classification problems, observations either belong to a particular type or group, or they do not.

So obviously linear regression is a regression problem: what we are trying to achieve is a straight-line model of a set of continuous data such that we can make predictions for other input values.  As mentioned above, linear regression is a stepping stone towards perceptrons and neural networks, tools that are generally used to solve classification problems.  So we will be using regression techniques as a smaller part in solving larger classification problems.

Linear regression is also a very common tool for statisticians, and you can even perform basic linear regression in spreadsheets such as Excel or LibreOffice Calc, where linear regression models are called trend lines.

One very important difference between spreadsheet trend lines and the linear regression techniques shown in this blog post is the ability to fit lines across more than 2 dimensions.  The techniques we use can be applied across n dimensions, not just 2!

Revision: Equation of a Straight Line

Remembering back to high school maths, at least in Australia (fyi: we do not have middle school; upon completion of primary you graduate into high school), the equation of a straight line is defined as follows:

y = mx + b


m is the gradient or slope and can be calculated as m = \frac{\Delta y}{\Delta x} ; and

b is the y-intercept (where the line intersects the y axis)


As an example y = 10x + 20 would produce the following line:

[Figure: straight line anatomy]
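The example line can be evaluated directly; a quick sketch (the function and variable names here are our own, not from the post's notebook):

```python
# Evaluate the example line y = 10x + 20 at a few x values
def line(x, m=10, b=20):
    return m * x + b

# At x = 0 the line crosses the y axis at b = 20, and each unit
# step in x increases y by the gradient m = 10
points = [(x, line(x)) for x in range(4)]
print(points)  # [(0, 20), (1, 30), (2, 40), (3, 50)]
```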

In linear regression, given a set of data such as that shown below, we are trying to find the values for m and b that give the least error on the predictions.  So let’s try and find the linear model for the following data.

[Figure: uni vs sleep]

The Hypothesis

We are trying to find the gradient and y-intercept of a straight line that best matches the observed data we are interested in.  So let’s define a hypothesis function to describe our best guess:

h(x^{(i)}) = \theta_0 + \theta_1x^{(i)}

This format should look quite familiar as it is essentially y = mx + b, but to simplify the function a little more we can define x_0 = 1 so:

h(x^{(i)}) = \theta_0x_0^{(i)} + \theta_1x_1^{(i)} = \sum_{j=0}^{1} \theta_jx_j^{(i)}
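With the bias term x_0 = 1 folded in, the hypothesis is just a dot product. A minimal sketch assuming NumPy; the \theta values below are arbitrary placeholders, not fitted values:

```python
import numpy as np

# Hypothesis h(x) = theta_0 * x_0 + theta_1 * x_1, with x_0 = 1
def hypothesis(theta, x):
    # x is a feature vector whose first element is the bias term x_0 = 1
    return np.dot(theta, x)

theta = np.array([20.0, 10.0])   # [theta_0, theta_1]
x = np.array([1.0, 3.0])         # [x_0 = 1, x_1 = 3]
print(hypothesis(theta, x))      # 20 + 10*3 = 50.0
```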

Cost Function & Gradient Descent

Now in order to find the best fit line for the data, we need to know what the error or cost of the fit is.  To define the cost of the fit we will look at the distance between the hypothesis function for a given point and the observed value.  We will be using the least squares method of error estimation which examines the square of the difference between two values, so the cost function:

J(\theta) = \frac{1}{2}\sum_{i=1}^{n}(h(x^{(i)}) - y^{(i)})^2

is the sum of the square of differences between the guessed and observed value for each point in the training set.  The one-half fraction is included to make the gradient descent calculation easier later on, but is simply used as a scaling factor here.
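The cost function translates directly to code. A sketch using made-up training data (chosen so that \theta = [20, 10] fits exactly); NumPy is assumed:

```python
import numpy as np

# Least-squares cost J(theta) = 1/2 * sum((h(x_i) - y_i)^2)
def cost(theta, X, y):
    predictions = X @ theta        # h(x) for every training example
    return 0.5 * np.sum((predictions - y) ** 2)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + x_1
y = np.array([30.0, 40.0, 50.0])                    # y = 10x + 20 exactly
print(cost(np.array([20.0, 10.0]), X, y))           # a perfect fit costs 0.0
```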

Gradient Descent

The process of gradient descent is the means by which we determine the values of \theta to minimise the cost function; during gradient descent we start with an initial guess for \theta and take a step down the slope of the cost function and re-evaluate \theta. This process is repeated until the minimum of the cost function has been found and there are no more steps that can reduce the cost.  The image below graphically represents the steps taken in gradient descent.


[Figure: Gradient descent.svg — derived from Gradient_descent.png by Olegalexandrov (English Wikipedia); derivative work by Zerodamage. Public Domain.]

For each iteration of gradient descent we need to execute the following:

\theta_j := \theta_j - \left(\frac{\alpha}{n}\right) \frac{d}{d\theta_j}J(\theta)

\frac{d}{d\theta_j}J(\theta) = \sum_{i=1}^{n}(h(x^{(i)}) - y^{(i)})x_j^{(i)} and;

\alpha is defined as the learning rate

The learning rate is the size of the step taken down the slope of the cost function, and needs to be balanced.  A large \alpha will take large steps down the slope, while a smaller \alpha will produce smaller steps.  The definition of \alpha does more than just adjust the speed of convergence of gradient descent.  While a small \alpha will take longer to converge, on more complex (non-convex) cost functions it can also get stuck at local minima.  Conversely values for \alpha that are large will take bigger steps but may never converge, stepping over the minimum each time and never actually reaching it.
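The update rule above can be sketched as a batch gradient descent loop. The data, learning rate and iteration count below are illustrative (points drawn exactly from y = 10x + 20, so the fitted \theta should recover those values); this is not the post's notebook code:

```python
import numpy as np

# Batch gradient descent for linear regression:
# theta_j := theta_j - (alpha / n) * sum((h(x_i) - y_i) * x_ij)
def gradient_descent(X, y, alpha=0.01, iterations=1000):
    n = len(y)
    theta = np.zeros(X.shape[1])          # initial guess: all zeros
    for _ in range(iterations):
        error = X @ theta - y             # h(x_i) - y_i for every example
        gradient = X.T @ error            # d/dtheta_j of J(theta)
        theta -= (alpha / n) * gradient   # step down the slope
    return theta

# Bias column of ones plus x values 0..4; targets lie exactly on y = 10x + 20
X = np.column_stack([np.ones(5), np.arange(5, dtype=float)])
y = 10 * np.arange(5, dtype=float) + 20
print(gradient_descent(X, y, alpha=0.05, iterations=5000))  # approx [20, 10]
```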

The appropriate value for \alpha will vary depending upon the data and the model; the GitHub example uses a learning rate of 0.01 and converges after approximately 400 iterations.  After convergence we have a model and are able to make predictions using:

h(x^{(i)}) = \theta_0 + \theta_1x^{(i)}


As we can see the model is a reasonable fit for the data; it covers most of the points well, while not trying to fit all of the points exactly, allowing for generalisation.  In a later post we will discuss the characteristics of a good fit in more detail, but for the moment this is a good description of fit.

Using Linear Regression to Fit Polynomial Models

Now we can also use the technique of gradient descent to determine polynomial models for the data.  Say we have a data set that looks far less linear and more cubic, e.g.

[Figure: polynomial data]

Now we want to fit a model that is cubic:

h(x) = \theta_0x_0  + \theta_1x_1 + \theta_2x_1^2 + \theta_3x_1^3

Firstly we need to set up the feature vector including the bias term x_0 :

X = \begin{bmatrix} 1 & x_1^{(1)} & (x_1^{(1)})^2 & (x_1^{(1)})^3 \\ 1 & x_1^{(2)} & (x_1^{(2)})^2 & (x_1^{(2)})^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_1^{(n)} & (x_1^{(n)})^2 & (x_1^{(n)})^3 \end{bmatrix}
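Building this feature matrix is a one-liner with NumPy; a sketch with illustrative input values:

```python
import numpy as np

# Build the cubic feature matrix X = [1, x, x^2, x^3] row by row
def cubic_features(x):
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

X = cubic_features([1.0, 2.0, 3.0])
print(X)  # each row is [1, x, x^2, x^3] for one training example
```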

To prevent the polynomial terms from dominating the fit of the model inappropriately, we need to scale each of the terms.  There are a number of different scaling factors one could use; the one we will select subtracts the mean and divides by the standard deviation (ignoring the first column of X):

x = \frac{X - \mu_X}{\sigma_X}

We will need to store the values for \mu_X and \sigma_X as these will be used to make predictions.
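The scaling step, keeping \mu_X and \sigma_X for later, might look like this (a sketch assuming NumPy, not the post's exact notebook code):

```python
import numpy as np

# Scale the polynomial feature columns (leaving the bias column alone)
# and keep mu and sigma so the same transform can be applied at
# prediction time
def scale_features(X):
    mu = X[:, 1:].mean(axis=0)
    sigma = X[:, 1:].std(axis=0)
    X_scaled = X.copy()
    X_scaled[:, 1:] = (X[:, 1:] - mu) / sigma
    return X_scaled, mu, sigma

X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 9.0]])   # bias, x, x^2 for three examples
X_scaled, mu, sigma = scale_features(X)
print(X_scaled.mean(axis=0))      # scaled columns now have zero mean
```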

Now we can train the model using gradient descent as we have above, determining the appropriate values for \theta and hence the model for the data.  This time instead of producing 2 values for \theta, we will determine 4.  Now, when we make a prediction we need to normalise the data first by subtracting \mu_X and dividing by \sigma_X. As before, the prediction can be made using:

h(x) = \theta^Tx
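Putting the prediction step together: normalise the new input with the stored \mu_X and \sigma_X, prepend the bias term, and take the dot product. The values for theta, mu and sigma below are purely illustrative, not fitted parameters:

```python
import numpy as np

# Predict with a fitted cubic model: normalise the new input with the
# stored mu and sigma, then compute theta . [1, features]
def predict(x_new, theta, mu, sigma):
    features = np.array([x_new, x_new**2, x_new**3])
    scaled = (features - mu) / sigma
    return theta @ np.concatenate([[1.0], scaled])  # prepend bias x_0 = 1

theta = np.array([1.0, 0.5, 0.0, 0.25])  # illustrative values only
mu = np.array([2.0, 5.0, 14.0])
sigma = np.array([1.0, 4.0, 13.0])
print(predict(2.0, theta, mu, sigma))
```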


Coming Up

Logistic Regression & Assessing Model Fit

