# Linear Regression

Other than being useful in its own right, understanding and being able to implement linear regression is an important first stepping stone toward understanding and implementing more complicated algorithms such as perceptrons and neural networks.  So let’s take that first step to bigger things!

Don’t forget to grab the Jupyter notebook for this blog post from GitHub here or by cloning:

```shell
git clone git@github.com:tinhatben/linear_regression.git
```

Machine learning problems in general fall into one of two different camps:

• regression problems involve the use of continuous data, e.g. real numbers (1, 2, 3, 8, 1.2, 34.5 etc.), and aim to make a model of the observed system such that predictions can be made which are also continuous in nature.
• classification problems involve the use of discrete data and attempt to group the information.  In classification problems, observations either are of a particular type, in a particular group, or they are not.

So obviously linear regression is a regression problem: we are trying to fit a straight-line model to a set of continuous data such that we can make predictions for other input values.  As mentioned above, linear regression is a stepping stone towards perceptrons and neural networks, tools that are generally used to solve classification problems.  So we will be using regression techniques as a smaller part in solving larger classification problems.

Linear regression is also a very common tool for statisticians, and you can even execute basic linear regression in spreadsheets such as Excel or LibreOffice Calc.  In these programs linear regression models are called trend lines, and if you are interested, instructions for adding trend lines are available for both programs.

One very important difference between spreadsheet trend lines and the linear regression techniques shown in this blog post is the ability to fit lines across more than 2 dimensions.  The techniques we use can be applied across n dimensions, not just 2!

## Revision: Equation of a Straight Line

Remembering back to high school maths, at least in Australia (fyi: we do not have middle school; upon completion of primary you graduate into high school), the equation of a straight line is defined as follows:

$y = mx + b$

Where:

$m$ is the gradient or slope and can be calculated as $m = \frac{\Delta y}{\Delta x}$; and

$b$ is the y-intercept (where the line intersects the y axis)

As an example $y = 10x + 20$ would produce the following line:

In linear regression, given a set of data such as that shown below we are trying to find the best values for $m$ and $b$ that gives the least error on the predictions.  So let’s try and find the linear model for the following data.
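As a quick sanity check of the straight-line equation above, the line $y = 10x + 20$ can be evaluated at a few sample points.  This is a minimal sketch assuming NumPy is available (as in the notebook); the $x$ values are illustrative.

```python
import numpy as np

# Evaluate y = mx + b for the example line y = 10x + 20
m, b = 10, 20                      # gradient and y-intercept
x = np.array([0.0, 1.0, 2.0, 3.0]) # illustrative sample inputs
y = m * x + b
print(y)                           # [20. 30. 40. 50.]
```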

## The Hypothesis

We are trying to find the gradient and y-intercept of a straight line that best matches the observed data we are interested in.  So let’s define a hypothesis function to describe our best guess:

$h(x^{(i)}) = \theta_0 + \theta_1x^{(i)}$

This format should look quite familiar as it is essentially $y = mx + b$ but to simplify the function a little more we can define $x_0 = 1$ so:

$h(x^{(i)}) = \theta_0x_0^{(i)} + \theta_1x_1^{(i)} = \sum_{j=0}^{1} \theta_j x_j^{(i)}$
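With $x_0 = 1$, the hypothesis is just a dot product between the parameter vector and the feature vector.  A minimal sketch, assuming NumPy; the values of $\theta$ and $x$ here are illustrative placeholders:

```python
import numpy as np

# Hypothesis h(x) = theta_0 * x_0 + theta_1 * x_1, with the bias x_0 = 1
theta = np.array([20.0, 10.0])  # [theta_0, theta_1], illustrative values
x = np.array([1.0, 3.0])        # [x_0, x_1]; x_0 is fixed at 1
h = theta @ x                   # dot product = sum of theta_j * x_j
print(h)                        # 50.0
```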

## Cost Function & Gradient Descent

Now in order to find the best fit line for the data, we need to know what the error or cost of the fit is.  To define the cost of the fit we will look at the distance between the hypothesis function for a given point and the observed value.  We will be using the least squares method of error estimation which examines the square of the difference between two values, so the cost function:

$J(\theta) = \frac{1}{2}\sum_{i=1}^n(h(x^{(i)}) - y^{(i)})^2$

is the sum of the square of differences between the guessed and observed value for each point in the training set.  The one-half fraction is included to make the gradient descent calculation easier later on, but is simply used as a scaling factor here.
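The cost function above can be written compactly by stacking the training examples into a matrix.  A minimal sketch, assuming NumPy; the matrix $X$ (with a bias column of ones), targets $y$, and parameters $\theta$ are illustrative:

```python
import numpy as np

# Least-squares cost J(theta) = 0.5 * sum((h(x) - y)^2), vectorised
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])            # first column is the bias x_0 = 1
y = np.array([30.0, 40.0, 50.0])      # illustrative observed values
theta = np.array([20.0, 10.0])        # illustrative parameters

def cost(theta, X, y):
    residuals = X @ theta - y         # h(x^(i)) - y^(i) for every i
    return 0.5 * residuals @ residuals

print(cost(theta, X, y))              # 0.0 — this theta fits the data exactly
```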

The process of gradient descent is the means by which we determine the values of $\theta$ to minimise the cost function; during gradient descent we start with an initial guess for $\theta$ and take a step down the slope of the cost function and re-evaluate theta. This process is repeated until the minima of the cost function has been found and there are no more steps to be taken that can reduce the cost.  The image below graphically represents the steps taken in gradient descent.

derivative work: Zerodamage – This file was derived from  Gradient descent.png: , Public Domain, https://commons.wikimedia.org/w/index.php?curid=20569355

For each iteration of gradient descent we need to execute the following:

$\theta_j := \theta_j - \left(\frac{\alpha}{n}\right) \frac{\partial}{\partial\theta_j}J(\theta)$

Where:
$\frac{\partial}{\partial\theta_j}J(\theta) = \sum_{i=1}^n(h(x^{(i)}) - y^{(i)})x_j^{(i)}$ and;

$\alpha$ is defined as the learning rate

The learning rate is the size of the step taken down the slope of the cost function, and needs to be balanced.  A large $\alpha$ will take large steps down the slope, while a smaller $\alpha$ will produce smaller steps.  The choice of $\alpha$ does more than just adjust the speed of convergence: a small $\alpha$ will take longer to converge and, for cost functions with multiple minima, can get stuck at a local minimum.  Conversely, an $\alpha$ that is too large may never converge at all, stepping over the minimum each time and never actually reaching it.

The appropriate value for $\alpha$ will vary depending upon the data and the model; the GitHub example uses a learning rate of 0.01 and converges after approximately 400 iterations.  After convergence we have a model and are able to make predictions using:
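The update rule above can be sketched as a short batch gradient descent loop.  This assumes NumPy; the data is illustrative (not the notebook's dataset), though $\alpha = 0.01$ matches the learning rate mentioned in the text:

```python
import numpy as np

# Batch gradient descent for the straight-line model
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])                 # bias column of ones, then x_1
y = np.array([31.0, 39.0, 51.0, 59.0])     # illustrative observations
n = len(y)

theta = np.zeros(2)                        # initial guess for theta
alpha = 0.01                               # learning rate
for _ in range(5000):
    gradient = X.T @ (X @ theta - y)       # dJ/dtheta_j, summed over examples
    theta -= (alpha / n) * gradient        # step down the slope

print(theta)                               # converges towards [21.0, 9.6]
```

On each iteration the whole training set contributes to the gradient, which is why the step is scaled by $\frac{\alpha}{n}$.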

$h(x^{(i)}) = \theta_0 + \theta_1x^{(i)}$

As we can see the model is a reasonable fit for the data; it covers most of the points well, while not trying to fit all of the points exactly, allowing for generalisation.  In a later post we will discuss the characteristics of a good fit in more detail, but for the moment this is a good description of fit.

## Using Linear Regression to Fit Polynomial Models

Now we can also use the technique of gradient descent to determine polynomial models for the data.  Say we have a data set that looked far less linear, and more cubic e.g.

Now we want to fit a model that is cubic:

$h(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_1^2 + \theta_3x_1^3$

Firstly we need to set up the feature vector, including the bias term $x_0$:

$X = \begin{bmatrix} 1 & x_{11} & x_{11}^2 & x_{11}^3 \\ 1 & x_{12} & x_{12}^2 & x_{12}^3 \\ \vdots & \vdots & \vdots& \vdots\\ 1 & x_{1n} & x_{1n}^2 & x_{1n}^3\\ \end{bmatrix}$

To prevent the polynomial terms from dominating the fit of the model inappropriately, we need to scale each of the terms.  There are a number of different scaling schemes one could use; the one we will select is standardisation: subtracting the mean and dividing by the standard deviation (ignoring the first column of $X$):

$x = \frac{X - \mu_X}{\sigma_X}$

We will need to store the values for $\mu_X$ and $\sigma_X$ as these will be used to make predictions.
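Building the cubic feature matrix and standardising its polynomial columns can be sketched as follows, assuming NumPy; the raw $x_1$ values are illustrative:

```python
import numpy as np

# Construct X = [1, x1, x1^2, x1^3] and scale the non-bias columns
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative inputs
X = np.column_stack([np.ones_like(x1), x1, x1**2, x1**3])

mu = X[:, 1:].mean(axis=0)                 # stored for later predictions
sigma = X[:, 1:].std(axis=0)               # stored for later predictions
X_scaled = X.copy()
X_scaled[:, 1:] = (X[:, 1:] - mu) / sigma  # each column: mean 0, std 1

print(X_scaled.mean(axis=0))               # non-bias columns now average ~0
```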

Now we can train the model using gradient descent as we have above, determining the appropriate values of $\theta$ for the data.  This time, instead of producing 2 values for $\theta$, we will determine 4.  When we make a prediction we need to normalise the data first by subtracting $\mu_X$ and dividing by $\sigma_X$.  As before, the prediction can be made using:

$h(x) = \theta^Tx$
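The prediction step for the cubic model can be sketched like this, assuming NumPy; the values of $\theta$, $\mu_X$, and $\sigma_X$ below are hypothetical placeholders standing in for the quantities stored during training:

```python
import numpy as np

def predict(x_new, theta, mu, sigma):
    # Build the polynomial features [1, x, x^2, x^3] for the new input,
    # apply the same scaling used during training, then dot with theta.
    features = np.array([1.0, x_new, x_new**2, x_new**3])
    features[1:] = (features[1:] - mu) / sigma
    return theta @ features

theta = np.array([1.0, 0.5, -0.2, 0.05])   # pretend trained parameters
mu = np.array([3.0, 11.0, 45.0])           # pretend stored column means
sigma = np.array([1.4, 12.0, 40.0])        # pretend stored column std devs
print(predict(2.0, theta, mu, sigma))
```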

## Coming Up

Logistic Regression & Assessing Model Fit