Principal Component Analysis Tutorial

Recently I have been reading up on Principal Component Analysis (PCA).  PCA is essentially a compression technique, providing a means of representing large or high-dimensional data sets with less information.  Using PCA you can take a large m x n matrix and reconstruct an approximation of the data from a smaller matrix and a set of principal components.

You can find the Python code for this post on github.

If you have yet to prepare your development environment, please see this post.

Before we get started on PCA, there are a couple of basic statistics concepts that we need to understand first.  Assuming absolutely no prior knowledge:

Basic Statistics

Mean

The mean is simply the average value for the data set and is calculated as:

$\mu = \frac{1}{n}\sum_{k=1}^n x_k$

Where $n$ is the number of values in the sample and $x_k$ represents each of the individual values in the set.

When we think of calculating the mean, we typically think of applying the calculation to a single 1-D array of data.  The mean, however, can also be computed across a matrix:

$\overline{\mu} = \frac{1}{n}\sum_{k=1}^n \overline{x_k}$
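As a quick illustration, here is both forms of the calculation in numpy (the numbers are made up for the example):

```python
import numpy as np

# Mean of a 1-D data set
x = np.array([1.0, 2.0, 3.0, 4.0])
mu = np.mean(x)                   # (1 + 2 + 3 + 4) / 4 = 2.5

# Mean computed across a matrix: axis=0 averages down each column
X = np.array([[1.0, 10.0],
              [3.0, 20.0]])
col_means = np.mean(X, axis=0)    # one mean per column
print(mu, col_means)
```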

Standard Deviation

The standard deviation is the spread of the data set or the average distance of a data point from the mean of the data.  Data that has a low standard deviation is clumped around the mean, while data that has a large standard deviation is generally positioned further from the mean.

The standard deviation can be easily calculated using the mean, and again it can also be calculated across a matrix:

$\sigma = \sqrt{\frac{\sum_{k=1}^n(x_k - \mu)^2}{n-1}}$

Interestingly, though the standard deviation is the average distance from the mean, the sum of squared differences is divided by ${n-1}$ and not $n$.  The reason for this lies in the fact that most data sets are a sample from a greater population, not the entire population itself; dividing by ${n-1}$ gives a more accurate estimate of the population's spread.  If you happen to have data for the entire population, feel free to use $n$; however, you will almost always use ${n-1}$.

One of the more useful statistical relationships is that in a normally distributed data set, 68.2% of the data lies within one standard deviation of the mean.

Normally distributed data

Another term, commonly associated with the standard deviation is the variance.  The variance is simply the square of the standard deviation and is denoted by $\sigma^2$.
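Both quantities are one-liners in numpy; the `ddof=1` argument gives the $n-1$ divisor from the formula above (the data here is made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Sample standard deviation: ddof=1 divides by n-1 rather than n
sigma = np.std(x, ddof=1)

# Variance is simply the square of the standard deviation
variance = np.var(x, ddof=1)
```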

Covariance

So now we know how to calculate the variance of a data set, let’s look at covariance.  The variance we calculated above is only for one-dimensional data; even if we compute the variance on a 2-D data set, each row or column will be considered independent of the others.  If we want to look at the variance of one data set with respect to another, we use covariance.  Say, for example, we have two data sets X and Y; we can compute the covariance:

$cov(X,Y) = \frac{\sum_{k=1}^n(X_k - \mu_X)(Y_k - \mu_Y)}{n-1}$

As we can see from the covariance equation, if we were to compute the covariance of X with respect to itself, we would simply arrive at the variance of X.  One significant difference between variance and covariance is that covariance can be negative, whereas variance only gives positive values.  A negative covariance indicates that one data set decreases in value while the other increases.  Conversely, a positive covariance indicates that the data sets increase together or decrease together.
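A direct translation of the formula makes both of these points easy to check (the data sets here are made up; Y increases with X, so the covariance comes out positive):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 4.0, 6.0, 8.0])   # increases with X -> positive covariance

def cov(a, b):
    # Direct translation of the covariance formula, with the n-1 divisor
    n = len(a)
    return np.sum((a - a.mean()) * (b - b.mean())) / (n - 1)

cov_xy = cov(X, Y)

# cov(X, X) is just the variance of X
```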

Now say we had more than two dimensions in our data set and we wanted to investigate the relationships between them.  We can compute a series of covariances and put them into a covariance matrix.  This is particularly useful when we have more than three dimensions and wish to investigate the trends between them, as we lose the ability to plot the data in a simple way.  Consider an example where we wanted to look at the relationships between the X, Y, and Z data sets.  We would construct a covariance matrix using a combination of covariances:

$\overline{cov} = \begin{bmatrix} cov(X, X) & cov(X, Y) & cov(X, Z) \\ cov(Y, X) & cov(Y, Y) & cov(Y, Z) \\ cov(Z, X) & cov(Z, Y) & cov(Z, Z) \end{bmatrix}$

A couple of interesting things to note about the covariance matrix:

1. The diagonal is the variance of the individual columns of the set
2. The matrix is symmetrical about the diagonal i.e. $cov(X, Y) = cov(Y, X)$
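numpy can build this matrix for us with `np.cov`, which by default treats each row of its input as one data set.  Both properties above are easy to verify (again with made-up data):

```python
import numpy as np

# Three made-up data sets, one per row (np.cov treats rows as variables)
data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0],
                 [8.0, 6.0, 4.0, 2.0]])

C = np.cov(data)   # 3 x 3 covariance matrix

# 1. The diagonal holds the variances of the individual data sets
# 2. The matrix is symmetric: cov(X, Y) == cov(Y, X)
print(C)
```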

Eigenvectors and Eigenvalues

In order to complete PCA we need to use and understand eigenvalues and eigenvectors. These are two important concepts in linear algebra, and the proofs behind them are quite difficult to say the least.  As such we will not cover the proofs of eigenvectors and eigenvalues, but rather enough to get an understanding of what they do and how you can use them.  Please pay attention to the Jupyter notebook provided with the post, in particular the section demonstrating how to calculate the eigenvalues and eigenvectors in Python.

A vector v is said to be an eigenvector of a linear transform T if it is a non-zero vector that does not change its direction when T is applied to it.  The eigenvalue λ is the scalar that multiplies the eigenvector to give the transformed eigenvector, i.e.

$T(\mathbf{v}) = \lambda\mathbf{v}$

If the linear transform T is a square matrix then:

$A\mathbf{v} = \lambda\mathbf{v}$

Wikipedia has this excellent visual representation to help explain the physical properties of eigenvalues and eigenvectors.  Consider the following image of the Mona Lisa: the image on the right has been sheared, as you can see from the change in direction of the red arrow and the effect on the image.  The blue arrow is an eigenvector, as its direction has not changed between the left and right images; the corresponding eigenvalue is 1 since the length of the eigenvector has not changed.  When we consider eigenvalues and eigenvectors, we should consider them as pairs, as they are almost always used in conjunction with each other.

Smith (2002) provides a good summary of the properties of eigenvalues and eigenvectors:

1. You can only find eigenvectors for square matrices
2. Not all square matrices have eigenvectors
3. Multiplying an eigenvector by a scalar does not change its direction, only its length
4. All eigenvectors of a symmetric matrix (such as a covariance matrix) are perpendicular to one another, irrespective of how many dimensions the matrix has.  This is an advantage because it is impossible to visually represent more than three orthogonal axes.

You can calculate the eigenvectors and eigenvalues of small matrices, e.g. 3 x 3, by hand; however anything larger requires special techniques.  In practice though you would rarely do this; instead a software package such as Python or Octave would be used.  The Jupyter notebook for the post demonstrates how to calculate the eigenvalues and eigenvectors in Python.
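As a small sketch of what that looks like, `np.linalg.eig` returns the eigenvalue/eigenvector pairs of a square matrix; the matrix here is made up, and chosen symmetric so the eigenvectors are orthogonal:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # symmetric, made-up example matrix

eigenvalues, eigenvectors = np.linalg.eig(A)

# Each column of `eigenvectors` pairs with the eigenvalue at the same
# index, satisfying A v = lambda v
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    assert np.allclose(A @ v, eigenvalues[i] * v)
```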

The PCA Process

So now we have all of the components required to complete principal component analysis, so let’s go through the process.  In the Jupyter notebook I have added some fake data, corresponding to the number of coffees consumed vs the number of pages of thesis written.  This data has been loaded into the variable coffee_v_thesis, though feel free to use other data sets as you wish.

1.  Once we have our data we first need to subtract the mean for the corresponding axes.

$X = X - \overline{x}$

or

$x = x - \mu_x$

$y = y - \mu_y$

2.  Next we calculate the covariance matrix for the mean adjusted data.

3.  Calculate the eigenvalues ($\lambda$) and eigenvectors ($E$) of the covariance matrix, and order the eigenvalues such that they are in decreasing order.  Each eigenvalue provides the variance of the data about the mean along the direction of its corresponding eigenvector.  The eigenvector with the largest eigenvalue is known as the principal component, as it contributes the most variance about the mean for the data set.  The total variance can be determined by summing the eigenvalues:

$V_T = \sum_{i}\lambda_i$

4.  Now we have the total variance of the data and the proportion of variance provided by each eigenvalue.  This is where we can begin to “compress” the data: first we choose the fraction of variance we wish to represent in our model, i.e. how much we wish to “compress” the data.  Say, for example, we want to represent 98% of the variance ($f_v$) in our model.  The next step is to take the cumulative sum of the individual eigenvalues divided by $V_T$ and select the first eigenvalues that together account for at least the specified variance $f_v$.  Once the eigenvalues have been selected, take note of the corresponding eigenvectors as these will be needed in the next step.

5. With the selected eigenvalues, take the corresponding selected eigenvectors and assign them to P

$\mathbf{P} \subset \mathbf{E}$

6. Now with P we can transform the mean-adjusted data set.  Once we execute the transform we can no longer consider the data as parameterized by the original x and y parameters, or in the case of the github example, coffees consumed vs pages written.  The data now lies in a separate space that is parameterized by P. The data can be transformed by:

$\mathbf{x}_{transformed} = \mathbf{P}\mathbf{X}^T$
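Steps 1 to 6 can be sketched in a few lines of numpy.  The data below is made up for the example (substitute coffee_v_thesis from the notebook), and P holds the selected eigenvectors as rows:

```python
import numpy as np

# Made-up 2-D data set: one row per sample, one column per variable
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
              [1.5, 1.6], [1.1, 0.9]])

# 1. Subtract the mean of each column
mean = X.mean(axis=0)
X_adj = X - mean

# 2. Covariance matrix of the mean-adjusted data
C = np.cov(X_adj, rowvar=False)

# 3. Eigenvalues/eigenvectors, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eig(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep enough eigenvalues to cover the desired fraction of variance
f_v = 0.98
keep = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), f_v) + 1

# 5. P is the subset of eigenvectors, one per row
P = eigvecs[:, :keep].T

# 6. Transform the mean-adjusted data into the new space
x_transformed = P @ X_adj.T
```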

Recovering the Data

Say we have done some work with the data and we now want to return to the original parameterized space.  To do this we need to transform with P again and add back the mean values we subtracted earlier.

$X = \overline{x} + \left(\mathbf{P}^T\mathbf{x}_{transformed}\right)^T$

One thing to note is that unless you have used all of the eigenvectors during the selection process (step 4 above) you will not be able to completely recreate the original data set.  By not using all of the eigenvectors some of the information has been lost.  The image below demonstrates how much information can be lost if only the principal component is used.
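The recovery step, and the information loss, can be sketched as follows.  The data is again made up for the example; keeping only the principal component gives an approximate reconstruction, while keeping every eigenvector recovers the data exactly:

```python
import numpy as np

# Made-up 2-D data set: one row per sample, one column per variable
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
              [1.5, 1.6], [1.1, 0.9]])
mean = X.mean(axis=0)
X_adj = X - mean

# Eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eig(np.cov(X_adj, rowvar=False))
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]

# Keep only the principal component, then transform back
P = eigvecs[:, :1].T
x_transformed = P @ X_adj.T
X_recovered = (P.T @ x_transformed).T + mean   # lossy reconstruction

# Using all eigenvectors instead recovers the original data exactly
P_full = eigvecs.T
X_exact = (P_full.T @ (P_full @ X_adj.T)).T + mean
```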

This brings us to the end of an initial look into PCA.  If you have any questions or comments regarding this blog post or the github code, please leave a comment and I will do my best to answer them.  Any corrections or suggestions for improving the post are more than welcome.

Cheers,

tinhatben

References

L. I. Smith, “A Tutorial on Principal Components Analysis,” 2002.