MCB112: Biological Data Analysis (Fall 2017)
 goals this week
 what is dimensionality?
 enter PCA
 line of best fit
 covariance
 eigenvectors and eigenvalues
 solving a toy 2D problem
 taking it to N dimensions
 projections
 reconstructions
 how many components?
 pca won’t won’t always be clear
 further reading
week 11:
principal components & dimensionality reduction
“Distress not yourself if you cannot at first understand the deeper mysteries of Spaceland. By degrees they will dawn upon you.” – Edwin A. Abbott, Flatland: A Romance of Many Dimensions
(These notes were originally written by Tim Dunn, for MCB112 in fall 2016.)
goals this week
Last week, we saw nonnegative matrix factorization, a technique for decomposing a data matrix into components that illuminate hidden, simpler structure in the data. This week we’ll be learning about another matrix decomposition technique called principal component analysis (PCA).
PCA is a classic tool for data analysis. It is frequently used for data visualization in RNAseq analysis, but was developed first in physics and statistics, and can be traced back to a 1901 paper by Karl Pearson, where he discusses the “line of best fit.”
Proving why PCA works is beyond the scope of the course, and even the linear algebra needed to describe how it works is confusing (at least to me!) and would take longer to properly explain than we have. PCA is an example of a method that tends to get used without understanding it, and unfortunately, we’re not going to do much better this week, so I’m not throwing any stones in my glass house. Instead, our goal is to:

explain what PCA is trying to do

make sure you know how to do PCA analyses in Python

give you just enough of the linear algebra that you see that it’s something worth looking deeper into, if you take a subsequent stats or linear algebra course (or if you want to study it on your own).
The main reason we use PCA in data analysis is to reduce the dimensionality of our data.
what is dimensionality?
Suppose we have a tidy MxN data table where we have measured N variables (or features), in columns, for M observations, in rows. For example, we might have singlecell RNAseq data for M individual cells, measuring mapped read counts for N different genes in each cell.
We can imagine each observation (each cell) as a point in an Ndimensional Euclidean space. For example, in pset 06 we did kmeans clustering of single cell RNAseq data in two dimensions, for two genes kiwi and caraway.
In real RNAseq data, the number of genes N is much larger than 2. Many other sorts of biological experiments these days generate “highdimensional” data too. As soon as N gets bigger than 2 or 3, the dimensionality of the “space” that we’re plotting our points (observations) in becomes difficult for us to visualize, and it’s hard to see how points are clustered or correlated.
Here’s something to watch out for. I might be interested in how cells are related to each other, in gene expression space; but I could also be interested in how genes are related to each other, in cell type space. When you look at someone’s PCA plot of their RNAseq data, one first thing to think about is, what are the points  genes, or cells? I’m going to talk about the observations (points) being cells, but we could just as easily transpose the analysis problem and talk about the cells as the dimensions (features), and genes as the observations, and we’d be looking at how genes cluster in the Mdimensional space defined by expression values in different cells.
There are a variety of techniques for reducing the dimensionality of highdimensional data, so it can be visualized and studied more easily. PCA is one of the most commonly used.
enter PCA
Figure from [Scholz 2006]. PCA aims to find projections of the data onto a new set of axes that best reveal underlying structure. ('open image in new tab' to embiggen.)
PCA aims to find a rotation of the original N axes of your data, to a new set of N orthogonal axes. The first axis is the vector that captures the highest variance in your data, projected along that vector. The second axis is the orthogonal vector that captures the next highest variance, and so on.
These axes are called principal components (PCs). Each PC captures a fraction of the overall variance in your data; the sum of these fractional variances equals the total variance in your data.
If you have fewer observations than dimensions (M < N), then only the first M principal components will capture nonzero variance, for the same reason that you can always perfectly fit a 2D line to two points.
Some applications of PCA in biological data anlysis include:

determining the intrinsic dimensionality of a system [see Stephens et al. (2008) for an interesting example]. How many underlying features are generating the observed data? While we might collect data from a high number of variables, it’s possible that there are a small number of underlying processes generating the data. PCA is good at finding these underlying processes when they are independent. For instance, PCA on the Adler data set from last week’s pset immediately determines the correct number of underlying gene batteries.

denoising data [see Alter et al. for an example]. Measurements are often confounded by experimental artifacts that can obscure the types of signals that we care about. By emphasizing features of the data that show patterns across observations and variables, PCA can pick out real underlying structure and remove underlying, uncorrelated noise.

visualizing clusters. In the kmeans pset in week 6, we looked for clusters in gene expression patterns based on two genes, where it was easy to directly visualize the distributions of interest. But how can we see clusters when considering thousands of genes? Let’s say, in a simple case, there are only two genes out of thousand that are differentially expressed between clusters. If we can find these two genes and scatter plot them, then we might be able to see our clusters. But how can we find them, and how can we be sure that they are the only genes clustering out?
line of best fit
Figure from [Pearson 1901]. On lines and planes of closest fit to systems of points in space. ('open image in new tab' to embiggen.)
A place to start thinking about PCA is in two dimensions, with what Pearson called “the line of best fit” in a 1901 paper.
When we talked about regression, we talked about having an independent variable and a dependent variable , such that our observation of was a function of plus some (typically Gaussian) noise: , for linear regresssion. If we switched the dependent and independent variable, now looking to fit , in general we won’t get the same regression line.
Pearson recognized that in many cases it isn’t clearcut which variable is dependent and which is independent. It could be that they’re really both dependent on some underlying cause, so they covary.
As Pearson put it with a concrete example, “the most probable stature of a man with a given leg length being , the most probable length of leg for a man of stature will not be .” The regression fits are not invertible. Pearson proposed that the “line of best fit” is one without bias towards either variable. But what residuals do we minimize? We can find the line that minimizes the orthogonal distance between each point and the “line of best fit”, which is tantamount to an assumption of Gaussian noise added to the orthogonal directions around an underlying vector.
The “line of best fit” is equivalent to finding a line such that when the data points are projected onto it, the variance along that line is maximal. (Maximizing the variance along the line means minimizing the variance in all the other directions.) Pearson’s line of best fit is the first principal component of a 2D data set. The remaining principle component is forced to be orthogonal to it, so its direction is completely determined by the best fit line.
covariance
How do we find the line of best fit – especially when we’re talking about Ndimensional data, not just 2D? The next thing to think about is what covariance means.
Recall that variance, , was used to describe the spread of a set of observations of a variable with mean :
When we start to work with data in more than 1 dimension (i.e. with more than one variable), it’s possible that, in addition to the variance intrinsic to each of the single dimensions, there is some covariation between the variables. Covariation means there exists some relationship between the variables such that they change together.
The covariance calculation looks very similar to the variance calculation. In two dimensions, with a set of observations of variables and with individual means and , respectively:
With this definition in hand, we can also rewrite our original variance as merely the covariance between a variable and itself:
In N dimensions, we can compare all pairs and obtain a covariance matrix, denoted .
In PCA, our goal is to find axes that describe the most variance (here used generically to refer to any pattern of variability – covariance or otherwise), as we assume these axes may summarize our data in a useful way. We thus start by summarizing the variance and covariance in all dimensions using where each entry is the covariance between two variables (all possible combos). For two variables, x and y, the covariance matrix has four entries:
The diagonal entries are just the variances of of x and y. Also the matrix is symmetric, which you can see by looking back at our equation for covariance. It’s commutative – we’re quantifying the same relationship when we talk about as when we talk about .
When we generalize this matrix to dimensions (variables) , where each entry is :
OK, now go back to thinking about data in two dimensions, in the figure above. When we project the data along the two principal components (i.e. rotated into the basis of these new axes), there is zero covariance between the axes. We have rerepresented our data in terms of linearly uncorrelated basis vectors. One way to think about what PCA does is to find a rotation of the original axes such that there is zero covariance between any two of the new axes. This is the connection to why it becomes a socalled eigendecomposition problem in linear algebra.
eigenvectors and eigenvalues
The principal components are the eigenvectors of the data’s covariance matrix.
What on earth is an eigenvector? An eigenvector is a vector that occupies a special place in linear algebra. It represents a vector that is merely scaled when multiplied by a matrix, so:
\begin{align} \mathbf{Av} = \lambda \mathbf{v} \end{align}
For a 2D matrix, we can spell this out a bit more clearly. If is an eigenvector of , then:
For example, for matrix and vector
=
Thus, is an eigenvector of . That scalar multiple, , can change depending on or and is referred to as the eigenvalue associated with .
How do we find the eigenvectors of some arbitrary matrix , such as our covariance matrix? Linear algebra tells us that a square matrix (with positive eigenvalues) can be factored such that
\begin{align} \mathbf{A} = \mathbf{W} \Lambda \mathbf{W^{1}} \end{align}
where is a diagonalized matrix of the eigenvalues of , i.e.,
and is an matrix whose columns correspond to each eigenvector, , i.e., .
For a very small matrix, we can use the characteristic equation to solve for eigenvalues and eigenvectors by hand. The characteristic equation states that:
\begin{align} \mathrm{det}(\mathbf{A}  \lambda \mathbf{I}) = 0 \end{align}
where is the matrix determinant. For a matrix = ,
and is the identity matrix, , meaning .
Together, then, , so
This is a polynomial with exactly two roots (characteristic roots), which correspond to the eigenvalues of .
Once we know each eigenvalue, the eigenvectors (characteristic vectors) can then be found using our first eigenvector identity:
By plugging in each eigenvalue, , we can solve for its respective eigenvector. Because we define our principal components to be unit vectors, the final step is to normalize the magnitude of each eigenvector to 1,
solving a toy 2D problem
Let’s follow these steps to find the principal components of a toy dataset (from Chapter 1 of A User’s Guide to Principal Components, which offers a cogent description of the process).
PCA on our 2D toy dataset. Individual observations are plotted as red crosses. PC1 and PC2 plotted as black vectors, scaled by the square root their respective eigenvalues. ('open image in new tab' to embiggen.)
Obs. No.  C1  C2 

1  10.0  10.7 
2  10.4  9.8 
3  9.7  10.0 
4  9.7  10.1 
5  11.7  11.5 
6  11.0  10.8 
7  8.7  8.8 
8  9.5  9.3 
9  10.1  9.4 
10  9.6  9.6 
11  10.5  10.4 
12  9.2  9.0 
13  11.3  11.6 
14  10.1  9.8 
15  8.5  9.2 
We start by calculating the covariance matrix for the data.
= =
We then use the characteristic equation to solve for the eigenvalues.
The roots of this polynomial are and , which are the eigenvalues of the covariance matrix .
Knowing our eigenvalues, we then use to solve for our eigenvectors.
=
This is a system of linear equations with two unknowns, and . Because we will end up normalizing our resulting vector, we can solve this system easily by setting and solving the first equation.
Resulting in
And after normalizing,
The same strategy can be used to find that
The eigenvector matrix is thus
And our diagonalized matrix on eigenvalues is
And we can make sure that’s a solution, by plugging back in and verifying that:
taking it to N dimensions
Beyond 2D, using the above approach to solve for $N$ characteristic roots becomes tedious at best. Instead, one uses numerical, iterative procedures that, thankfully, NumPy can take care of for us (see numpy.linalg.eig()).
It turns out that, for several practical considerations, a different decomposition technique called singular value decomposition (SVD) is used to find the eigenvectors and eigenvalues of the covariance matrix. SVD bypasses some numerical errors inherent to the approximations involved in standard eigendecomposition (when calculating the covariance matrix) and also ends up being faster in many cases, especially when the number of observations is less than the number of variables, as it directly truncates the number of components for a matrix (when ) to vectors (instead of the you get from an eigendecomposition of the covariance matrix, with only at most nonzero eigenvalues).
NumPy also has a function
linalg.svd()
that will work directly on our data matrix (instead of the covariance
matrix) and decompose it into a different set of matrices that are
related to the eigenvectors and eigenvalues of the convariance
matrix. Knowing how to implement SVD will be important if you’re ever
working on your own large data sets, so one part of the homework is to
learn how to use scipy’s linalg.svd()
.
It’s pretty simple from a user standpoint. First you center your
data by subtracting the mean of each column (dimension)
to get a centered data matrix . You feed in your
centered data matrix , and out come three decomposed
matrices. You just have to suss out which of the decomposed matrix
outputs from linalg.svd()
to keep, and how they’re indexed.
Recall that our eigendecomposition was:
\begin{align} \mathbf{A} = \mathbf{W} \Lambda \mathbf{W^\top}, \end{align}
where is the covariance matrix of .
With singular value decomposition:
\begin{align} \mathbf{X} = \mathbf{USW^\top} \end{align}
The shared notation here is intentional. In SVD, you decompose the centered data matrix , not the covariance matrix, and one of the outputs is a transposed matrix of your eigenvectors, . The matrix is a diagonal matrix that relates to the diagonal matrix of eigenvalues:
\begin{align} \Lambda = \frac{\mathbf{S}^2}{N1} \end{align}
where is the number of variables (features) in your matrix of data. Matrix is in principle related to the projections of the data onto the principal axes, but we will end up just calculating these ourselves (see below).
A little more technical detail helps to see the relationship between SVD and the eigendecomposition of the covariance matrix. When the data are centered to have zero mean in each dimension, then the covariance matrix is . Its eigendecomposition is SVD gave us a different decomposition . So:
projections
Once you’ve found your principal axes, the next step is to project your data onto them. These projections are commonly referred to as “scores” (i.e. how did the data “score” on a particular axis). A projection onto each PC is just the dot product of each principal component vector and a row of your data matrix.
Note that the dimensions need to line up correctly. For each single principal component vector, , of length , of an centered data set (with a covariance matrix of dimensions ), there are principal scores (i.e. a score vector, , of length , for each of PCs) representing the position of each data point on just this PC. This makes sense because you should have as many projected points as you have actual points,
You can also calculate an entire score matrix by taking the dot product of your data matrix and each principal component, e.g. the eigenvector matrix:
\begin{align} \mathbf{S} = \mathbf{X}\mathbf{V} \end{align}
If each column of is a principal component, then the columns of will be , i.e. the scores for an individual principal component.
Now we can see how to reduce the dimensionality of our data. Instead of projecting our data onto all the eigenvectors, we can project it onto just a few. We order the principal components in decreasing order of their eigenvalues, and visualize our data in 2D (say) by projecting the data onto the first two PCs, by using only the leading two columns of .
reconstructions
Another useful thing we can do is to project the data down onto a subset of the principal axes, then reconstruct it back into the original Nspace. When we use fewer principal components than there are variables, then we get a new representation of the data that appears to have the same dimensional structure () but whose values are determined only by the selected principal components. This can be useful for denoising data, when we think that the top few PCs are signal, and the rest of the variance is noise. By reconstructing the data only from projections onto axes that represent our signal, we can discard uncorrelated components of the data that may otherwise obscure our features of interest.
To reconstruct the data using a single component:
To reconstruct the data using multiple components, we take the sum over reconstructions from individual components,
Since the data were centered, then to do a true reconstruction, the column means (over observations) must be added back to turn the reconstructed into a reconstructed .
how many components?
A full eigen decomposition of the covariance matrix will result in as many eigenvectors as there are variables. To completely describe the data though, you actually only need components for a matrix with . This may be a bit counterintuitive. One way to see it is geometrical. Imagine that you made two observations of three variables. We can plot these two observations as two points within 3D space. Now imagine the line that explains most of their variance; it passes through both points. That’s principal component 1. But what’s principal component 2? There’s no more variance left to explain. A similar problem occurs when there are three points in 3D. The points are coplanar, so the first PC lies in the plane, and the second (orthogonal) PC also lies in the plane. There’s no other place where it makes sense to draw new axes.
But more than that issue, we’re not aiming to fully reconstruct the data. PCA has (hopefully) generated a new projection of our data using much fewer dimensions. How do we know if it has succeeded? How many of our eigenvectors we should be taking seriously? This is where we can analyze the eigenvalues, which are the variance along each principal component.
It’s common, then, to plot the eigenvalues, or the cumulative sum of the normalized eigenvalues. The latter graph shows the proportion of variance explained by the first K principal components. Often this is used to motivate retaining some subset of components for analysis. For example, if the first four principal components describe ~100% of the variance (like in the Adler data set), you can retain just the first four components without losing any sleep. But often it’s not so clear.
Another strategy is to compare the eigenvalues from the real data set to the eigenvalues expected from a matrix containing only random noise (ideally on the same spectrum as the noise in the real data). Any components above what is expected by chance alone can be emphasized.
pca won’t won’t always be clear
You may have realized that some aspects of PCA, especially choosing the right number of components, can be a bit of an art. Because of this, PCA is sometimes considered to be more of a first wave data exploration tool than a rigorous description of the data.
One common problem is that your effect or clusters will be hidden in a component that explains a relatively small amount of the variance, making it easy to miss (if your firstpass involves throwing out low variance components). It can thus be useful to explore smaller components. But the more components you explore, the less you’ve effectively reduced your dimensionality. ad hoc tests on the eigenvalues can help with these, but they aren’t perfect.
further reading
Joliffe and Cadima, 2016 is a nice, terse review of PCA. I still feel that my own understanding of PCA remains weak, so if you think my notation may be messed up at some point, it’s quite possible I’ve made a mistake; refer to a source like this to verify.