section week 12: principal components & dimensionality reduction

by Verena Volf, 11/30/2018

You can also download this page in Jupyter notebook format.

Basic linear algebra

a.) Transposing a matrix

The transpose of matrix $A$ is denoted as $A^{\rm T}$.

The transpose of an $m$ × $n$ matrix $A$ is the $n$ × $m$ matrix $A^{\rm T}$ whose rows are the columns of $A$ and whose columns are the rows of $A$.

In [28]:
import numpy as np 

A = np.array([[1, 2, 3], [4, 5, 6]])
print(A)
[[1 2 3]
 [4 5 6]]
In [82]:
A_T = A.transpose()
A_T
Out[82]:
array([[1, 4],
       [2, 5],
       [3, 6]])
In [36]:
A.shape
Out[36]:
(2, 3)
In [37]:
A_T.shape
Out[37]:
(3, 2)

b.) Calculating the dot product of two vectors

In [38]:
a = [1, 2, 3]
b = [-2, 0, 5]
np.dot(a, b) # 1*(-2) + 2*0 + 3*5
Out[38]:
13

If two vectors are perpendicular (orthogonal), their dot product is equal to zero.
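For example (a quick check of my own; the perpendicular pair below is arbitrary):

In [ ]:
# (2, 1) and (-1, 2) are perpendicular, so their dot product is 0
u = [2, 1]
v = [-1, 2]
np.dot(u, v)  # 2*(-1) + 1*2 = 0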

c.) Matrix multiplication

In matrix multiplication, each entry of the product matrix is the dot product of a row of the first matrix and a column of the second matrix. For this to be defined, the number of columns of the first matrix must equal the number of rows of the second matrix.

Here, we multiply two 2 x 2 matrices.

In [50]:
a = np.array(([3, 1], [8, 2]))
a
Out[50]:
array([[3, 1],
       [8, 2]])
In [51]:
b = np.array(([6, 1], [7, 9]))
b
Out[51]:
array([[6, 1],
       [7, 9]])
In [53]:
# in Python, the operator '@' can be used for matrix multiplication
s = a @ b
s
Out[53]:
array([[25, 12],
       [62, 26]])
In [84]:
# Let's confirm the result.

s_2 = np.zeros((2, 2))
s_2[0,0] = a[0, 0] * b[0, 0] + a[0, 1] * b[1, 0] 
s_2[0,0]
Out[84]:
25.0
In [85]:
s_2[0,1] = a[0, 0] * b[0, 1] + a[0, 1] * b[1, 1]
s_2[0,1]
Out[85]:
12.0
In [86]:
s_2[1,0] = a[1, 0] * b[0, 0] + a[1, 1] * b[1, 0]
s_2[1,0]
Out[86]:
62.0
In [87]:
s_2[1,1] = a[1, 0] * b[0, 1] + a[1, 1] * b[1, 1]
s_2[1,1]
Out[87]:
26.0
In [88]:
s_2
Out[88]:
array([[25., 12.],
       [62., 26.]])

d.) Diagonal of a matrix

In [63]:
np.diag(s)
Out[63]:
array([25, 26])

1.) PCA overview

We start with an $n×p$ data matrix. The $p$ columns are our measured variables, and the $n$ rows are our observations.

We can plot each of our observations as a point in a $p$-dimensional Euclidean space. The dimensions here can, for example, be expression levels of genes in an RNA sequencing experiment, measured in $n$ individual cells or experiments.

In a two-dimensional space it is easy to see how points are correlated just by looking at the plotted observations. Think for example of homework assignment 06, where you plotted single-cell RNA-seq data in the two dimensions given by the genes kiwi and carraway. In high-dimensional data such as RNA-seq, there are thousands of genes $p$, and we need to find another approach if we want to visualize the correlations in our observations. PCA is one of the techniques that aims at reducing the dimensionality of high-dimensional data.

PCA does so by finding a set of orthogonal vectors that maximize the variance of the projected data. The first principal component, PC1, corresponds to the vector that, when the data points are projected onto it, captures the highest variance. The second principal component is a vector orthogonal to PC1 which captures the next highest variance, and so on.

Let's visualize this in two-dimensional space:

(Animated figure: a line through the origin is rotated to fit the data; the projected points spread further from the origin as the fit improves. From the Stack Exchange post referenced in the next cell.)

In [75]:
#this image was taken from
#https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
#Image(url="Q7HIP.gif")

We can think of this as drawing a line through the origin and rotating it to find the best fit. To quantify how good the fit of a line is, we project the data points onto the line. Then we can either measure the distances from the data points to the line and minimize them, or we can look for the orientation that maximizes the distances from the projected points to the origin. These two options are equivalent: in the animation above, the distances from the projected points to the origin grow as the distances from the data points to the line shrink.

The best-fitting line points along the eigenvector with the highest eigenvalue. The eigenvalue is (up to a normalization by $n-1$, which we will see below) the sum of squared distances from the projected data points to the origin.
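To make this concrete, here is a small numerical sketch of my own (not part of the homework): we generate some made-up correlated 2D data, center it, and sweep the angle of a line through the origin. For every angle, the sum of squared distances of the projected points to the origin plus the sum of squared distances of the points to the line equals the total sum of squares, so maximizing one is the same as minimizing the other.

In [ ]:
# Sketch: rotate a line through the origin and track the two sums of squares.
# The data set and angle grid below are made up purely for illustration.
rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
data = np.column_stack([x1, 0.6 * x1 + rng.normal(scale=0.4, size=200)])
data = data - data.mean(axis=0)                    # center the data first

total_ss = np.sum(data ** 2)                       # total sum of squares (fixed)
angles = np.linspace(0, np.pi, 180)
proj_ss = np.zeros(len(angles))
for i, theta in enumerate(angles):
    w = np.array([np.cos(theta), np.sin(theta)])   # unit vector along the line
    scores = data @ w                              # projections onto the line
    proj_ss[i] = np.sum(scores ** 2)               # SS of projected points to the origin
resid_ss = total_ss - proj_ss                      # by Pythagoras: SS of points to the line

best = np.argmax(proj_ss)                          # same angle that minimizes resid_ss
print("best angle (degrees):", np.degrees(angles[best]))
print("max projected SS:", proj_ss[best], "  residual SS there:", resid_ss[best])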

The eigenvector is a unit vector in the direction of the principal component: it has a length of 1, $\|w_j\|=1$. It is a linear combination of the original dimensions.

We can now look for vectors orthogonal to PC1 and repeat the procedure to find the eigenvector corresponding to PC2, which has the second-highest eigenvalue. Since the eigenvectors are orthogonal to each other, the dot product of any pair $j,k$ of them is $w_j \cdot w_k = 0$.

We are working with a 2D data set here, and we know that the remaining principal component is orthogonal to the first, so the direction of the second component is already determined by the first line.

Singular Value Decomposition Matrix Calculation

In class we discussed two different decomposition methods to do PCA: eigendecomposition and singular value decomposition (SVD). I am only going to focus on SVD here, which is the method that you are going to use in the homework assignment. In SVD, we decompose the original matrix, as opposed to the covariance matrix, which we would use in some other techniques. $X$ is an $n×p$ data matrix. Its rank $r$ is the dimensionality of the space spanned by the data matrix $X$: $r \leq \min(n,p)$.
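For example (a tiny check of my own on a made-up matrix; the rank drops below $\min(n,p)$ when rows or columns are linearly dependent):

In [ ]:
# the rank of a matrix can be computed with np.linalg.matrix_rank
M = np.array([[1., 2., 3.],
              [2., 4., 6.],   # = 2 * row 1
              [1., 0., 1.],
              [2., 2., 4.]])  # = row 1 + row 3
print(M.shape, np.linalg.matrix_rank(M), min(M.shape))  # rank 2 <= min(4, 3)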

1.) Center the data

PCA by definition works on centered values. Therefore, we first center our $n×p$ data matrix $X$ by subtracting the column mean in each dimension.

$$x^{*}_{ij} = x_{ij} - \bar{x}_{j}$$

When we translate the data from its actual mean to a mean of 0 in all dimensions, the positions of the data points relative to each other stay the same.
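In numpy this is a one-liner thanks to broadcasting (a minimal sketch on a made-up 4×3 matrix; the variable names are my own):

In [ ]:
# center a small made-up data matrix by subtracting each column's mean
X = np.array([[2., 0., 1.],
              [4., 2., 5.],
              [6., 7., 3.],
              [8., 6., 7.]])
X_centered = X - X.mean(axis=0)    # subtract the column means (broadcasting)
print(X_centered.mean(axis=0))     # every column mean is now 0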

2.) Singular value decomposition

Next, we decompose the $n×p$ centered matrix $X^{*}$, using singular value decomposition, into the product $U \cdot S \cdot W^{\rm T}$.

$$ X^{*} = USW^{\rm T}$$

Matrix $U$ is an $n×r$ matrix.

Matrix $S$ is a diagonal $r×r$ matrix with the singular values on its diagonal.
It relates to the eigenvalues as follows: $$\Lambda = \frac{S^{2}}{n-1}$$

Matrix $W^{\rm T}$ is an $r×p$ matrix, where $r$ is the rank of $X$. Its rows are the eigenvectors. It can be transposed to $W$, with dimensions $p×r$. Each column of matrix $W$ is a $p$-dimensional eigenvector $w_j$, for a total of $r$ eigenvectors with nonzero eigenvalues.
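Putting the shapes together, the decomposition looks like this:

$$\underbrace{X^{*}}_{n \times p} \;=\; \underbrace{U}_{n \times r}\,\underbrace{S}_{r \times r}\,\underbrace{W^{\rm T}}_{r \times p}$$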

Singular value decomposition in Python

In Python, singular value decomposition can be performed with np.linalg.svd(), which returns the three matrices above (the singular values in $S$ come back as a 1-D array rather than as a diagonal matrix). When doing the homework, pay attention that the dimension of the $W^{\rm T}$ matrix is indeed $r×p$. Take a look at the online documentation: you might need to set full_matrices to False to get it to work.
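Here is a minimal sketch, reusing the small centered matrix X_centered from the centering example above (the variable names are my own). It checks the shapes, the reconstruction $X^{*} = USW^{\rm T}$, the orthonormality of the eigenvectors, and the eigenvalue relation $\Lambda = S^{2}/(n-1)$ against the covariance matrix:

In [ ]:
# SVD of the small centered matrix from the centering sketch above
U, S, W_T = np.linalg.svd(X_centered, full_matrices=False)
print(U.shape, S.shape, W_T.shape)     # note: S comes back as a 1-D array

# X* = U S W^T  (np.diag(S) turns the singular values back into a diagonal matrix)
print(np.allclose(U @ np.diag(S) @ W_T, X_centered))

# the rows of W_T are the eigenvectors; they are orthonormal unit vectors
print(np.allclose(W_T @ W_T.T, np.eye(W_T.shape[0])))

# eigenvalues: S^2 / (n-1) matches the eigenvalues of the covariance matrix
n = X_centered.shape[0]
cov_eig = np.sort(np.linalg.eigvalsh(np.cov(X_centered, rowvar=False)))[::-1]
print(np.allclose(S ** 2 / (n - 1), cov_eig))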

Handwritten digits example

In this demonstration we will use PCA to analyze handwritten digits and remove noise. I am going to use a different PCA function than the one that you will use for your homework.

The data set we are analyzing contains a total of 1797 handwritten digits, each with a dimension of 64. Each of the 64 dimensions represents a pixel value. Let's take a look at the data.

In [65]:
# the handwritten digits demonstration is adapted from the 2017 MCB112 section by James Xue


# parts of the code were adapted from the Python Data Science Handbook:
#https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html

%matplotlib inline
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from IPython.display import HTML, Image
from sklearn.datasets import load_digits
digits = load_digits()

# taking a peek at the data (the first 10 digits)
print(digits.data.shape)

digits.data[0:10,:]
(1797, 64)
Out[65]:
array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
        15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
        12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
         0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
        10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.],
       [ 0.,  0.,  0., 12., 13.,  5.,  0.,  0.,  0.,  0.,  0., 11., 16.,
         9.,  0.,  0.,  0.,  0.,  3., 15., 16.,  6.,  0.,  0.,  0.,  7.,
        15., 16., 16.,  2.,  0.,  0.,  0.,  0.,  1., 16., 16.,  3.,  0.,
         0.,  0.,  0.,  1., 16., 16.,  6.,  0.,  0.,  0.,  0.,  1., 16.,
        16.,  6.,  0.,  0.,  0.,  0.,  0., 11., 16., 10.,  0.,  0.],
       [ 0.,  0.,  0.,  4., 15., 12.,  0.,  0.,  0.,  0.,  3., 16., 15.,
        14.,  0.,  0.,  0.,  0.,  8., 13.,  8., 16.,  0.,  0.,  0.,  0.,
         1.,  6., 15., 11.,  0.,  0.,  0.,  1.,  8., 13., 15.,  1.,  0.,
         0.,  0.,  9., 16., 16.,  5.,  0.,  0.,  0.,  0.,  3., 13., 16.,
        16., 11.,  5.,  0.,  0.,  0.,  0.,  3., 11., 16.,  9.,  0.],
       [ 0.,  0.,  7., 15., 13.,  1.,  0.,  0.,  0.,  8., 13.,  6., 15.,
         4.,  0.,  0.,  0.,  2.,  1., 13., 13.,  0.,  0.,  0.,  0.,  0.,
         2., 15., 11.,  1.,  0.,  0.,  0.,  0.,  0.,  1., 12., 12.,  1.,
         0.,  0.,  0.,  0.,  0.,  1., 10.,  8.,  0.,  0.,  0.,  8.,  4.,
         5., 14.,  9.,  0.,  0.,  0.,  7., 13., 13.,  9.,  0.,  0.],
       [ 0.,  0.,  0.,  1., 11.,  0.,  0.,  0.,  0.,  0.,  0.,  7.,  8.,
         0.,  0.,  0.,  0.,  0.,  1., 13.,  6.,  2.,  2.,  0.,  0.,  0.,
         7., 15.,  0.,  9.,  8.,  0.,  0.,  5., 16., 10.,  0., 16.,  6.,
         0.,  0.,  4., 15., 16., 13., 16.,  1.,  0.,  0.,  0.,  0.,  3.,
        15., 10.,  0.,  0.,  0.,  0.,  0.,  2., 16.,  4.,  0.,  0.],
       [ 0.,  0., 12., 10.,  0.,  0.,  0.,  0.,  0.,  0., 14., 16., 16.,
        14.,  0.,  0.,  0.,  0., 13., 16., 15., 10.,  1.,  0.,  0.,  0.,
        11., 16., 16.,  7.,  0.,  0.,  0.,  0.,  0.,  4.,  7., 16.,  7.,
         0.,  0.,  0.,  0.,  0.,  4., 16.,  9.,  0.,  0.,  0.,  5.,  4.,
        12., 16.,  4.,  0.,  0.,  0.,  9., 16., 16., 10.,  0.,  0.],
       [ 0.,  0.,  0., 12., 13.,  0.,  0.,  0.,  0.,  0.,  5., 16.,  8.,
         0.,  0.,  0.,  0.,  0., 13., 16.,  3.,  0.,  0.,  0.,  0.,  0.,
        14., 13.,  0.,  0.,  0.,  0.,  0.,  0., 15., 12.,  7.,  2.,  0.,
         0.,  0.,  0., 13., 16., 13., 16.,  3.,  0.,  0.,  0.,  7., 16.,
        11., 15.,  8.,  0.,  0.,  0.,  1.,  9., 15., 11.,  3.,  0.],
       [ 0.,  0.,  7.,  8., 13., 16., 15.,  1.,  0.,  0.,  7.,  7.,  4.,
        11., 12.,  0.,  0.,  0.,  0.,  0.,  8., 13.,  1.,  0.,  0.,  4.,
         8.,  8., 15., 15.,  6.,  0.,  0.,  2., 11., 15., 15.,  4.,  0.,
         0.,  0.,  0.,  0., 16.,  5.,  0.,  0.,  0.,  0.,  0.,  9., 15.,
         1.,  0.,  0.,  0.,  0.,  0., 13.,  5.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  9., 14.,  8.,  1.,  0.,  0.,  0.,  0., 12., 14., 14.,
        12.,  0.,  0.,  0.,  0.,  9., 10.,  0., 15.,  4.,  0.,  0.,  0.,
         3., 16., 12., 14.,  2.,  0.,  0.,  0.,  4., 16., 16.,  2.,  0.,
         0.,  0.,  3., 16.,  8., 10., 13.,  2.,  0.,  0.,  1., 15.,  1.,
         3., 16.,  8.,  0.,  0.,  0., 11., 16., 15., 11.,  1.,  0.],
       [ 0.,  0., 11., 12.,  0.,  0.,  0.,  0.,  0.,  2., 16., 16., 16.,
        13.,  0.,  0.,  0.,  3., 16., 12., 10., 14.,  0.,  0.,  0.,  1.,
        16.,  1., 12., 15.,  0.,  0.,  0.,  0., 13., 16.,  9., 15.,  2.,
         0.,  0.,  0.,  0.,  3.,  0.,  9., 11.,  0.,  0.,  0.,  0.,  0.,
         9., 15.,  4.,  0.,  0.,  0.,  9., 12., 13.,  3.,  0.,  0.]])

Each row is an image with 64 pixels; each column holds the value of one pixel. Let's take a look at 40 of the digits.

In [66]:
def plot_digits(data):
    fig, axes = plt.subplots(4, 10, figsize=(10, 4),
                             subplot_kw={'xticks':[], 'yticks':[]},
                             gridspec_kw=dict(hspace=0.1, wspace=0.1))
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].reshape(8, 8),
                  cmap='binary', interpolation='nearest',
                  clim=(0, 16))
plot_digits(digits.data)
In [68]:
pca = PCA(2)  # reduce from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
print(digits.data.shape)
print(projected.shape)
plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('Spectral', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();
(1797, 64)
(1797, 2)

Above, we can see how the digits can be captured by the first 2 components. Now, let's visualize an example of reconstructing the data using the first 8 components.

In [78]:
def plot_pca_components(x, coefficients=None, mean=0, components=None,
                        imshape=(8, 8), n_components=8, fontsize=12,
                        show_mean=True):
    if coefficients is None:
        coefficients = x
        
    if components is None:
        components = np.eye(len(coefficients), len(x))
        
    mean = np.zeros_like(x) + mean
        

    fig = plt.figure(figsize=(1.2 * (5 + n_components), 1.2 * 2))
    g = plt.GridSpec(2, 4 + bool(show_mean) + n_components, hspace=0.3)

    def show(i, j, x, title=None):
        ax = fig.add_subplot(g[i, j], xticks=[], yticks=[])
        ax.imshow(x.reshape(imshape), interpolation='nearest')
        if title:
            ax.set_title(title, fontsize=fontsize)

    show(slice(2), slice(2), x, "True")
    
    approx = mean.copy()
    
    counter = 2
    if show_mean:
        show(0, 2, np.zeros_like(x) + mean, r'$\mu$')
        show(1, 2, approx, r'$1 \cdot \mu$')
        counter += 1

    for i in range(n_components):
        approx = approx + coefficients[i] * components[i]
        show(0, i + counter, components[i], r'$w_{0}$'.format(i + 1))
        show(1, i + counter, approx,
             r"${0:.2f} \cdot w_{1}$".format(coefficients[i], i + 1))
        if show_mean or i > 0:
            plt.gca().text(0, 1.05, '$+$', ha='right', va='bottom',
                           transform=plt.gca().transAxes, fontsize=fontsize)

    show(slice(2), slice(-2, None), approx, "Approx")
    return fig


pca = PCA(n_components=8)
Xproj = pca.fit_transform(digits.data)
sns.set_style('white')
fig = plot_pca_components(digits.data[10], Xproj[10],
                          pca.mean_, pca.components_)

The panel on the left is the true image. The panel on the right is the image reconstructed using only the first eight principal components. Notice that the mean, $\mu$, is added first: since we subtracted the column means from the original data matrix, we need to add them back to reconstruct the data. The first row shows the eigenvectors; $w_i$ is the $i$th column of the eigenvector matrix $W$. The second row shows the running sum as each eigenvector, multiplied by its score, is added in.
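As a quick sanity check (my own addition; it assumes whiten=False, which is sklearn's default), the reconstruction is just the mean plus the scores times the eigenvectors:

In [ ]:
# reconstruct digit 10 by hand: mean + sum_i (score_i * w_i)
manual = pca.mean_ + Xproj[10] @ pca.components_
print(np.allclose(manual, pca.inverse_transform(Xproj[10:11])[0]))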

Another way that PCA can be used is to remove noise. Let's add random noise to the handwritten digits data, keep only the leading principal components (here we reuse the eight-component PCA fitted on the clean data above), and then reconstruct the data.

In [79]:
np.random.seed(42)
# add Gaussian noise (standard deviation 4) to each pixel value
noisy = np.random.normal(digits.data, 4)
plot_digits(noisy)

Reconstruction

In [80]:
components = pca.transform(noisy)             # project noisy digits onto the PCs fitted above
filtered = pca.inverse_transform(components)  # map back to 64-dimensional pixel space
plot_digits(filtered)

We see that after reconstruction, it looks as if most of the noise has been removed.