Week 10 Section: Principal Components and Dimensionality Reduction

Dan Greenfield (11/10/19), with inspiration from Verena Volf’s 2018 notes.

You can also download these notes in .ipynb format for Jupyter notebook.

Linear Algebra Review

1. Matrix Transpose

Given an $$n \times m$$ matrix $$A$$ ($$n$$ rows, $$m$$ columns), its transpose $$A^{T}$$ is an $$m \times n$$ matrix.

How? The rows of $$A$$ become the columns of $$A^{T}$$ (equivalently, the columns of $$A$$ become the rows of $$A^{T}$$).

import numpy as np

A = np.array([[1, 2], [4, 5], [7, 8]])

A
array([[1, 2],
[4, 5],
[7, 8]])
A.T 
array([[1, 4, 7],
[2, 5, 8]])
A.transpose()
array([[1, 4, 7],
[2, 5, 8]])
print('Matrix A has original shape {}, and A transpose has shape {}'.format(A.shape, A.T.shape))
Matrix A has original shape (3, 2), and A transpose has shape (2, 3)

Both ‘.T’ and ‘.transpose()’ are valid ways to compute a matrix transpose. It is always good to double-check the dimensions (shape) of your matrices to ensure nothing went awry.

2. Dot Products

Dot products are the basis of matrix multiplication. Let’s evaluate a base case of a dot product between two vectors:

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

a, b
(array([1, 2, 3]), array([4, 5, 6]))
np.dot(a, b)  # 1*4 + 2*5 + 3*6
32

3. Matrix Multiplication

Matrix multiplication uses the dot product to evaluate the product of two matrices: it takes dot products between the rows of the first matrix and the columns of the second.

The number of columns in the first matrix MUST EQUAL the number of rows in the second matrix in order to multiply two matrices.

If the above condition is satisfied, i.e. $$M_{1}$$ has dimensions $$d_{1} \times n$$ and $$M_{2}$$ has dimensions $$n \times d_{2}$$, then the output of the multiplication is a matrix $$M_{3}$$ with dimensions $$d_{1} \times d_{2}$$.

Example:

A = np.array([[3, 1], [8, 2]])

B = np.array([[6, 1], [7, 9]])

A
array([[3, 1],
[8, 2]])
B
array([[6, 1],
[7, 9]])
A @ B 
array([[25, 12],
[62, 26]])

Let’s explicitly calculate each element of the matrix

M = np.zeros((2, 2))
M[0,0] = A[0, 0] * B[0, 0] + A[0, 1] * B[1, 0]
M[0,0]
25.0
M[0,1] = A[0, 0] * B[0, 1] + A[0, 1] * B[1, 1]
M[0,1]
12.0
M[1,0] = A[1, 0] * B[0, 0] + A[1, 1] * B[1, 0]
M[1,0]
62.0
M[1,1] = A[1, 0] * B[0, 1] + A[1, 1] * B[1, 1]
M[1,1]
26.0
M
array([[25., 12.],
[62., 26.]])

We see that our element-wise calculation matches Python's. This is comforting.

4. Diagonal of a Matrix

This is simple: it is just the elements along the diagonal, starting at the top left element down to the bottom right.

np.diag(M)
array([25., 26.])

PCA

Some notes:

1. PCA is a common method of dimensionality reduction

2. At a high level, PCA finds vectors that best fit the data by encompassing as much variance as possible

3. Example in 2D below (figure from https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues)

PCA reduces dimensionality by finding a set of orthogonal vectors that maximize the variance. Your first principal component, PC1, corresponds to the vector that, when data points are projected onto it, captures the highest variance. The second principal component is a vector orthogonal to PC1 that captures the next highest variance, and so on.

There are two main ways that PCA is implemented: eigendecomposition and singular value decomposition.

First, let’s talk about eigendecomposition. Say we are working with 2D data, as above. Then, there are 2 eigenvectors of the data that can be calculated, along with their eigenvalues. The largest eigenvalue corresponds to the line/vector that best fits the data; the smaller one corresponds to the vector orthogonal to the first. The mathematical definition of an eigenvalue/eigenvector pair is:

$$A v = \lambda v$$

Here, $$v$$ is our eigenvector and $$\lambda$$ is a scalar (the eigenvalue). What this really means is that the product of our matrix $$A$$ and eigenvector $$v$$ is the same as eigenvector $$v$$ times a scalar. Thus, the eigenvalue/eigenvector pairs effectively summarize the information in $$A$$. When we have n-dimensional data, we can choose the first few eigenvectors with the largest eigenvalues, and these will be our principal components of the data.
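As a sanity check, here is a minimal sketch (the small symmetric matrix is a made-up example) verifying the defining equation with numpy's np.linalg.eig:

```python
import numpy as np

# A small symmetric matrix (the kind of matrix a covariance matrix would be)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)  # columns of eigvecs are the eigenvectors

# Verify A @ v == lambda * v for each eigenvalue/eigenvector pair
for i in range(len(eigvals)):
    v = eigvecs[:, i]
    assert np.allclose(A @ v, eigvals[i] * v)
```

For this matrix the eigenvalues come out as 3 and 1; note that np.linalg.eig does not guarantee any particular ordering, so in practice you sort by eigenvalue yourself.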

In PCA, these eigenvalues and eigenvectors are computed from the covariance matrix of the data. Iteratively computing covariances and calculating all eigenvalues and eigenvectors for high dimensional data is a slow and tedious task. Thus, SVD is more commonly used as it is faster and still takes eigenvalues and eigenvectors into account.
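The eigendecomposition route described above can be sketched end to end on toy data (the variable names and the random data here are mine, not from the notes): center the data, form the covariance matrix, take its eigendecomposition, sort by decreasing eigenvalue, and project.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # toy data: 100 samples, 3 features

Xc = X - X.mean(axis=0)                # center each column
C = np.cov(Xc, rowvar=False)           # 3 x 3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigh: for symmetric matrices (ascending order)

order = np.argsort(eigvals)[::-1]      # re-sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pcs = eigvecs[:, :2]                   # top 2 eigenvectors = top 2 PCs
projected = Xc @ pcs                   # project the centered data onto the PCs
print(projected.shape)                 # (100, 2)
```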

Singular Value Decomposition (SVD)

SVD differs from eigenvalue decomposition in many ways. Importantly, it operates on the original data matrix itself, rather than the covariance matrix. SVD is a matrix factorization technique.

Before going into SVD background, let’s review one more linear algebra concept: the rank (r) of a matrix.

Matrix rank is defined as the number of nonzero rows of the reduced row echelon form (RREF) of a matrix. RREF uses linear combinations of rows of the data to eliminate non-unique rows. Thus, we can consider the rank of a matrix to be the number of linearly independent (or unique) rows; the row rank and column rank are always equal. This matters because a row that is a linear combination of other rows carries no new information, and carrying out calculations on it would skew our results toward the duplicated directions.
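In numpy, np.linalg.matrix_rank computes this directly; a minimal example (the matrix is made up) with one row that is a multiple of another:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6],    # 2x the first row, so not linearly independent
              [0, 1, 1]])

print(np.linalg.matrix_rank(A))  # 2
```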

Now, what are singular vectors and singular values?

If we consider a matrix $$A$$ and its transpose $$A^T$$, we can look at two different matrix multiplication products: $$AA^T$$ and $$A^TA$$. These matrix products satisfy a few properties:

1. Symmetric
2. Square
3. Positive semi-definite (PSD - all eigenvalues >= 0)
4. Both products have the same nonzero eigenvalues, and all eigenvalues are nonnegative
5. Both products have the same rank as the original matrix A
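These properties can be checked numerically on a made-up rectangular matrix (a sketch, not part of the original notes; tolerances account for floating-point error):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 5))          # random 3 x 5 matrix (rank 3 almost surely)
P1, P2 = A @ A.T, A.T @ A            # 3 x 3 and 5 x 5 products

# 1-2. symmetric (and square by construction)
assert np.allclose(P1, P1.T) and np.allclose(P2, P2.T)
# 3. PSD: all eigenvalues >= 0 (up to floating-point error)
assert np.all(np.linalg.eigvalsh(P1) >= -1e-10)
# 4. the nonzero eigenvalues of the two products agree
e1 = np.sort(np.linalg.eigvalsh(P1))[::-1]
e2 = np.sort(np.linalg.eigvalsh(P2))[::-1]
assert np.allclose(e1, e2[:3])
# 5. both products have the same rank as A
assert np.linalg.matrix_rank(P1) == np.linalg.matrix_rank(A) == 3
```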

In SVD, the columns of $$U$$ are the eigenvectors of $$AA^T$$ (the left singular vectors of A), and the columns of $$V$$ are the eigenvectors of $$A^TA$$ (the right singular vectors of A).

The columns of these matrices have norm 1 and are mutually orthogonal, so the columns of U and V are technically orthonormal.

As mentioned, SVD is a matrix factorization technique. Let’s break down what the factorization entails, assuming a data matrix X with n rows, p columns, and rank r. SVD claims that our matrix X can be decomposed into:

$X = U S V^T$

We know that $$U$$ comes from the eigenvectors of $$XX^T$$ and $$V$$ from the eigenvectors of $$X^TX$$, but what is $$S$$?

S is a diagonal matrix with r = rank(X) nonzero elements corresponding to the square roots of the eigenvalues of $$X^TX$$ (equivalently, of $$XX^T$$). These values in S are called the singular values.

Let’s breakdown this decomposition further, to convince ourselves that our matrix dimensions aren’t messed up.

$$X_{n \times p} = U_{n \times n} \, S_{n \times p} \, V^T_{p \times p}$$

Looking at the right side of this equation, we see the output of our decomposition is indeed an $$n \times p$$ matrix. Good.
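We can confirm these shapes and the factorization itself with np.linalg.svd on made-up data (note that numpy returns the singular values as a 1-D vector, so we rebuild the $$n \times p$$ diagonal matrix S ourselves):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4))               # n = 6, p = 4

U, S, Vt = np.linalg.svd(X)               # full SVD; S is a 1-D vector of singular values
print(U.shape, S.shape, Vt.shape)         # (6, 6) (4,) (4, 4)

Smat = np.zeros((6, 4))
Smat[:4, :4] = np.diag(S)                 # build the n x p diagonal matrix S
assert np.allclose(X, U @ Smat @ Vt)      # X = U S V^T
assert np.allclose(U.T @ U, np.eye(6))    # columns of U are orthonormal
```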

How does this relate to PCA?

For SVD to accurately organize the eigenvectors by decreasing eigenvalues, the data matrix must be centered. This means that our data matrix X is modified by subtracting the column mean in each dimension. This translates the data to a mean of 0 while keeping the positions of the data points relative to one another the same.

Once we compute the SVD, we can take the first k columns of matrix V (the columns are sorted by decreasing singular value), and these columns will be our principal components! Multiplying our centered data X* by this matrix of PCs effectively projects our data along the PCs.
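To make this concrete, here is a minimal sketch on made-up data (the names and the choice of k are mine): center, decompose, and project onto the top k right singular vectors. Note that np.linalg.svd returns $$V^T$$, so the first k columns of V are the first k *rows* of the returned matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))          # toy data: 50 samples, 5 features

Xc = X - X.mean(axis=0)               # center the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
pcs = Vt[:k].T                        # first k rows of V^T = first k columns of V
projected = Xc @ pcs                  # data projected onto the top k PCs
print(projected.shape)                # (50, 2)

# Equivalently, the projection is U scaled by the singular values
assert np.allclose(projected, U[:, :k] * S[:k])
```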

Going through the steps of SVD manually is unnecessary.

You probably could figure out how to implement SVD from scratch, as long as you keep track of the dimensions of your matrices (i.e. the shape of your numpy arrays). However, numpy has a built-in SVD method, np.linalg.svd(), which returns $$U$$, the singular values, and $$V^T$$ by default. Find the online documentation here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html.

This post has plenty of information about SVD and a nice example: https://medium.com/@jonathan_hui/machine-learning-singular-value-decomposition-svd-principal-component-analysis-pca-1d45e885e491

PCA on MNIST data

In this demonstration we will use PCA to analyze handwritten digits and remove noise. I am going to use a different PCA function than the one that you will use for your homework.

The data set we are analyzing contains a total of 1797 handwritten digits, each with a dimension of 64. Each of the 64 dimensions represents a pixel value. Let's take a look at the data.

# demonstration on handwritten digits is adapted from the 2017 MCB112 section by James Xue

# parts of the code were adapted from the Python Data Science Handbook:
# https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html

%matplotlib inline
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from IPython.display import HTML, Image

digits = load_digits()  # 1797 8x8 handwritten digit images

# taking a peek at the data (the first 10 digits)
print(digits.data.shape)

digits.data[0:10,:]
(1797, 64)

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.],
[ 0.,  0.,  0., 12., 13.,  5.,  0.,  0.,  0.,  0.,  0., 11., 16.,
9.,  0.,  0.,  0.,  0.,  3., 15., 16.,  6.,  0.,  0.,  0.,  7.,
15., 16., 16.,  2.,  0.,  0.,  0.,  0.,  1., 16., 16.,  3.,  0.,
0.,  0.,  0.,  1., 16., 16.,  6.,  0.,  0.,  0.,  0.,  1., 16.,
16.,  6.,  0.,  0.,  0.,  0.,  0., 11., 16., 10.,  0.,  0.],
[ 0.,  0.,  0.,  4., 15., 12.,  0.,  0.,  0.,  0.,  3., 16., 15.,
14.,  0.,  0.,  0.,  0.,  8., 13.,  8., 16.,  0.,  0.,  0.,  0.,
1.,  6., 15., 11.,  0.,  0.,  0.,  1.,  8., 13., 15.,  1.,  0.,
0.,  0.,  9., 16., 16.,  5.,  0.,  0.,  0.,  0.,  3., 13., 16.,
16., 11.,  5.,  0.,  0.,  0.,  0.,  3., 11., 16.,  9.,  0.],
[ 0.,  0.,  7., 15., 13.,  1.,  0.,  0.,  0.,  8., 13.,  6., 15.,
4.,  0.,  0.,  0.,  2.,  1., 13., 13.,  0.,  0.,  0.,  0.,  0.,
2., 15., 11.,  1.,  0.,  0.,  0.,  0.,  0.,  1., 12., 12.,  1.,
0.,  0.,  0.,  0.,  0.,  1., 10.,  8.,  0.,  0.,  0.,  8.,  4.,
5., 14.,  9.,  0.,  0.,  0.,  7., 13., 13.,  9.,  0.,  0.],
[ 0.,  0.,  0.,  1., 11.,  0.,  0.,  0.,  0.,  0.,  0.,  7.,  8.,
0.,  0.,  0.,  0.,  0.,  1., 13.,  6.,  2.,  2.,  0.,  0.,  0.,
7., 15.,  0.,  9.,  8.,  0.,  0.,  5., 16., 10.,  0., 16.,  6.,
0.,  0.,  4., 15., 16., 13., 16.,  1.,  0.,  0.,  0.,  0.,  3.,
15., 10.,  0.,  0.,  0.,  0.,  0.,  2., 16.,  4.,  0.,  0.],
[ 0.,  0., 12., 10.,  0.,  0.,  0.,  0.,  0.,  0., 14., 16., 16.,
14.,  0.,  0.,  0.,  0., 13., 16., 15., 10.,  1.,  0.,  0.,  0.,
11., 16., 16.,  7.,  0.,  0.,  0.,  0.,  0.,  4.,  7., 16.,  7.,
0.,  0.,  0.,  0.,  0.,  4., 16.,  9.,  0.,  0.,  0.,  5.,  4.,
12., 16.,  4.,  0.,  0.,  0.,  9., 16., 16., 10.,  0.,  0.],
[ 0.,  0.,  0., 12., 13.,  0.,  0.,  0.,  0.,  0.,  5., 16.,  8.,
0.,  0.,  0.,  0.,  0., 13., 16.,  3.,  0.,  0.,  0.,  0.,  0.,
14., 13.,  0.,  0.,  0.,  0.,  0.,  0., 15., 12.,  7.,  2.,  0.,
0.,  0.,  0., 13., 16., 13., 16.,  3.,  0.,  0.,  0.,  7., 16.,
11., 15.,  8.,  0.,  0.,  0.,  1.,  9., 15., 11.,  3.,  0.],
[ 0.,  0.,  7.,  8., 13., 16., 15.,  1.,  0.,  0.,  7.,  7.,  4.,
11., 12.,  0.,  0.,  0.,  0.,  0.,  8., 13.,  1.,  0.,  0.,  4.,
8.,  8., 15., 15.,  6.,  0.,  0.,  2., 11., 15., 15.,  4.,  0.,
0.,  0.,  0.,  0., 16.,  5.,  0.,  0.,  0.,  0.,  0.,  9., 15.,
1.,  0.,  0.,  0.,  0.,  0., 13.,  5.,  0.,  0.,  0.,  0.],
[ 0.,  0.,  9., 14.,  8.,  1.,  0.,  0.,  0.,  0., 12., 14., 14.,
12.,  0.,  0.,  0.,  0.,  9., 10.,  0., 15.,  4.,  0.,  0.,  0.,
3., 16., 12., 14.,  2.,  0.,  0.,  0.,  4., 16., 16.,  2.,  0.,
0.,  0.,  3., 16.,  8., 10., 13.,  2.,  0.,  0.,  1., 15.,  1.,
3., 16.,  8.,  0.,  0.,  0., 11., 16., 15., 11.,  1.,  0.],
[ 0.,  0., 11., 12.,  0.,  0.,  0.,  0.,  0.,  2., 16., 16., 16.,
13.,  0.,  0.,  0.,  3., 16., 12., 10., 14.,  0.,  0.,  0.,  1.,
16.,  1., 12., 15.,  0.,  0.,  0.,  0., 13., 16.,  9., 15.,  2.,
0.,  0.,  0.,  0.,  3.,  0.,  9., 11.,  0.,  0.,  0.,  0.,  0.,
9., 15.,  4.,  0.,  0.,  0.,  9., 12., 13.,  3.,  0.,  0.]])
def plot_digits(data):
    fig, axes = plt.subplots(4, 10, figsize=(10, 4),
                             subplot_kw={'xticks':[], 'yticks':[]},
                             gridspec_kw=dict(hspace=0.1, wspace=0.1))
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].reshape(8, 8),
                  cmap='binary', interpolation='nearest',
                  clim=(0, 16))
plot_digits(digits.data)
pca = PCA(2)  # reduce from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
print(digits.data.shape)
print(projected.shape)
plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('Spectral', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();
(1797, 64)
(1797, 2)

Above, we can see how the digits are captured by just 2 components. Now, let's visualize an example of reconstructing the data using the first 8 components.

def plot_pca_components(x, coefficients=None, mean=0, components=None,
                        imshape=(8, 8), n_components=8, fontsize=12,
                        show_mean=True):
    if coefficients is None:
        coefficients = x

    if components is None:
        components = np.eye(len(coefficients), len(x))

    mean = np.zeros_like(x) + mean

    fig = plt.figure(figsize=(1.2 * (5 + n_components), 1.2 * 2))
    g = plt.GridSpec(2, 4 + bool(show_mean) + n_components, hspace=0.3)

    def show(i, j, x, title=None):
        ax = fig.add_subplot(g[i, j], xticks=[], yticks=[])
        ax.imshow(x.reshape(imshape), interpolation='nearest')
        if title:
            ax.set_title(title, fontsize=fontsize)

    show(slice(2), slice(2), x, "True")

    approx = mean.copy()

    counter = 2
    if show_mean:
        show(0, 2, np.zeros_like(x) + mean, r'$\mu$')
        show(1, 2, approx, r'$1 \cdot \mu$')
        counter += 1

    for i in range(n_components):
        approx = approx + coefficients[i] * components[i]
        show(0, i + counter, components[i], r'$w_{0}$'.format(i + 1))
        show(1, i + counter, approx,
             r"${0:.2f} \cdot w_{1}$".format(coefficients[i], i + 1))
        if show_mean or i > 0:
            plt.gca().text(0, 1.05, '$+$', ha='right', va='bottom',
                           transform=plt.gca().transAxes, fontsize=fontsize)

    show(slice(2), slice(-2, None), approx, "Approx")
    return fig

pca = PCA(n_components=8)
Xproj = pca.fit_transform(digits.data)
sns.set_style('white')
fig = plot_pca_components(digits.data[10], Xproj[10],
pca.mean_, pca.components_)

The panel on the left is the true image. The panel on the right is the image reconstructed from only the first eight principal components. Notice that the mean, $$\mu$$, is added first: since we subtracted the column means from the original data matrix, we need to add them back to reconstruct the data. The first row shows the eigenvectors; $$w_i$$ represents the $$i$$th column of the eigenvector matrix $$W$$. The middle row shows the incremental sum as each eigenvector, multiplied by its score, is added in.

Another way that PCA can be used is to remove noise. Let's add random noise to the handwritten digits data, keep the top 50 principal components, and then reconstruct the data.

np.random.seed(42)
# add random noise from a normal distribution (sd = 4)
noisy = np.random.normal(digits.data, 4)
plot_digits(noisy)

pca = PCA(n_components=50).fit(noisy)  # refit, keeping the top 50 principal components
components = pca.transform(noisy)
filtered = pca.inverse_transform(components)
plot_digits(filtered)