### PCA class review

With connection to the lecture notes, let's review the general steps of PCA; these may also be helpful steps for tackling the homework. Let $M$ be the number of observations and $D$ be the number of dimensions. Note that for the notes below, we assume $D>M$:

$\textbf{1}$: Start with the data matrix $Y$, which is of size $M$ by $D$ (it may help to take the log of the matrix for PCA to work better).

$\textbf{2}$: Subtract the column means from $Y$ to get $X$.
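As a quick sketch (with a small random matrix standing in for the data), the centering step might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 5, 3
Y = rng.normal(size=(M, D))   # stand-in for the M x D data matrix

# subtract each column's mean so every dimension is centered at zero
X = Y - Y.mean(axis=0)

print(X.mean(axis=0))  # each entry is ~0 up to floating point error
```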

$\textbf{3}$: Perform a decomposition of the matrix $X$.

To decompose $X$, you can use either the eigendecomposition (if you have a square matrix) or the singular value decomposition (SVD, which works for square or rectangular matrices, and is what you will use in your homework):

$\textbf{Eigendecomposition}$ (np.linalg.eigh):

You need to first calculate the covariance matrix $\mathbf{A}$ from $X$: $A=\frac{1}{M}\sum_{j=1}^{M}x_jx_j^\top$. \begin{align} \mathbf{A} = \mathbf{G} \Lambda \mathbf{G^\top} \end{align}

$G$ in this case contains your eigenvectors (as columns) and $\Lambda$ is a square diagonal matrix that contains your eigenvalues.
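A sketch of this route on made-up data (note that np.linalg.eigh returns eigenvalues in ascending order, so we flip them to put the largest first):

```python
import numpy as np

rng = np.random.default_rng(1)
M, D = 100, 4
X = rng.normal(size=(M, D))
X = X - X.mean(axis=0)           # mean-centered data

# covariance matrix A = (1/M) sum_j x_j x_j^T, equivalently X^T X / M
A = X.T @ X / M

# np.linalg.eigh returns eigenvalues in ascending order; flip to descending
eigvals, G = np.linalg.eigh(A)
eigvals, G = eigvals[::-1], G[:, ::-1]

# A = G Lambda G^T still holds for the reordered factors
print(np.allclose(G @ np.diag(eigvals) @ G.T, A))  # True
```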

$\textbf{Singular value decomposition}$ (np.linalg.svd, again what you will use for your homework):

\begin{align} \mathbf{X} = \mathbf{UZW^\top} \end{align}

Note that in SVD, you decompose the original matrix, not the covariance matrix, and one of the outputs is a matrix whose rows are your eigenvectors, $\mathbf{W^\top}$. The matrix $\mathbf{Z}$ is a rectangular diagonal matrix that relates to the diagonal matrix of eigenvalues as follows,

\begin{align} \Lambda = \frac{\mathbf{Z^\top Z}}{M} \end{align}

where $M$ is the number of observations. The matrix $\mathbf{U}$ is related to the projections of the data onto the principal axes. For SVD, when we refer to eigenvalues and eigenvectors, we are still referring to the eigenvalues/eigenvectors of the matrix $\mathbf{A}$; in this case we just don't need to compute $\mathbf{A}$ explicitly. $\mathbf{Z^\top Z}$ is a square matrix with the squared singular values (the squares of the entries along the main diagonal of $\mathbf{Z}$) along its main diagonal.

Note that when you run np.linalg.svd, you may get a $D$ by $D$ matrix for $W^\top$. The first $M$ rows contain your eigenvectors; rows $M+1$ through $D$ do not matter. This is because $Z$ is a rectangular diagonal matrix with the singular values along the main diagonal of its first $M$ columns and all zeros in columns $M+1$ through $D$. Thus, if you carry out the matrix multiplication, rows $M+1$ through $D$ of $W^\top$ have no effect on the resulting matrix. For a more in-depth explanation, see: https://www.mathworks.com/content/dam/mathworks/mathworks-dot-com/moler/eigs.pdf

np.linalg.svd also returns a 1-D array of singular values instead of the full $Z$ matrix: it pulls out the diagonal entries of $Z$ for you.
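As a sanity check on the relation above, we can verify on made-up data that the squared singular values divided by $M$ match the eigenvalues of the covariance matrix (for simplicity this sketch uses $M > D$):

```python
import numpy as np

rng = np.random.default_rng(2)
M, D = 50, 6
X = rng.normal(size=(M, D))
X = X - X.mean(axis=0)

# full SVD: X = U Z W^T; numpy returns the diagonal of Z as the
# 1-D array s of singular values, already sorted in descending order
U, s, Wt = np.linalg.svd(X)

# eigenvalues of A = X^T X / M are the squared singular values over M
eig_from_svd = s**2 / M
eig_direct = np.sort(np.linalg.eigvalsh(X.T @ X / M))[::-1]

print(np.allclose(eig_from_svd, eig_direct))  # True
```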

$\textbf{4}$: Take a look at what your eigenvalues and eigenvectors (or principal components) look like for PCA diagnostics. First, sort the eigenvalues and eigenvectors so that the eigenvector with the largest eigenvalue is the first column of the matrix $W$, the eigenvector with the second-largest eigenvalue is the next column, and so on. If you use np.linalg.svd, the eigenvalues and eigenvectors are already sorted for you, so you don't have to sort them first. Plot the eigenvalues and decide how many PCs (eigenvectors) to keep; assume we keep $L$ ($L\leq D$) components. Also examine the $L$ retained eigenvectors to see the contribution of each dimension to each eigenvector. For example, each eigenvector is of length $D$, with a contribution from each of the individual dimensions $1,\dots,D$. It may help to plot the eigenvectors to visualize the contributions.
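One common heuristic for choosing $L$ is to keep enough components to explain some target fraction of the variance; a sketch on synthetic data (the 90% threshold here is just an illustrative choice, not a rule):

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic data where the first dimensions carry most of the variance
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.1, 0.1])
X = X - X.mean(axis=0)

U, s, Wt = np.linalg.svd(X)    # singular values come back sorted
eigvals = s**2 / X.shape[0]    # eigenvalues of the covariance matrix

# cumulative fraction of variance explained by the first L components
frac = np.cumsum(eigvals) / np.sum(eigvals)
L = int(np.searchsorted(frac, 0.90) + 1)   # smallest L reaching 90%
print(L)
```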

We will call $W$ truncated to the first $L$ eigenvectors the matrix $V$. So we reduce $W$ to the matrix $V$, which is of size $D$ by $L$.

$\textbf{5}$: After truncating to the first $L$ eigenvectors, perform the projection as follows:

$\mathbf{S}=\mathbf{XV}$. $\mathbf{X}$ is of size $M$ by $D$. $\mathbf{V}$ is $D$ by $L$, thus $\mathbf{S}$ is size $M$ by $L$.
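A sketch of the truncation and projection steps on made-up data:

```python
import numpy as np

rng = np.random.default_rng(4)
M, D, L = 30, 8, 2
X = rng.normal(size=(M, D))
X = X - X.mean(axis=0)

U, s, Wt = np.linalg.svd(X)
W = Wt.T           # columns of W are the eigenvectors
V = W[:, :L]       # keep the first L eigenvectors: D x L

S = X @ V          # scores: M x L
print(S.shape)     # (30, 2)
```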

$\textbf{6}$: Reconstruct the data. $\mathbf{s}_i$ is the $i$th column of $\mathbf{S}$ and has length $M$.

\begin{align} \hat{\mathbf{X}} = \sum_{i=1}^{L} \mathbf{s}_i \mathbf{v}_{i}^\top \end{align}

Add the column means back to $\hat{\mathbf{X}}$ to get $\hat{\mathbf{Y}}$.
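Putting the projection and reconstruction steps together on made-up data (with a sanity check that keeping all $D$ components reconstructs the data exactly):

```python
import numpy as np

rng = np.random.default_rng(5)
M, D, L = 30, 8, 3
Y = rng.normal(size=(M, D))
col_means = Y.mean(axis=0)
X = Y - col_means

U, s, Wt = np.linalg.svd(X)
V = Wt.T[:, :L]             # first L eigenvectors
S = X @ V                   # scores

X_hat = S @ V.T             # rank-L reconstruction = sum_i s_i v_i^T
Y_hat = X_hat + col_means   # add the column means back

# sanity check: keeping all D components reconstructs Y exactly
V_full = Wt.T
Y_exact = (X @ V_full) @ V_full.T + col_means
print(np.allclose(Y_exact, Y))  # True
```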

The above are the essential steps that can help with the homework (copied mostly straight from the notes, with some notation changes). Below, we derive another way of thinking about reconstruction, as linear combinations of your eigenvectors, which will lead up to viewing PCA as minimizing the reconstruction error.

Note that for a single $\mathbf{s}_i \mathbf{v}_{i}^\top$ term from the summation, if we carry out the matrix multiplication we have the following $M$ by $D$ matrix (note that $s_{ij}$ corresponds to the $j$th entry of the vector $\mathbf{s}_i$): $$\begin{bmatrix} s_{i1}v_{i1} & s_{i1}v_{i2} & \dots \\ s_{i2}v_{i1} & s_{i2}v_{i2} & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}$$

If we take the transpose, we have (now a $D$ by $M$ matrix):

$$\begin{bmatrix} s_{i1}v_{i1} & s_{i2}v_{i1} & \dots \\ s_{i1}v_{i2} & s_{i2}v_{i2} & \dots \\ \vdots & \vdots & \ddots \end{bmatrix}$$

This is equivalent to:
$$\begin{bmatrix} s_{i1}\mathbf{v}_{i} & s_{i2}\mathbf{v}_{i} & \dots & s_{iM}\mathbf{v}_{i} \end{bmatrix}$$

$\mathbf{v}_{i}$ is a vector. Each column represents component $i$'s contribution to the reconstruction of one observation, $x_j$ (so $x_j$ is the $j$th column of $X^\top$).

Noting what we showed above, and taking the matrix $\mathbf{S}$, another way to think about reconstruction is as follows. Let $s^*_{j}$ be the $j$th column of $\mathbf{S}^\top$. For each $x_j$, $j=1,\dots,M$, we can reconstruct it as $\hat{x}_j$ as follows: $$\hat{x}_j=\mathbf{V}s^*_{j}=\mathbf{v}_1s^*_{j1}+\mathbf{v}_2s^*_{j2}+\dots+\mathbf{v}_Ls^*_{jL}$$

In this manner, we reconstruct each individual observation $x_j$ as a linear combination of the eigenvectors, weighted by the corresponding score for each eigenvector. In this reconstruction, the information lost is contained in the remaining eigenvectors not used (i.e., eigenvectors $L+1$ onwards).
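We can check this equivalence numerically on made-up data: reconstructing a single observation as a weighted sum of eigenvectors matches the corresponding row of $\mathbf{S}\mathbf{V}^\top$:

```python
import numpy as np

rng = np.random.default_rng(6)
M, D, L = 20, 6, 2
X = rng.normal(size=(M, D))
X = X - X.mean(axis=0)

U, s, Wt = np.linalg.svd(X)
V = Wt.T[:, :L]
S = X @ V

j = 3
s_star_j = S.T[:, j]    # j-th column of S^T: the scores for observation j

# reconstruct observation j as a weighted sum of the eigenvectors
x_hat_j = sum(V[:, i] * s_star_j[i] for i in range(L))

# this matches the j-th row of the full reconstruction S V^T
print(np.allclose(x_hat_j, (S @ V.T)[j]))  # True
```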

### PCA as minimizing reconstruction error

With this knowledge in mind, let's now look at the following:

Let $X$ be a mean-centered $M$ by $D$ matrix with $M$ observations, each observation having $D$ dimensions. Let $x_j$ be a column vector from $X^\top$ (so $x_j$ is a single observation of size $D$). Let $V$ be of dimension $D$ by $L$ ($L\leq D$). Mathematically, PCA minimizes the following reconstruction error:

$$J(V,S)=\frac{1}{M}\sum_{j=1}^M ||x_j-\hat{x}_j||^2$$

where $\hat{x}_j=\mathbf{V}s^{*}_{j}$, $s^{*}_j$ is the $j$th column of $\mathbf{S}^\top$, and $\mathbf{S}=\mathbf{XV}$. The optimal solution $\hat{V}$ is the matrix whose columns are the $L$ eigenvectors corresponding to the $L$ largest eigenvalues of the empirical covariance matrix $A=\frac{1}{M}\sum_{j=1}^{M}x_jx_j^\top$. See Kevin Murphy's book Machine Learning: A Probabilistic Perspective for a proof.

Note again, that we have the following:

$$\hat{x_j}=\mathbf{V}s^*_{j}=\mathbf{v}_1s^*_{j1}+\mathbf{v}_2s^*_{j2}+...+\mathbf{v}_Ls^*_{jL}$$

where $\mathbf{v}_i$ is the $i$th eigenvector, so this is just each eigenvector multiplied by its corresponding score. This can be thought of as the best representation of the data as a linear combination of $L$ eigenvectors.
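As an illustration (not a proof), we can compare the reconstruction error of the top-$L$ eigenvectors against a random orthonormal basis of the same size on made-up data; the eigenvectors should always do at least as well:

```python
import numpy as np

rng = np.random.default_rng(7)
M, D, L = 100, 6, 2
X = rng.normal(size=(M, D)) * np.array([3.0, 2.0, 1.0, 0.5, 0.5, 0.5])
X = X - X.mean(axis=0)

def J(V):
    """Mean squared reconstruction error (1/M) sum_j ||x_j - V V^T x_j||^2."""
    X_hat = (X @ V) @ V.T
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

# PCA's choice: the top-L eigenvectors from the SVD
V_pca = np.linalg.svd(X)[2].T[:, :L]

# baseline: a random orthonormal D x L basis (via QR)
V_rand = np.linalg.qr(rng.normal(size=(D, L)))[0]

print(J(V_pca) <= J(V_rand))  # True: the eigenvectors minimize J
```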

### PCA Visual Examples

Now, we will try to understand PCA better through some examples/extra visualizations.

### Note also for this part that I use a different PCA command than you will use for your homework. I cheat by just using a straightforward PCA function, but you will need to code PCA using np.linalg.svd.

We can understand PCA by considering the following visualization. What is PCA doing? It is "choosing the best axes" to represent our data in a lower dimension.

In [16]:
#this image was taken from
#https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
Image(url="Q7HIP.gif")

Out[16]:

The red lines are the projections of each point onto the black line. If the original 2D data were reconstructed using one principal component, it would lie along the rotating black line. Here, we see that the reconstruction error is minimized when the black line aligns with the purple line. Note also that when the black line aligns with the purple line, the variance (or "spread" of the projected points along the black line) is maximized.

In the discussions above, we only discussed the eigenvectors, so what are the eigenvalues useful for? They can be used to capture the dimensions with the most variance. Let's play with a very simple simulation to see the relation of the eigenvalues to the actual data. Change the variances (sigma_y and sigma_x) to larger or smaller values to see how the eigenvalues change.

In [5]:
#parts of code was adapted from the
#python datascience  handbook:
#https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
#note also I'm using a different PCA function than you will use for your homework

%matplotlib inline
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from IPython.display import HTML, Image

num_data=1000
all_data=np.zeros((num_data, 2))

mu=0
sigma_x=1
all_data[:,0]=np.random.normal(mu, sigma_x, num_data)

sigma_y=0.1
all_data[:,1]=np.random.normal(mu, sigma_y, num_data)

plt.scatter(all_data[:,0], all_data[:,1], alpha=0.5)
plt.xlim(-5,5)
plt.ylim(-5,5)

pca = PCA(2)  # fit PCA with 2 components
pca.fit(all_data)
eigen_values=pca.explained_variance_

fig, ax = plt.subplots()
ax.plot(range(2), eigen_values, 'o')
ax.set_xlabel('index of eigenvalue')
ax.set_ylabel('eigenvalue')

ax.set_xlim([-1, 10])
ax.set_ylim([-1, 2])

Out[5]:
(-1, 2)

### Handwritten digits example

Here, we will look at an example using handwritten digits from 0 to 9. There are a total of 1797 handwritten digits, each with a dimension of 64; each of the 64 dimensions represents a pixel value. Let's take a preliminary look at the data.

In [49]:
#parts of code was adapted from the python datascience  handbook:
#https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html

from sklearn.datasets import load_digits
digits = load_digits()

#taking a peek at the data (the first 10 digits)
print(digits.data.shape)

digits.data[0:10,:]

(1797, 64)

Out[49]:
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.],
[  0.,   0.,   0.,  12.,  13.,   5.,   0.,   0.,   0.,   0.,   0.,
11.,  16.,   9.,   0.,   0.,   0.,   0.,   3.,  15.,  16.,   6.,
0.,   0.,   0.,   7.,  15.,  16.,  16.,   2.,   0.,   0.,   0.,
0.,   1.,  16.,  16.,   3.,   0.,   0.,   0.,   0.,   1.,  16.,
16.,   6.,   0.,   0.,   0.,   0.,   1.,  16.,  16.,   6.,   0.,
0.,   0.,   0.,   0.,  11.,  16.,  10.,   0.,   0.],
[  0.,   0.,   0.,   4.,  15.,  12.,   0.,   0.,   0.,   0.,   3.,
16.,  15.,  14.,   0.,   0.,   0.,   0.,   8.,  13.,   8.,  16.,
0.,   0.,   0.,   0.,   1.,   6.,  15.,  11.,   0.,   0.,   0.,
1.,   8.,  13.,  15.,   1.,   0.,   0.,   0.,   9.,  16.,  16.,
5.,   0.,   0.,   0.,   0.,   3.,  13.,  16.,  16.,  11.,   5.,
0.,   0.,   0.,   0.,   3.,  11.,  16.,   9.,   0.],
[  0.,   0.,   7.,  15.,  13.,   1.,   0.,   0.,   0.,   8.,  13.,
6.,  15.,   4.,   0.,   0.,   0.,   2.,   1.,  13.,  13.,   0.,
0.,   0.,   0.,   0.,   2.,  15.,  11.,   1.,   0.,   0.,   0.,
0.,   0.,   1.,  12.,  12.,   1.,   0.,   0.,   0.,   0.,   0.,
1.,  10.,   8.,   0.,   0.,   0.,   8.,   4.,   5.,  14.,   9.,
0.,   0.,   0.,   7.,  13.,  13.,   9.,   0.,   0.],
[  0.,   0.,   0.,   1.,  11.,   0.,   0.,   0.,   0.,   0.,   0.,
7.,   8.,   0.,   0.,   0.,   0.,   0.,   1.,  13.,   6.,   2.,
2.,   0.,   0.,   0.,   7.,  15.,   0.,   9.,   8.,   0.,   0.,
5.,  16.,  10.,   0.,  16.,   6.,   0.,   0.,   4.,  15.,  16.,
13.,  16.,   1.,   0.,   0.,   0.,   0.,   3.,  15.,  10.,   0.,
0.,   0.,   0.,   0.,   2.,  16.,   4.,   0.,   0.],
[  0.,   0.,  12.,  10.,   0.,   0.,   0.,   0.,   0.,   0.,  14.,
16.,  16.,  14.,   0.,   0.,   0.,   0.,  13.,  16.,  15.,  10.,
1.,   0.,   0.,   0.,  11.,  16.,  16.,   7.,   0.,   0.,   0.,
0.,   0.,   4.,   7.,  16.,   7.,   0.,   0.,   0.,   0.,   0.,
4.,  16.,   9.,   0.,   0.,   0.,   5.,   4.,  12.,  16.,   4.,
0.,   0.,   0.,   9.,  16.,  16.,  10.,   0.,   0.],
[  0.,   0.,   0.,  12.,  13.,   0.,   0.,   0.,   0.,   0.,   5.,
16.,   8.,   0.,   0.,   0.,   0.,   0.,  13.,  16.,   3.,   0.,
0.,   0.,   0.,   0.,  14.,  13.,   0.,   0.,   0.,   0.,   0.,
0.,  15.,  12.,   7.,   2.,   0.,   0.,   0.,   0.,  13.,  16.,
13.,  16.,   3.,   0.,   0.,   0.,   7.,  16.,  11.,  15.,   8.,
0.,   0.,   0.,   1.,   9.,  15.,  11.,   3.,   0.],
[  0.,   0.,   7.,   8.,  13.,  16.,  15.,   1.,   0.,   0.,   7.,
7.,   4.,  11.,  12.,   0.,   0.,   0.,   0.,   0.,   8.,  13.,
1.,   0.,   0.,   4.,   8.,   8.,  15.,  15.,   6.,   0.,   0.,
2.,  11.,  15.,  15.,   4.,   0.,   0.,   0.,   0.,   0.,  16.,
5.,   0.,   0.,   0.,   0.,   0.,   9.,  15.,   1.,   0.,   0.,
0.,   0.,   0.,  13.,   5.,   0.,   0.,   0.,   0.],
[  0.,   0.,   9.,  14.,   8.,   1.,   0.,   0.,   0.,   0.,  12.,
14.,  14.,  12.,   0.,   0.,   0.,   0.,   9.,  10.,   0.,  15.,
4.,   0.,   0.,   0.,   3.,  16.,  12.,  14.,   2.,   0.,   0.,
0.,   4.,  16.,  16.,   2.,   0.,   0.,   0.,   3.,  16.,   8.,
10.,  13.,   2.,   0.,   0.,   1.,  15.,   1.,   3.,  16.,   8.,
0.,   0.,   0.,  11.,  16.,  15.,  11.,   1.,   0.],
[  0.,   0.,  11.,  12.,   0.,   0.,   0.,   0.,   0.,   2.,  16.,
16.,  16.,  13.,   0.,   0.,   0.,   3.,  16.,  12.,  10.,  14.,
0.,   0.,   0.,   1.,  16.,   1.,  12.,  15.,   0.,   0.,   0.,
0.,  13.,  16.,   9.,  15.,   2.,   0.,   0.,   0.,   0.,   3.,
0.,   9.,  11.,   0.,   0.,   0.,   0.,   0.,   9.,  15.,   4.,
0.,   0.,   0.,   9.,  12.,  13.,   3.,   0.,   0.]])

Each row is an image with 64 pixels; the values of the 64 pixels are in the columns. Let's take a look at 40 of the digits.

In [2]:
def plot_digits(data):
    fig, axes = plt.subplots(4, 10, figsize=(10, 4),
                             subplot_kw={'xticks':[], 'yticks':[]},
                             gridspec_kw=dict(hspace=0.1, wspace=0.1))
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].reshape(8, 8),
                  cmap='binary', interpolation='nearest',
                  clim=(0, 16))

plot_digits(digits.data)


Now, let's see what it looks like when we project from 64 dimensions to 2 dimensions using PCA.

In [3]:
pca = PCA(2)  # project from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
print(digits.data.shape)
print(projected.shape)
plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();

(1797, 64)
(1797, 2)


Above, we can kind of see how the digits are captured by the 2 components. Now, let's try to visualize an example of reconstructing the data using the first 8 components.

In [11]:
def plot_pca_components(x, coefficients=None, mean=0, components=None,
                        imshape=(8, 8), n_components=8, fontsize=12,
                        show_mean=True):
    if coefficients is None:
        coefficients = x

    if components is None:
        components = np.eye(len(coefficients), len(x))

    mean = np.zeros_like(x) + mean

    fig = plt.figure(figsize=(1.2 * (5 + n_components), 1.2 * 2))
    g = plt.GridSpec(2, 4 + bool(show_mean) + n_components, hspace=0.3)

    def show(i, j, x, title=None):
        ax = fig.add_subplot(g[i, j], xticks=[], yticks=[])
        ax.imshow(x.reshape(imshape), interpolation='nearest')
        if title:
            ax.set_title(title, fontsize=fontsize)

    show(slice(2), slice(2), x, "True")

    approx = mean.copy()

    counter = 2
    if show_mean:
        show(0, 2, np.zeros_like(x) + mean, r'$\mu$')
        show(1, 2, approx, r'$1 \cdot \mu$')
        counter += 1

    for i in range(n_components):
        approx = approx + coefficients[i] * components[i]
        show(0, i + counter, components[i], r'$w_{0}$'.format(i + 1))
        show(1, i + counter, approx,
             r"${0:.2f} \cdot w_{1}$".format(coefficients[i], i + 1))
        if show_mean or i > 0:
            plt.gca().text(0, 1.05, '$+$', ha='right', va='bottom',
                           transform=plt.gca().transAxes, fontsize=fontsize)

    show(slice(2), slice(-2, None), approx, "Approx")
    return fig

pca = PCA(n_components=8)
Xproj = pca.fit_transform(digits.data)
sns.set_style('white')
fig = plot_pca_components(digits.data[10], Xproj[10],
                          pca.mean_, pca.components_)


The panel on the left is the true image. The panel on the right is the image reconstructed using only the first eight principal components. Notice that the mean, $\mu$, is added first: since we subtracted the column means from the original data matrix, we need to add them back to reconstruct the data. The first row represents the eigenvectors; $w_i$ represents the $i$th column of the eigenvector matrix $W$. The middle row shows the running sum as each eigenvector, multiplied by its score, is included. Note that this form of reconstruction is analogous to our earlier discussion of viewing each reconstructed data point as a linear combination of the eigenvectors.

Another way that PCA can be used is to remove noise. Let's add random noise to the handwritten digits data, keep enough principal components to explain 50% of the variance, then reconstruct the data.

In [17]:
np.random.seed(42)
#add random noise from a normal
noisy = np.random.normal(digits.data, 4)
plot_digits(noisy)

In [18]:
pca = PCA(0.50).fit(noisy)
pca.n_components_

Out[18]:
12

Reconstruction

In [19]:
components = pca.transform(noisy)
filtered = pca.inverse_transform(components)
plot_digits(filtered)


We see that after the data reconstruction, it looks as if the noise has been removed.

### Connection to NMF

For PCA's relation to NMF, I would encourage you to reread Lee and Seung's paper:

https://www.nature.com/articles/44565

In short, both PCA and NMF construct a $W$ matrix. In PCA, this is known as your eigenvector matrix. In both cases, each reconstructed point is a linear combination of the columns of your $W$ matrix. The difference is that in PCA the columns of $W$ are required to be orthogonal to each other, whereas in NMF they don't need to be. Also, in NMF, $W$ is required to have only nonnegative entries, whereas in PCA the eigenvectors can have negative entries.
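A small sketch contrasting the two (using scikit-learn's PCA and NMF on made-up nonnegative data; note that in scikit-learn the components are stored as the rows of components_):

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

rng = np.random.default_rng(8)
data = rng.random((100, 10))   # nonnegative data, as NMF requires

pca = PCA(n_components=3).fit(data)
nmf = NMF(n_components=3, init='random', random_state=0,
          max_iter=1000).fit(data)

# PCA components are orthonormal, and may contain negative entries
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(3)))  # True

# NMF factors are nonnegative, and generally not orthogonal
print((nmf.components_ >= 0).all())  # True
```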