## homework 11: 2001, a space problem

After you presented your work from several weeks ago on Lestrade’s K-means clustering at lab meeting, Watson arrives at your desk with a problem of his own. He’s been working with a data set from the sand mouse retina. Under the microscope, he sees, clear as day, 8 morphologically distinct cell types. But in his RNA-seq data, try as he may, he can’t seem to find any intelligible differences in their gene expression patterns. He’s convinced that the differences must lie in some of the 2001 neuron-specific sand mouse genes, but he’s been unable to identify a meaningful clustering by K-means or other methods. He implores you to take a look at his counts data, which is based on 200 single cell RNA-seq experiments and in tidy format, with each gene as a column and each observation as a row. In the figure below, he’s showing a heat map in which he’s drawn horizontal white lines to separate his k=8 K-means clusters.

You have a hard time making heads or tails of Watson’s K-means heat map, which seems like a mess, and doesn’t seem like the best way to plot clusters based on high-dimensional data. You’re most familiar with the 2D scatter plots Lestrade used in his notebook, where it was easy to visualize how cells were clustered. If you can visualize the data similarly, you’ll have an easier time homing in on Watson’s problem. But you’ll have to figure out a way to project Watsons’s 2001-D data set down into 2-D.

### 1. reproduce Watsons’s K-means result

Modify the K-means clustering procedure you wrote for Lestrade’s data so that it works in 2001 dimensions, not just 2. Run a reasonable number of iterations (20-100, or even better, test for convergence), starting from several different initializations; report the lowest total distance (best clustering) you find for K=8. It should be close to what Watson found using k = 8 clusters: 4159.9 (sum of the squared distance = 87161.9). Note that he was using the “fixed” K-means method from part 3 of HW6, with his data represented as log counts.

### 2. reduce the dimensionality

Write a Python function that uses singular value decomposition to find the principal components of the data set.

Plot all 200 cells in 2D expression space using their projections onto the first two principal axes.

Was Watson right to expect 8 clusters? Plot the eigenvalues for each component, and justify why you’re pretty sure it would be hard to find any other clusters in the data set. The eigenvalues from a simulated negative control data set, where there were no cell types and no correlations between any of the genes, should factor into your answer.

Based on the eigenvector loadings, how many genes appear to influence cell type identity?

### 3. check the K

Plot the data in 2D principal component space, and color each point according to the cluster identities from part 1. You should find the K-means is really missing the mark.

Offer an explanation of what might be going wrong, and find a way to cluster so that each cell appears properly assigned in PC space.

### 4. reconstruct the expression patterns

Reconstruct the original data set using only the projected data and eigenvectors for the first 2 principal components. Visualize the data using a heat map. Do the clusters now look more obvious? Why or why not?

Email a Jupyter notebook page (a .ipynb file) to mcb112.teachingfellow@gmail.com. Please name your file <LastName><FirstName>_MCB112_<psetnumber>.ipynb; for example, mine would be EddySean_MCB112_11.ipynb.

### notes

Remember that when using SVD, you want to center your data first by subtracting the mean of each column (variables) from the data. Without centering before SVD, the first principal component is merely proportional to the mean of the data over rows, and other components may be less interpretable as well. When using standard eigendecomposition, the covariance matrix, by virtue of its definition, will look the same on centered and non-centered data.

It’s sometimes recommended to standardize, or whiten, your data by dividing each column by its standard deviation. This process ensures that variables with large intrinsic variance do not dominate components, which is essential when performing PCA with variables that do not share units. In other situations, however, the recommendation is less clear-cut, as differences in intrinsic variance between variables might be important features of the data. Indeed, in Watson’s case, data whitening results in the loss of a few otherwise distinguishable clusters in PC space. Nevertherless, there is some risk in performing PCA on raw, non-standardized data, as significant covariations could be relegated to components with small eigenvalues (and thus ignored).

### hints

You might find that when using K-means in 2001 dimensional space, K-means sometimes creates an orphan cluster, a cluster that doesn’t have any points assigned to it. If you encounter this, discard that K-means iteration and try a new initialization of random centroids.

It can be tricky to keep the dimensions of your data, eigenvectors, and principal component “scores” straight. Using standard eigen decomposition with a $M \times N$ data matrix, you should end up with $N$ eigenvectors of length $N$. When using singular value decomposition, however, the eigenvector matrix ends up truncated by the number of observations (when $% $), so you should end up with $M$ eigenvectors of length $N$. The principal component “scores” are the projections of each observation onto an individual principal component axis. So for a $M \times N$ data matrix, you should wind up with a $M \times M$ matrix of scores when using singular value decomposition.

The eigenvalues and eigenvectors might not come out sorted from your SVD. To properly rank principal components as 1, 2, …, p, you’ll have to make sure to sort the eigenvalues in descending order (together with their corresponding eigenvectors).