MCB112: Biological Data Analysis (Fall 2018)


homework 06: a mixture of five

One of the many graduate students in the Holmes group, Wiggins, left the lab suddenly (in protest of his stipend of one shilling a day), and left you a mess of his work in progress. He isn't responding to your emails, so you're doing some detective work of your own.

Some of Wiggins' last experiments were single-cell RNA-seqs on differentiated sand mouse embryonic stem cells. You've found records from an experiment in which he was looking at two key early transcription factor genes called Caraway and Kiwi. Previous work had shown that Caraway and Kiwi are expressed at intermediate levels in ES cells, but upon differentiation, their mRNA expression patterns break into four different cell types with all four possible combinations of low vs. high expression of these two TFs.

Wiggins' single cell RNA-seq dataset

You've found a data file where Wiggins had collected mapped read counts for Caraway and Kiwi in 1000 single differentiated ES cells. Because he expected five "cell types" in the data, he used K-means clustering, with K=5, to try to assign each cell to one of the five cell types, and thus estimate the mean expression level (in mapped counts) of the two genes in each cell type, and the relative proportions of the cell types in the population.

However, it's obvious from a figure you found taped into Wiggins' notebook, not to mention the various profanities written therein, that the K-means clustering did not go as hoped. His visualization of his data does show five clear clusters, but his K-means clustering failed to identify them (Figure 1).


Figure 1: Wiggins' figure from his notebook, visualizing his K-means clustering of the 1000 cells in his single cell RNA-seq experiment, for K=5. Black stars indicate the five fitted centroids from his best K-means clustering, and colors indicate the assignments of the 1000 cells to the 5 clusters.

You can see in his notebook that he understood that K-means is prone to local minima, so it's not as if this is a one-off bad solution. His notes indicate that he selected the best of 20 solutions, each started from different random initial conditions. You find the following data table, along with a note that this solution had the best (lowest) "final totdist = 501412.6" among the 20 solutions, whose "totdist" values ranged from 501412.6 to 514384.1.

cluster   fraction   mean Caraway counts   mean Kiwi counts
0         0.1280      944.7                1796.4
1         0.0550     3548.9                2336.8
2         0.0990     2460.5                 263.3
3         0.6670      252.3                 223.2
4         0.0510      726.2                4141.7

Interestingly, it looks like all Wiggins was trying to do was to get K-means clustering to work. He must have also had the ES cells marked with reporter constructs that unambiguously labeled each of their cell types, because his data file includes a column for the true cell type (0-4), so the true clustering is known in these data (Figure 2).


Figure 2: Another figure from Wiggins' notebook, showing the true clustering of the 1000 cells.

1. reproduce Wiggins' K-means result

Write a standard K-means clustering procedure. Use it to cluster Wiggins' data into K=5 clusters. Plot the results, similar to his figure. You should be able to reproduce a similarly bad result.

You'll want to run the K-means algorithm multiple times and choose the best. What is a good statistic for choosing the "best" solution for K-means? You should be able to reproduce Wiggins' "totdist" measure.
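As a starting point, here is a minimal sketch of a standard K-means run plus a best-of-several wrapper. The function names `kmeans` and `best_kmeans` are hypothetical, and the initialization (K randomly chosen data points) is just one common choice. The assignment does not define Wiggins' "totdist"; this sketch assumes it is the total Euclidean distance from each point to its assigned centroid, so treat that as a guess to check against his numbers.

```python
import numpy as np

def kmeans(X, K, rng, max_iter=100):
    """One K-means run. Returns (centroids, assignments, totdist)."""
    N = X.shape[0]
    # Assumption: start from K distinct, randomly chosen data points.
    centroids = X[rng.choice(N, size=K, replace=False)].astype(float)
    assign = np.full(N, -1)
    for _ in range(max_iter):
        # Assignment step: squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # converged: no point changed cluster
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                centroids[k] = members.mean(axis=0)
    # Recompute distances against the final centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    # Assumed definition of "totdist": sum of Euclidean (not squared) distances.
    totdist = np.sqrt(d2[np.arange(N), assign]).sum()
    return centroids, assign, totdist

def best_kmeans(X, K, n_runs=20, seed=0):
    """Run K-means n_runs times and keep the run with the smallest totdist."""
    rng = np.random.default_rng(seed)
    runs = [kmeans(X, K, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[2])
```

Here `X` would be an array of shape (1000, 2) holding the Caraway and Kiwi counts for each cell, and `best_kmeans(X, K=5)` mirrors Wiggins' best-of-20 procedure.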

Why is K-means clustering producing this result, when there are clearly five distinct clusters in the data?

2. mixture negative binomial fitting

Now you're going to use what you've learned about mixture models, and about the negative binomial distribution for RNA-seq data.

Write an expectation maximization algorithm to fit a mixture negative binomial distribution to Wiggins' data, for Q=5 components in the mixture.

Assume there is a common dispersion \(\phi = 0.3\). This means that all you need to re-estimate in the EM algorithm are the means \(\mu\) and mixture coefficients \(\pi\) for each mixture component.

Like K-means, EM is a local optimizer, so you will want to run your EM algorithm from multiple initial conditions, and take the best one. What is an appropriate statistic for choosing the "best" fit?
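The EM loop can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference solution: it assumes the negative binomial is parameterized with mean \(\mu\) and variance \(\mu + \phi\mu^2\) (so that SciPy's `nbinom` takes `n = 1/phi` and `p = 1/(1 + mu*phi)`), it treats the two genes as independent within a component, and it uses the total log likelihood to compare runs. The names `nb_logpmf`, `em_nb_mixture`, and `best_em` are hypothetical.

```python
import numpy as np
from scipy.stats import nbinom
from scipy.special import logsumexp

def nb_logpmf(x, mu, phi):
    """log P(x | mu, phi), assuming mean mu and variance mu + phi * mu**2."""
    n = 1.0 / phi
    p = 1.0 / (1.0 + mu * phi)
    return nbinom.logpmf(x, n, p)

def em_nb_mixture(X, Q, phi, rng, max_iter=200, tol=1e-6):
    """Fit a Q-component NB mixture to counts X of shape (N, G).

    Returns (mu, pi, ll): means (Q, G), mixture coefficients (Q,), and the
    log likelihood from the last E-step (one M-step behind the returned
    parameters).
    """
    N, G = X.shape
    # Assumption: initialize means at random data points (+1 keeps them positive).
    mu = X[rng.choice(N, size=Q, replace=False)].astype(float) + 1.0
    pi = np.full(Q, 1.0 / Q)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: log joint probability of each cell with each component,
        # summing log probabilities over genes (independence assumption).
        logp = np.log(pi)[None, :] + nb_logpmf(X[:, None, :], mu[None, :, :], phi).sum(axis=2)
        log_norm = logsumexp(logp, axis=1)
        ll = log_norm.sum()
        post = np.exp(logp - log_norm[:, None])  # posterior weights, shape (N, Q)
        # M-step: with phi fixed, the mean update is a posterior-weighted average.
        weight = post.sum(axis=0) + 1e-12  # tiny floor avoids division by zero
        pi = weight / weight.sum()
        mu = np.maximum((post.T @ X) / weight[:, None], 1e-6)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, pi, ll

def best_em(X, Q, phi, n_runs=10, seed=0):
    """Run EM from several random starts and keep the highest log likelihood."""
    rng = np.random.default_rng(seed)
    fits = [em_nb_mixture(X, Q, phi, rng) for _ in range(n_runs)]
    return max(fits, key=lambda fit: fit[2])
```

For Wiggins' data the call would look like `best_em(X, Q=5, phi=0.3)`. Working in log space with `logsumexp` is a design choice to avoid underflow, since the probability of any single pair of counts under a component can be extremely small.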

What are the estimated mean expression levels of Caraway and Kiwi in the five cell types, and the relative proportions of each cell type in the 1000 cells?

Visualize your result in a plot similar to Wiggins'.

3. find a simple fix for K-means

Suggest a simple fix for the problem in applying a standard K-means clustering algorithm to Wiggins' single cell RNA-seq data. Implement the fix, re-run the K-means clustering, pick a "best" solution; report and visualize it.
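One candidate fix you might consider (the assignment does not say which fix is intended, so this is an assumption) is to cluster log-transformed counts, since the spread of count data grows with its mean. The sketch below only shows the transform and its inverse, using hypothetical helper names `to_log_space` and `centroids_to_counts` and made-up toy numbers; the clustering itself would be your own K-means run on the transformed data.

```python
import numpy as np

def to_log_space(X, pseudocount=1.0):
    """Log-transform counts; the pseudocount guards against log(0)."""
    return np.log(np.asarray(X, dtype=float) + pseudocount)

def centroids_to_counts(log_centroids, pseudocount=1.0):
    """Map centroids fitted in log space back to the count scale.

    Note: a mean taken in log space maps back to a geometric mean,
    which is generally smaller than the arithmetic mean of the counts.
    """
    return np.exp(log_centroids) - pseudocount

# Toy illustration (made-up numbers): two groups with the same relative
# spread but tenfold different means.
low = np.array([100.0, 200.0, 400.0])
high = np.array([1000.0, 2000.0, 4000.0])
raw_spread = (low.std(), high.std())            # tenfold different
log_spread = (to_log_space(low, 0.0).std(),     # identical in log space
              to_log_space(high, 0.0).std())
```

In raw counts the high-expression group is ten times more spread out than the low-expression group, while in log space the two groups have the same spread, which is closer to what a distance-based method like K-means implicitly expects.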

turning in your work

Email a Jupyter notebook page (a .ipynb file) to mcb112.teachingfellow@gmail.com. Please name your file <LastName><FirstName>_MCB112_<psetnumber>.ipynb; for example, mine would be EddySean_MCB112_06.ipynb.


hints

take-home lessons