MCB112: Biological Data Analysis (Fall 2018)


homework 10:

the adventure of the moonlighting genes

Previous work from Irene Adler's laboratory has established that there seem to be a small number of gene batteries (modules of co-expressed genes) that are mixed and matched at different levels to specify the basic morphological properties of different sand mouse neuron cell types. These batteries involve about 100 different genes that her laboratory has identified. How many modules there are, and exactly which genes belong to which module, remain unknown.

Adler believes these 100 genes comprise three to six co-expressed gene batteries. She also believes that the batteries may share a few genes, and this overlap -- where the same gene is playing different functions in different contexts -- will be biologically informative.

non-negative matrix factorization & sand mouse neural cell types

The lab has collected RNA-seq data (as mapped read counts) for 60 different purified neuronal cell types. These data, as a simple whitespace-delimited table, are available here.

She's just read two papers, [Kim and Tidor 2003] and [Brunet et al 2004], that suggest that non-negative matrix factorization (NMF) is capable of identifying gene batteries, including shared genes between batteries.

You're visiting the Adler lab for a few weeks, sent by your PI Holmes in the hope of establishing a collaborative relationship between the groups. Adler is not yet sure what to make of you, or of Holmes. She asks you if you can delve into the 1999 Lee and Seung Nature paper that popularized NMF and introduced an elegant algorithm for solving it. You say sure, you've taken MCB112; how hard could it be?

You set out to study the Lee and Seung paper, understand the derivation of their algorithm, implement NMF, and understand how it works -- and then, to analyze the Adler lab's data.

1. write a script that simulates positive control data

Using the generative model assumed by NMF, write a script that simulates synthetic data for N genes and M experiments from R underlying gene batteries.
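One way to set this up is sketched below. It assumes the Poisson reading of the model (observed counts V are Poisson-distributed around the product WH), which is how the Lee and Seung objective is usually interpreted. The function name `simulate_counts`, its defaults, the round-robin assignment of genes to batteries, and the choice of gamma and Dirichlet distributions are all illustrative assumptions, not something the assignment specifies. It needs R >= 2 and N >= R.

```python
import numpy as np

def simulate_counts(N=100, M=60, R=4, n_shared=5, mean_total=100_000, seed=0):
    """Simulate an N x M count matrix V ~ Poisson(W @ H). Hypothetical sketch."""
    rng = np.random.default_rng(seed)

    # Gene i belongs to battery i % R, so every battery is non-empty.
    member = np.zeros((N, R), dtype=bool)
    member[np.arange(N), np.arange(N) % R] = True

    # A few "moonlighting" genes also join the next battery over.
    shared = rng.choice(N, size=n_shared, replace=False)
    member[shared, (shared % R + 1) % R] = True

    # W[i, a]: relative expression of gene i within battery a; columns sum to 1.
    W = member * rng.gamma(shape=2.0, scale=1.0, size=(N, R))
    W /= W.sum(axis=0, keepdims=True)

    # H[a, mu]: expected reads contributed by battery a in experiment mu.
    mix = rng.dirichlet(np.ones(R), size=M).T      # shape (R, M), columns sum to 1
    H = mean_total * mix

    # Observed counts are Poisson noise around the expected counts W @ H.
    V = rng.poisson(W @ H)
    return V, W, H, np.sort(shared)
```

Returning the true W, H, and the indices of the shared genes alongside V makes it possible to check later whether a fit recovers them.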

2. implement non-negative matrix factorization

Implement NMF, following the description in [Lee and Seung (1999)].
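A minimal sketch of the multiplicative updates, as I read them from the 1999 paper: scale each W entry by the weighted ratio V/(WH), renormalize each column of W to sum to 1, then scale each H entry the same way. The function name `nmf`, the fixed iteration count, the random initialization, and the small `eps` guard against division by zero are my assumptions; check the update rules against the paper before relying on this.

```python
import numpy as np

def nmf(V, R, n_iter=500, seed=0, eps=1e-12):
    """Fit V ~ W @ H with non-negative W (N x R) and H (R x M).

    Returns W, H, and the Poisson log-likelihood up to a constant,
    sum(V * log(WH) - WH). Hypothetical sketch, not a tuned implementation.
    """
    rng = np.random.default_rng(seed)
    V = np.asarray(V, dtype=float)
    N, M = V.shape

    # Random positive start; W columns are normalized to sum to 1.
    W = rng.random((N, R)) + eps
    W /= W.sum(axis=0, keepdims=True)
    H = rng.random((R, M)) + eps

    for _ in range(n_iter):
        # W update: W_ia <- W_ia * sum_mu [V_imu / (WH)_imu] * H_amu
        W *= (V / (W @ H + eps)) @ H.T
        # Renormalize each column of W to sum to 1.
        W /= W.sum(axis=0, keepdims=True)
        # H update: H_amu <- H_amu * sum_i W_ia * [V_imu / (WH)_imu]
        H *= W.T @ (V / (W @ H + eps))

    WH = W @ H
    loglik = np.sum(V * np.log(WH + eps) - WH)
    return W, H, loglik
```

Because the objective has local optima, it is worth running this from several seeds and keeping the fit with the highest log-likelihood.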

Apply it to synthetic datasets that you generate, varying the parameters of your synthetic data. What conclusions can you draw about how well NMF reconstructs the known gene batteries in your synthetic data?
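NMF recovers batteries only up to a reordering of the columns of W, so comparing a fit to the known answer needs a matching step first. The sketch below pairs columns by Pearson correlation and tries every permutation, which is fine for the small R considered here. The function name `match_batteries` and the use of correlation as the similarity score are illustrative choices, and it assumes both matrices have the same number of columns.

```python
import itertools
import numpy as np

def match_batteries(W_true, W_est):
    """Pair each true battery with the estimated battery it best matches.

    Returns (perm, corr): column perm[a] of W_est is paired with column a
    of W_true, and corr[a] is the Pearson correlation of that pair.
    """
    R = W_true.shape[1]
    # cross[a, b]: correlation between true column a and estimated column b.
    cross = np.corrcoef(W_true.T, W_est.T)[:R, R:]
    # Exhaustive search over permutations; R! is small for R <= 6.
    best = max(itertools.permutations(range(R)),
               key=lambda p: sum(cross[a, p[a]] for a in range(R)))
    corr = np.array([cross[a, best[a]] for a in range(R)])
    return list(best), corr
```

Tracking how the matched correlations fall off as you change the noise level, the number of shared genes, or R is one way to summarize how well the reconstruction works.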

3. analyze the Adler data

Apply your NMF analysis to the Adler dataset.
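Before fitting anything, the table has to be read into a matrix. The sketch below guesses at the layout: one gene per row, the gene name in the first field, integer counts in the remaining fields, and any line that does not fit that pattern (such as a header) skipped. That layout is an assumption; look at the actual file first. The function name `read_count_table` and the toy gene names in the demo are hypothetical.

```python
import io
import numpy as np

def read_count_table(handle):
    """Parse a whitespace-delimited count table into (gene_names, V).

    ASSUMED layout: gene name, then integer read counts, one gene per line.
    Lines that do not match (for example a header line) are skipped.
    """
    genes, rows = [], []
    for line in handle:
        fields = line.split()
        if len(fields) < 2 or not all(f.isdigit() for f in fields[1:]):
            continue
        genes.append(fields[0])
        rows.append([int(f) for f in fields[1:]])
    return genes, np.array(rows)

# Toy demonstration on an in-memory table (not the Adler data).
demo = io.StringIO("gene c1 c2\nanise 10 0\nkiwi 3 7\n")
genes, V = read_count_table(demo)
```

For the real data you would pass an open file handle instead of the `io.StringIO` object, then confirm that `V.shape` matches the number of genes and cell types you expect.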

turning in your work

Email a Jupyter notebook page (a .ipynb file) to mcb112.teachingfellow@gmail.com. Please name your file <LastName><FirstName>_MCB112_<psetnumber>.ipynb; for example, mine would be EddySean_MCB112_10.ipynb.


hints