MCB112: Biological Data Analysis (Fall 2017)

homework 04:

a plague of sand mice

A strange new disease is sweeping through the sand mouse population. Your lab is trying to obtain genomic and transcriptomic sequence of the pathogen, by sequencing RNA from infected mice. Pathogen sequences make up a small fraction of the RNA-seq reads you obtain (1%), mixed with a large majority of sand mouse sequences (99%). You and Moriarty are both working on developing a method for classifying individual RNA-seq reads as “pathogen” versus “sand mouse”, based on the read’s sequence composition alone.

The lab has collected datasets that are available to you for training and testing methods: pathogen.fa (known pathogen sequences) and sandmouse.fa (sand mouse sequences).
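Both datasets are FASTA files, so a small parser is the first thing you'll need. A minimal sketch (the function name is mine; it assumes standard FASTA with `>`-prefixed name lines):

```python
def read_fasta(path):
    """Return a list of (name, sequence) tuples from a FASTA file."""
    seqs = []
    name, chunks = None, []
    with open(path) as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                if name is not None:                    # flush previous record
                    seqs.append((name, ''.join(chunks)))
                name, chunks = line[1:].split()[0], []  # name = first word after '>'
            else:
                chunks.append(line)
    if name is not None:                                # flush the last record
        seqs.append((name, ''.join(chunks)))
    return seqs
```

For example, `read_fasta('pathogen.fa')` would give you the pathogen sequences as a list of `(name, sequence)` pairs.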

Moriarty’s method

Moriarty is already crowing about his secret proprietary machine learning method. He shows the following (apparently spectacular) result:

Figure 1

The histogram on the left shows results of scoring the 10,000 pathogen sequences in pathogen.fa, compared to scoring 10,000 random sequences of uniform base composition (which are here if you want them). The ROC plot on the right shows that his method achieves perfect discrimination... or so Moriarty says.

1. test Moriarty’s method yourself

After some prying on your part, Moriarty reluctantly discloses that his secret proprietary machine learning method is to score +1 for each A or T, and -1 for each C or G.

You note that Moriarty showed results for discriminating against random sequences (of uniform base composition), but the problem is to discriminate against sand mouse sequences.

Implement Moriarty’s secret proprietary machine learning method, and test it on the sequences in pathogen.fa and sandmouse.fa. Using matplotlib, make a plot of your own showing a histogram and a ROC plot like those in Moriarty’s figure, but for performance in discriminating pathogen sequences from sand mouse sequences.
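Moriarty's rule is simple enough to sketch directly. The ROC helper below sweeps a decision threshold over all observed scores, highest first (the function names are mine, not Moriarty's):

```python
def moriarty_score(seq):
    """Moriarty's method: score +1 for each A or T, -1 for each C or G."""
    return sum(+1 if base in 'AT' else -1 for base in seq.upper())

def roc_points(pos_scores, neg_scores):
    """Sweep a threshold over all observed scores, from high to low;
    return parallel lists of false positive rate and true positive rate."""
    fpr, tpr = [], []
    for t in sorted(set(pos_scores) | set(neg_scores), reverse=True):
        tpr.append(sum(s >= t for s in pos_scores) / len(pos_scores))
        fpr.append(sum(s >= t for s in neg_scores) / len(neg_scores))
    return fpr, tpr
```

With the two score lists in hand, `plt.hist()` on each gives the histogram panel and `plt.plot(fpr, tpr)` gives the ROC panel.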

What do you conclude? Why are your results different from Moriarty’s?

2. make your own method

You think you can do better, using higher order Markov models.

Because you’re estimating the parameters of your models from the example data, you should train and test on separate data sets. Use half of each sequence set for training, and half for testing.
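One way to set this up is sketched below: a k-th order Markov model trained by counting, with a pseudocount for unseen events (the function names, the pseudocount, and the uniform fallback for unseen contexts are my assumptions, not a prescribed design):

```python
from collections import defaultdict
from math import log

def train_markov(seqs, k, pseudo=1.0):
    """Estimate k-th order conditional probabilities P(base | previous k bases)
    from a list of training sequences, with a pseudocount for unseen events."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(k, len(seq)):
            counts[seq[i-k:i]][seq[i]] += 1.0
    probs = {}
    for ctx, d in counts.items():
        total = sum(d.values()) + 4 * pseudo
        probs[ctx] = {x: (d.get(x, 0.0) + pseudo) / total for x in 'ACGT'}
    return probs

def log_prob(seq, probs, k):
    """Log probability of seq under a k-th order Markov model; contexts never
    seen in training fall back to a uniform 1/4 per base."""
    lp = 0.0
    for i in range(k, len(seq)):
        lp += log(probs.get(seq[i-k:i], {}).get(seq[i], 0.25))
    return lp
```

Train one model on the pathogen training half and one on the sand mouse training half (e.g. `seqs[:len(seqs)//2]` for training, the rest for testing), then score each test sequence by the log-odds ratio: `log_prob(seq, pathogen_model, k) - log_prob(seq, mouse_model, k)`.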

3. how good is your method?

Suppose the lab needs to achieve 90% sensitivity (i.e. detecting 90% of pathogen sequences). Using your test results, estimate the false positive rate your method incurs at the score threshold that achieves this sensitivity.

Remember that in an RNA-seq sample from an infected sand mouse, 1% of the reads are from the pathogen, and 99% of the reads are from the mouse. Using your estimates above, if you use your method to classify reads, what proportion of the reads that you label as “pathogen” are actually false positives from the sand mouse? (This is your false discovery rate, FDR.)
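The arithmetic follows directly from the read mixture: per sequenced read, the expected true positives are (pathogen fraction) × (sensitivity) and the expected false positives are (mouse fraction) × (false positive rate). A sketch, where the FPR value is a hypothetical placeholder to be replaced by your own measurement:

```python
def fdr(sensitivity, fpr, pathogen_frac=0.01):
    """Expected fraction of reads labeled 'pathogen' that are really mouse."""
    tp = pathogen_frac * sensitivity        # expected true positives per read
    fp = (1 - pathogen_frac) * fpr          # expected false positives per read
    return fp / (tp + fp)

fpr_at_90 = 0.05   # hypothetical: substitute your measured FPR at 90% sensitivity
print(fdr(0.90, fpr_at_90))
```

Note how strongly the 99:1 mixture amplifies even a small false positive rate: the mouse reads outnumber the pathogen reads so heavily that most "pathogen" calls can be false even when the classifier looks good on a balanced test.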

(Use Python code to answer these questions, using the results of your testing above.)

turning in your work

Email a Jupyter notebook page (a .ipynb file). Please name your file <LastName><FirstName>_MCB112_<psetnumber>.ipynb; for example, mine would be EddySean_MCB112_04.ipynb.