pset 08: the adventure of the chimeric reads

Moriarty has invented an amazing new single-molecule DNA sequencing method based on quantum 'telepod' technology from his collaborator Seth Brundle. In one of his first demonstration experiments, he sequenced a mixture of DNA from two phage, phage T4 and the SEA-PHAGE AluminumJesus. Unfortunately, every read seems to be a chimera composed of a piece of T4 DNA and a piece of AluminumJesus DNA (in either order: T4+AluminumJesus or AluminumJesus+T4). While Moriarty looks into what went wrong with his telepods, you decide to use his read data to practice what you've learned this week about HMMs.

Here are 20 of Moriarty's chimeric reads, in FASTA format:

moriarty-reads.fa (20 sequences, L=200 each)

By reading the logs of the telepod, your enterprising young assistant Wiggins has been able to decipher the ground truth for each read: which genome each segment came from, and where the breakpoint was. You can download this ground truth data here, as a tab-delimited file:

moriarty-reads.truth

Each (non-comment) line of this file has five fields: <readname> <L> <k> <g0> <g1>:

readname: name of the read
L: length of the read (all 200, in this example)
k: breakpoint of the chimera. Residues 1..k-1 are from genome g0. Residues k..L are from genome g1.
g0: which phage the first chimeric segment comes from: 1=AluminumJesus, 2=T4.
g1: which phage the second chimeric segment comes from.

The sequences are in the same order in the two files.

The HMM (is given to you)

You know the residue composition of AluminumJesus DNA (about 67% GC) and T4 DNA (about 35% GC). You also know that the mean chimeric segment length from each source genome is 100, since the reads are L=200 long and are composed of two segments from each of the two source genomes.

This is sufficient information to specify a simple HMM with two main states, one for each source genome, plus a special state 0 for start/end. State 1 generates residues with AluminumJesus composition, and has a self-loop of probability $0.99$, a transition to state 2 (T4 composition) with probability $0.005$, and a terminal transition of probability $0.005$ to the special state 0. State 2 is analogous, but with T4 composition. State 0 can begin in either state 1 or 2 with equal probability 0.5. In code, that is:

I really wanted to have you do Baum-Welch EM to estimate your own HMM, instead of specifying one, but Baum-Welch EM is computationally expensive. I wasn't able to come up with a good problem that could be solved in reasonable time.

def make_hmm():
     p = 1/200   # Mean seg length of 100. 

     t = np.array([[ 0.0, 1/2,   1/2 ],     # state 0 = start/end
                   [ p, 1-2*p,     p ],     # state 1 = AluminumJesus
                   [ p,     p, 1-2*p ]] )   # state 2 = T4

     e = np.array([[   1/4,   1/4,   1/4,   1/4 ],    # unused; start/end state doesn't emit
                   [ 0.166, 0.334, 0.334, 0.166 ],    # AluminumJesus residue composition
                   [ 0.323, 0.177, 0.177, 0.323 ]] )  # T4 residue composition
     return (t,e)

1. Implement the standard HMM algorithms

Implement four algorithms for HMM analysis of an input sequence:

Viterbi: find an optimal state path $\pi$ (and its log probability $\log P(x, \pi \mid \theta)$), given the HMM parameters $\theta$ and a sequence $x$.
Forward: calculate the Forward dynamic programming matrix and the total log probability $\log P(x \mid \theta)$, given the HMM parameters $\theta$ and a sequence $x$.
Backward: calculate the Backward dynamic programming matrix, given the HMM parameters $\theta$ and a sequence $x$. (Backward also gives you the total log probability $\log P(x \mid \theta)$, which you already got from Forward; you can check that they're equal within machine roundoff error.)
Decoding: given the Forward and Backward matrices, do posterior decoding to calculate $P(\pi_i = k \mid x, \theta)$, the probability that residue $i$ was generated by state $k$.

Implement these functions to be able to take an HMM of any number of states, not just this particular $M=2$ HMM. The HMM is specified by an $(M+1) \times (M+1)$ transition probability matrix $t_{kj}$ and an $(M+1) \times 4$ emission probability matrix $e_k(a)$, where state 0 is the special begin/end state. State 0 can transition to any main state $1..M$ to begin, and any main state $1..M$ can transition to state 0 to end.

Additionally, write a function that takes the posterior decoding matrix and uses it to infer a state path. The best (most probable) state assignment for each position $i$ is $\hat{k} = \max_k P(\pi_i = k \mid x, \theta)$. Set a threshold (like 0.9, maybe), and if $P(\pi_i = \hat{k})$ is less than this threshold, call the state assignment as "unknown/uninferred"; otherwise, assign $\pi_i = \hat{k}$.

Finally, write a function that compares a ground truth state path and an inferred state path (either a Viterbi path, or your decoded path), and counts correct (T) and false (F) assignments, ignoring any "unknown" calls in a decoded state path. The accuracy of an inferred path is $\frac{T}{T+F}$.

2. Do synthetic positive controls

Use the HMM to generate some synthetic positive control sequences. Hint: these sequences will be variable length, because the self-loops on the HMM states produce geometrically distributed segment lengths, not fixed lengths.

Analyze the generated sequences with your HMM algorithms, obtaining a Viterbi state path and a decoded state path. Evaluate the accuracy of each path, for each generated sequence.

What trend(s) do you observe in these accuracies, with respect to features of the generated sequences?

What do you observe about these generated sequences, compared to the actual features of Moriarty's reads? What are a couple of discrepancies where the HMM model is imperfectly describing what the actual data look like?

3. Analyze Moriarty's read sequences

Use the HMM to analyze the 20 sequences in the file of Moriarty's reads. Here you'll need at least two additional functions: a function to parse the read sequences in the moriarty-reads.fa FASTA file, and a function to parse the ground truth table in moriarty-reads.truth and convert those data for each read sequence into state paths.

Does the HMM accurately infer the structure of each chimeric read - which residues came from which source genome? How does posterior decoding compare to Viterbi for doing this inference?

4. Suggest a better HMM, and a better approach altogether

This was good practice with HMM algorithms and all, but when you think about the kinds of sequences this HMM generates (and look at your generated positive control data, compared to the actual sequences), you've seen that the data generated by the model is different from the actual Moriarty read data in some important ways.

Can you suggest an HMM architecture that generates read sequences that are a better fit to the observed Moriarty read data?

Can you suggest a simple probability model that is not an HMM (not explicitly anyway), but which generates read sequences that are an even better fit to the observed Moriarty read data? Hint: a key idea here is that each read is a fixed length L in length, and has one and only one position k at which the source genome switches. This isn't so different from last week's motif finding.

allowed modules

Our list of allowed modules now includes NumPy, Matplotlib, Pandas, Seaborn, and SciPy's stats and special modules.

You will probably want SciPy's special.logsumexp() for summing probabilities that you're storing as log probabilities, in the Forward and Backward algorithms.

My imports, for example:

import numpy as np
import matplotlib.pyplot as plt     
import scipy.special as special     

%matplotlib inline

turning in your work

Submit your Jupyter notebook page (a .ipynb file) to the course Canvas page under the Assignments tab. Name your file <LastName><FirstName>_<psetnumber>.ipynb; for example, mine would be EddySean_08.ipynb.

We grade your psets as if they're lab reports. The quality of your text explanations of what you're doing (and why) are just as important as getting your code to work. We want to see you discuss both your analysis code and your biological interpretations.

Don't change the names of any of the input files you download, because the TFs (who are grading your notebook files) have those files in their working directory. Your notebook won't run if you change the input file names.

Remember to include a watermark command at the end of your notebook, so we know what versions of modules you used. For example,

%load_ext watermark
%watermark -v -m -p jupyter,numpy,matplotlib

w00: intro	lecture	section	"pset"	"answers"
w01: genomes	lecture	[molbio] [py]	pset	answers
w02: probability	lecture	section	pset	answers
w03: data	lecture	section	pset	answers
w04: alignment	lecture	section	pset	answers
w05: Bayes	lecture	section	pset	answers
w06: p-values	lecture	section	pset	answers
w07: EM	lecture	section	pset	answers
w08: HMMs	lecture	section	pset	answers
w09: k-means	lecture	section	pset	answers
w10: regression	lecture	section	pset	answers
w11: PCA	lecture	section	pset	answers
w12: work	lecture	-	-	-
w13: trees	lecture	section	pset	answers