MCB112: Biological Data Analysis (Fall 2017)
 we’re using abstract “reads”
 the structure of the Arc locus
 further explication
 1. write a simulator as a positive control
 2. calculate the log likelihood
 3. estimate isoform abundances by EM
 turning in your work
 hints
homework 09:
the return of the ten Arcs
You fondly recall the circularly permuted structure of the Arc locus from hw02, which expresses ten overlapping mRNA transcripts called Arc1 through Arc10. The Arc locus consists of ten segments (labelled A through J). Each of the ten Arc mRNA isoforms starts at a corresponding segment, and is 24 segments long.
The sand mouse Arc locus consists of ten segments AJ. Starting at each segment, it is transcribed into ten overlapping mRNA transcripts, Arc1Arc10, each of which is 24 segments long.
These overlaps made it nontrivial to estimate the abundance of the individual Arc mRNA isoforms by RNAseq. A recent paper by Watson et al. in the Sand Mouse Journal has collected an RNAseq dataset and made an attempt with their own program.
we’re using abstract “reads”
For the purposes of this exercise, we’re going to abstract the “sequencing reads”, as follows.
There are 10 different read “sequences” a..j, corresponding to which segment A..J the read maps to. The length of each transcript isoform , , is 24, measured in segments; for example, Arc1 has length , and it can produce four different reads a,b,c,d corresponding to its four segments A,B,C,D.
We’ll also assume that there is no mismapping (no basecalling error), so when we observe a “read” aj, we know it maps to the corresponding segment AJ.
With this abstraction (which saves us from working in detail with actual sequences and sequence mapping, and lets us work in just the essentials!), we are given read counts for reads , and we’re going to use expectation maximization to estimate the unknown abundances of the 10 Arc mRNA isoforms. Our notation and description will parallel that of Li and Dewey (2010). Our goal is to work through the main contribution of that paper, which is to be able to deduce unknown transcript abundances from observed read counts, even in the presence of multimapped and mismapped reads.
the structure of the Arc locus
Watson et al. provide a supplementary data file that contains information about the structure of the Arc locus (same as shown in the Figure, above right):
Isoform  Length  Sequence 

Arc1  4  ABCD 
Arc2  2  BC 
Arc3  3  CDE 
Arc4  4  DEFG 
Arc5  4  EFGH 
Arc6  3  FGH 
Arc7  2  GH 
Arc8  2  HI 
Arc9  3  IJA 
Arc10  3  JAB 
So, for example, segment A appears in three different transcripts Arc1, Arc9, and Arc10, so even if our read sequences were perfectly accurate, we wouldn’t know whether a read ‘a’ came from Arc1, Arc9, or Arc10. It came from one of them, proportional to their relative abundances, but their relative abundances are exactly what we’re trying to estimate from the read counts!
The supplementary data file also provides the observed data: mapped read counts from an RNAseq experiment that collected a total of 1 million mapped reads:
read sequence  counts 

a  64317 
b  39733 
c  93892 
d  91454 
e  142010 
f  135038 
g  152355 
h  161432 
i  60450 
j  59319 
Watson et al. provide the script they used to
estimate the expression levels of each Arc isoform. Their analysis is
similar to the method called rescue
in
Li and Dewey (2010).
First each read x is assigned to its cognate segment X, then read
counts are assigned to isoforms uniformly: i.e. a read ‘a’ maps to
segment A, and 1/3 count is given to Arc1, Arc9, and Arc10, the three
isoforms that share segment A. This gives them estimates of nucleotide
abundances for each transcript, which they convert to
transcript abundances . (If multiplied by , the
transcript abundances would be in TPM; I haven’t done that here
because it doesn’t make sense for only 10
transcripts. [SRE: note added Saturday 8pm])
The estimates resulting from the Watson et al. analysis are:
isoform  

Arc1  0.073 
Arc2  0.068 
Arc3  0.110 
Arc4  0.122 
Arc5  0.129 
Arc6  0.125 
Arc7  0.119 
Arc8  0.107 
Arc9  0.082 
Arc10  0.065 
Their conclusion is that all 10 mRNA transcripts are expressed at about the same level, within twofold of each other.
These estimates are terrible, of course. We need something like EM, the approach described by Li and Dewey (2010), to get this right.
further explication
We need to set up some more technical explication before we get to the exercise.
First, let’s make a clear statement of the generative probability model we’re specifiying here: an abstract and simplified version of the model in Li and Dewey (2010).
We have two random variables: the transcript isoform , and a segment that an observed read maps to.
To generate one observed read, we assume there are two steps.
 Choose a transcript isoform with probability , according to the nucleotide abundances; i.e. .
 Given isoform , we choose one of the segments of uniformly – i.e. for isoform of length , we sample its segments with probability .
 Given that a read emanates from segment , it will be correctly mapped to an observed read . (We could introduce a basecall error model and a mismapping rate, but we keep it simple.)
Each of the total reads is sampled independently one at a time by this process.
Therefore the joint probability is assumed to be factorized as:
Remember that the here is an mRNA abundance in fraction of nucleotides, not fraction of transcripts. The notation in Li and Dewey (2010) uses , making a distinction between and because they include a ‘noise’ isoform in their generative model, to model getting unmapped reads that don’t map back to the transcriptome at all. We don’t have that in this exercise, so for us, .
Although the probability model is working with nucleotide abundances internally, reported expression levels (the output of the Watson analysis, for example) are transcript abundances. You’l need to keep nucleotide versus transcript abundance straight in your analysis. Recall that:
To do the expectation step of expectation maximization, you’re going to need to obtain : i.e. if I show you one observed read, which isoform do you guess that it came from? You will assign each observed count proportional to your current estimate for the distribution. After you do this for all the reads , you have an expected read count assigned to each isoform .
To do the maximization step, you’re just going to normalize your expected counts to get a new estimate for .
OK, now we should be ready, in principle; though you may also need to refer both to Li and Dewey (2010) and to this week’s lecture notes.
1. write a simulator as a positive control
Write a script that simulates N=1000000 observed read counts, following the model specified above, for any Arc locus structure (i.e. lengths ) and any transcript abundances . (You can assume that there are 10 segments and isoforms, though my version of this script will allow that to vary too.)
Use you script to generate a test data set for known model parameters that you’ve chosen.
2. calculate the log likelihood
Write a function that calculates the log likelihood: the log probability of the observed data (the observed read counts ) if all the model parameters were known (i.e. ), for a given locus structure of Arc.
Calculate and show the log likelihood of your test data set, for your known parameter values.
Use Watson et al’s script to estimate abundances of each Arc isoform in your test data set. Compare to your true . (Terrible, right?) Calculate and show the log likelihood given these estimated , compared to the log likelihood for the true .
You should observe that the true parameter values give a much better log likelihood than the Watson et al. estimates, because the Watson et al. estimates are poor. Log likelihoods are negative, and higher (closer to 0) is better.
3. estimate isoform abundances by EM
Write a script that estimates unknown isoform abundances for each isoform Arc1..Arc10, given read counts and the structure of the Arc locus including the lengths , using expectation maximization.
Apply your script to the data in the Watson et al. supplementary data file. Show your estimated .
What do you think of Watson et al.’s conclusion that all 10 mRNA transcripts are expressed at about the same level? What are the most abundant two transcripts, and how much of the population to they account for? What are the least abundant two transcripts?
turning in your work
Email a Jupyter notebook page (a .ipynb
file) to
mcb112.teachingfellow@gmail.com. Please name your
file <LastName><FirstName>_MCB112_<psetnumber>.ipynb
; for example,
mine would be EddySean_MCB112_09.ipynb.
hints

First of all, as usual: don’t panic. EM algorithms are conceptually straightforward, though it takes some thinking to keep details straight. As with anything computational, it’s easy to make all sorts of frustrating and silly mistakes when you’re implementing this. Expect to have to debug your scripts.

That’s why we started with “write a data simulation script”. Implementing methods that are conceptually straightforward but prone to error and confusion in implementation is a totally common situation in biological data analysis. A strong approach to it is to first think about what control experiments you’re going to do to make sure that your implementation gets the right answer, if you happened to know the right answer. When we have a generative probability model in mind, we can do a very strong positive control: we can sample synthetic data sets from our generative probabilistic model. If you can’t get the right answer on ideal data (drawn exactly from the same model) you’re not likely to get the right answer on real data.

Had Watson et al. done a positive control like this, they would have realized that their analysis script couldn’t solve this problem correctly. (Even if the name of their script hadn’t already tipped them off.)

You can’t write the simulator without understanding the generative model. So by having you write the simulator first, we’re not just giving you a positive control, we’re also warming you up, making sure you understand the conceptual generative model in Li and Dewey (2010). If you’re doing biological data analysis, you’ll find that the exercise of writing a generator for synthetic positive and negative control datasets forces you to think in terms of generative probability modeling, and forces you to state your assumptions explicitly.

If (when) you run into trouble implementing EM, you can use your generator to generate simpler cases to help you debug. One example is to set all the to 1. Now each read is directly measuring the abundance of transcript isoform . The naive analysis of Watson et al’s script works fine in this case, and so should yours.

When you calculate the log likelihood, note that you have to marginalize over “hidden variables” of the model, here the isoforms , to get the likelihood .

In your EM script, it’s a good idea to use your log likelihood calculation at each iteration, to track the log likelihood for each successive estimation of . EM is essentially a steepestdescents optimizer, so you expect to see your log likelihood improve at each step. On this particular problem, Li and Dewey (2010) proved in their supplementary material that there is only a single global optimum, so you expect the log likelihood of your final estimate to be at least equal to the log likelihood of then true – this is an excellent condition to check. If you want some practice plotting data with
matplotlib
and friends, a plot of log likelihood versus iteration number is a good thing to be looking at. 
For an example where everything is known, that you can use as an additional control that your scripts are working, see this example data file. This gives you a different Arc structure and lengths , known , and observed read counts from my simulator. The results of running my EM script on those data are in this example output file, including the log likelihoods. (My answer doesn’t include the multinomial coefficient term in the log likelihood calculation. It’s a constant with respect to the abundance parameters, so it’s not relevant; it’s common to drop the multinomial coefficient in problems like this, leaving a log likelihood that’s only correct up to a missing constant. [note added Saturday 8pm]) You can use this example to test each of your three scripts: you can check that your simulator generates similar counts given these parameters; you can check that you get the same log likelihood as I do given true . You can also check that your EM implementation gets as close to the true as I do. It will probably be most useful for checking that your generator is working, because once you have your generator, you can generate any number of examples like this for yourself.

How many iterations of EM should you do? In my solution, I just do a lot (namely, 1000). It would be more sophisticated to test for convergence. You could check that the log likelihood has stopped improving; though sometimes you’ll get in situations where your parameter estimates are changing while your log likelihood is only changing gradually, because you’re skating along a plateau in the likelihood surface. Another convergence test is to check whether the estimated parameters are changing; but then you have to define what your (scalar) measure of change for the (vector) is going to be, and there’s different ways you might do that. This is a bit of an art form, and it’s ok not to worry about it for this exercise. Our iterations are not expensive, and you can just do a bunch of iterations and beat the problem into submission, like my solution is going to do.