MCB112: Biological Data Analysis (Fall 2017)

homework 09:

the return of the ten Arcs

You fondly recall the circularly permuted structure of the Arc locus from hw02, which expresses ten overlapping mRNA transcripts called Arc1 through Arc10. The Arc locus consists of ten segments (labelled A through J). Each of the ten Arc mRNA isoforms starts at a corresponding segment, and is 2-4 segments long.

The sand mouse Arc locus consists of ten segments A-J. Starting at each segment, it is transcribed into ten overlapping mRNA transcripts, Arc1-Arc10, each of which is 2-4 segments long.

These overlaps made it nontrivial to estimate the abundance of the individual Arc mRNA isoforms by RNA-seq. A recent paper by Watson et al. in the Sand Mouse Journal has collected an RNA-seq dataset and made an attempt with their own program.

we’re using abstract “reads”

For the purposes of this exercise, we’re going to abstract the “sequencing reads”, as follows.

There are 10 different read “sequences” a..j, corresponding to which segment A..J the read maps to. The length of each transcript isoform , , is 2-4, measured in segments; for example, Arc1 has length , and it can produce four different reads a,b,c,d corresponding to its four segments A,B,C,D.

We’ll also assume that there is no mismapping (no basecalling error), so when we observe a “read” a-j, we know it maps to the corresponding segment A-J.

With this abstraction (which saves us from working in detail with actual sequences and sequence mapping, and lets us work in just the essentials!), we are given read counts for reads , and we’re going to use expectation maximization to estimate the unknown abundances of the 10 Arc mRNA isoforms. Our notation and description will parallel that of Li and Dewey (2010). Our goal is to work through the main contribution of that paper, which is to be able to deduce unknown transcript abundances from observed read counts, even in the presence of multimapped and mismapped reads.

the structure of the Arc locus

Watson et al. provide a supplementary data file that contains information about the structure of the Arc locus (same as shown in the Figure, above right):

Isoform Length Sequence
Arc1 4 ABCD
Arc2 2 BC
Arc3 3 CDE
Arc4 4 DEFG
Arc5 4 EFGH
Arc6 3 FGH
Arc7 2 GH
Arc8 2 HI
Arc9 3 IJA
Arc10 3 JAB

So, for example, segment A appears in three different transcripts Arc1, Arc9, and Arc10, so even if our read sequences were perfectly accurate, we wouldn’t know whether a read ‘a’ came from Arc1, Arc9, or Arc10. It came from one of them, proportional to their relative abundances, but their relative abundances are exactly what we’re trying to estimate from the read counts!

The supplementary data file also provides the observed data: mapped read counts from an RNA-seq experiment that collected a total of 1 million mapped reads:

read sequence counts
a 64317
b 39733
c 93892
d 91454
e 142010
f 135038
g 152355
h 161432
i 60450
j 59319

Watson et al. provide the script they used to estimate the expression levels of each Arc isoform. Their analysis is similar to the method called rescue in Li and Dewey (2010). First each read x is assigned to its cognate segment X, then read counts are assigned to isoforms uniformly: i.e. a read ‘a’ maps to segment A, and 1/3 count is given to Arc1, Arc9, and Arc10, the three isoforms that share segment A. This gives them estimates of nucleotide abundances for each transcript, which they convert to transcript abundances . (If multiplied by , the transcript abundances would be in TPM; I haven’t done that here because it doesn’t make sense for only 10 transcripts. [SRE: note added Saturday 8pm])

The estimates resulting from the Watson et al. analysis are:

Arc1 0.073
Arc2 0.068
Arc3 0.110
Arc4 0.122
Arc5 0.129
Arc6 0.125
Arc7 0.119
Arc8 0.107
Arc9 0.082
Arc10 0.065

Their conclusion is that all 10 mRNA transcripts are expressed at about the same level, within two-fold of each other.

These estimates are terrible, of course. We need something like EM, the approach described by Li and Dewey (2010), to get this right.

further explication

We need to set up some more technical explication before we get to the exercise.

First, let’s make a clear statement of the generative probability model we’re specifiying here: an abstract and simplified version of the model in Li and Dewey (2010).

We have two random variables: the transcript isoform , and a segment that an observed read maps to.

To generate one observed read, we assume there are two steps.

Each of the total reads is sampled independently one at a time by this process.

Therefore the joint probability is assumed to be factorized as:

Remember that the here is an mRNA abundance in fraction of nucleotides, not fraction of transcripts. The notation in Li and Dewey (2010) uses , making a distinction between and because they include a ‘noise’ isoform in their generative model, to model getting unmapped reads that don’t map back to the transcriptome at all. We don’t have that in this exercise, so for us, .

Although the probability model is working with nucleotide abundances internally, reported expression levels (the output of the Watson analysis, for example) are transcript abundances. You’l need to keep nucleotide versus transcript abundance straight in your analysis. Recall that:

To do the expectation step of expectation maximization, you’re going to need to obtain : i.e. if I show you one observed read, which isoform do you guess that it came from? You will assign each observed count proportional to your current estimate for the distribution. After you do this for all the reads , you have an expected read count assigned to each isoform .

To do the maximization step, you’re just going to normalize your expected counts to get a new estimate for .

OK, now we should be ready, in principle; though you may also need to refer both to Li and Dewey (2010) and to this week’s lecture notes.

1. write a simulator as a positive control

Write a script that simulates N=1000000 observed read counts, following the model specified above, for any Arc locus structure (i.e. lengths ) and any transcript abundances . (You can assume that there are 10 segments and isoforms, though my version of this script will allow that to vary too.)

Use you script to generate a test data set for known model parameters that you’ve chosen.

2. calculate the log likelihood

Write a function that calculates the log likelihood: the log probability of the observed data (the observed read counts ) if all the model parameters were known (i.e. ), for a given locus structure of Arc.

Calculate and show the log likelihood of your test data set, for your known parameter values.

Use Watson et al’s script to estimate abundances of each Arc isoform in your test data set. Compare to your true . (Terrible, right?) Calculate and show the log likelihood given these estimated , compared to the log likelihood for the true .

You should observe that the true parameter values give a much better log likelihood than the Watson et al. estimates, because the Watson et al. estimates are poor. Log likelihoods are negative, and higher (closer to 0) is better.

3. estimate isoform abundances by EM

Write a script that estimates unknown isoform abundances for each isoform Arc1..Arc10, given read counts and the structure of the Arc locus including the lengths , using expectation maximization.

Apply your script to the data in the Watson et al. supplementary data file. Show your estimated .

What do you think of Watson et al.’s conclusion that all 10 mRNA transcripts are expressed at about the same level? What are the most abundant two transcripts, and how much of the population to they account for? What are the least abundant two transcripts?

turning in your work

Email a Jupyter notebook page (a .ipynb file) to Please name your file <LastName><FirstName>_MCB112_<psetnumber>.ipynb; for example, mine would be EddySean_MCB112_09.ipynb.