MCB112: Biological Data Analysis (Fall 2017)

homework 02:

the adventure of the ten Arcs

The sand mouse genome has an unusual locus called Arc that expresses ten overlapping mRNA transcripts called Arc1 through Arc10.

The Arc mRNAs are circularly permuted, because the DNA at the Arc locus is excised from the genome and circularized. (Things like this do happen in biology – for example, there are extrachromosomal circular ribosomal DNAs in the frog Xenopus.) The Arc locus consists of ten segments (labelled A through J). Each segment is exactly 1000nt long, so the total Arc circle is 10kb. Each of the ten Arc mRNA isoforms starts at a corresponding segment, and is 2-4 segments (2000-4000nt) long.

The sand mouse Arc locus consists of ten segments A-J. Starting at each segment, it is transcribed into ten overlapping mRNA transcripts, Arc1-Arc10, each of which is 2-4 segments long.

You’ve just published part of your PhD thesis research, in which you accurately measured the per-nucleotide abundances for all ten Arc transcripts (and converted them to TPM), and you concluded that Arc1 and Arc6 are the rarest transcripts, by a factor of about 3x:

transcript abundance () TPM () length () segments_covered
Arc1 0.008 6000 4000 ABCD
Arc2 0.039 58000 2000 BC
Arc3 0.291 290000 3000 CDE
Arc4 0.112 83000 4000 DEFG
Arc5 0.127 94000 4000 EFGH
Arc6 0.008 7800 3000 FGH
Arc7 0.059 87000 2000 GH
Arc8 0.060 88000 2000 HI
Arc9 0.022 22000 3000 IJA
Arc10 0.273 270000 3000 JAB

Moriarty’s experiment

Moriarty, a new student in the lab, doesn’t think you know how to do RNA-seq experiments. He’s done another RNA-seq experiment on the sand mouse Arc locus himself, under the same experimental conditions as you.

He did 100,000 single-ended, non-strand-specific 75nt reads from a cDNA fragment library of mean length 150bp, standard deviation 20bp, with an estimated base-calling accuracy of 99.9% per base (Q30 bases). He analyzed his data with kallisto. Based on his results, Moriarty says that the Arc1-Arc10 transcript abundances (in TPM) are actually more like:

transcript TPM
Arc1 21000
Arc2 56000
Arc3 280000
Arc4 83000
Arc5 94000
Arc6 17000
Arc7 77000
Arc8 88000
Arc9 29000
Arc10 260000

and based on that, he says you underestimated the abundance of Arc1 by about 3- to 4-fold and Arc6 by about 2-fold – so much so, that he challenges one of your main conclusions. His analysis says that Arc1 and Arc6 aren’t as rare as you say, and that Arc9 is expressed about the same level as them.

There are no obvious differences in how Moriarty did his RNA-seq experiment - the only difference seems to be that he used kallisto to do his quantitation, whereas you used your own independent mapping and EM quantitation code.

You decide to reproduce Moriarty’s analysis, and to use a positive control simulation to test the accuracy of kallisto’s quantitation.

1. install kallisto

Download and install kallisto:

kallisto runs on Mac OS/X, Linux, and other unix-like systems. To get kallisto running on Windows 10, you’ll want to install Ubuntu bash.

2. reproduce Moriarty’s result

Moriarty provides you with a gzip’ed FASTA file containing the transcript sequences of the 10 Arc transcripts, and a gzip’ed FASTQ file containing his 100,000 reads.

Use kallisto to analyze these data and reproduce Moriarty’s result.

3. simulate an Arc transcriptome and RNA-seq reads

Write Python code to simulate an Arc locus, an Arc transcriptome, and 100,000 reads from an RNA-seq experiment on the Arc transcriptome. Your script generates two simulated data files that kallisto will take as its input:

Your simulation will be controlled by several parameters. It’ll be useful to leave them as settable parameters that you can play with in different simulations. For example, here’s a chunk from mine:

# Set up the Arc locus 
S         = 10           # Number of segments in the Arc locus (A..J)
T         = S            # Number of different transcripts (the same, one starting on each segment, 1..10)
N         = 100000       # total number of observed reads we generate
alpha     = 0.999        # base calling accuracy (Q30 bases, typical of current Illumina)
len_S     = 1000         # length of each segment (nucleotides)
len_Arc   = len_S * S    # total length of the Arc locus (nucleotides)
len_R     = 75           # read length
mean_frag = 150          # fragment size: mean (of a truncated Gaussian)
sd_frag   = 20           # fragment size: stdev

Set up your simulation so that it can either use the structure of the Arc locus shown above (i.e. those specific lengths and those specific abundances ), or it can generate new random choices for lengths and abundances .

To generate an Arc locus DNA sequence:

To generate the Arc1..Arc10 mRNA transcripts:

kallisto models the cDNA library fragment length distribution (so that it can calculate an “effective length” of each mRNA, correcting for the fact that library fragmentation and size selection selects against small cDNAs). So to generate each read, first have your simulation generate a random fragment, then generate a read from one of its ends:

4. test kallisto

Use your simulator to generate dataset(s) using the lengths for the Arc locus as shown above, and the abundances that you think are true (first table, above). Analyze the simulated data with kallisto.

Does kallisto get the correct answer?

5. “debug” kallisto

Maybe you just found that kallisto isn’t actually getting its estimation quite right.

Suggest a plausible reason that kallisto might be messing up the Arc analysis. What’s most unusual about Arc, that might violate kallisto’s assumptions somehow? Design at least one experiment that uses your simulator to test your hypothesis – i.e., identify a modification that gives simulated data that kallisto works fine on.

turning in your work

Email a Jupyter notebook page (a .ipynb file) that implements the five steps and your observations to Please name your file <LastName><FirstName>_MCB112_<psetnumber>.ipynb; for example, mine would be EddySean_MCB112_02.ipynb.