- Moriarty’s experiment
- 1. install kallisto
- 2. reproduce Moriarty’s result
- 3. simulate an Arc transcriptome and RNA-seq reads
- 4. test kallisto
- 5. “debug” kallisto
- turning in your work
the adventure of the ten Arcs
The sand mouse genome has an unusual locus called Arc that expresses ten overlapping mRNA transcripts called Arc1 through Arc10.
The Arc mRNAs are circularly permuted, because the DNA at the Arc locus is excised from the genome and circularized. (Things like this do happen in biology – for example, there are extrachromosomal circular ribosomal DNAs in the frog Xenopus.) The Arc locus consists of ten segments (labelled A through J). Each segment is exactly 1000nt long, so the total Arc circle is 10kb. Each of the ten Arc mRNA isoforms starts at a corresponding segment, and is 2-4 segments (2000-4000nt) long.
The sand mouse Arc locus consists of ten segments A-J. Starting at each segment, it is transcribed into ten overlapping mRNA transcripts, Arc1-Arc10, each of which is 2-4 segments long.
You’ve just published part of your PhD thesis research, in which you accurately measured the per-nucleotide abundances for all ten Arc transcripts (and converted them to TPM), and you concluded that Arc1 and Arc6 are the rarest transcripts, by a factor of about 3x:
|transcript||abundance ()||TPM ()||length ()||segments_covered|
Moriarty, a new student in the lab, doesn’t think you know how to do RNA-seq experiments. He’s done another RNA-seq experiment on the sand mouse Arc locus himself, under the same experimental conditions as you.
He did 100,000 single-ended, non-strand-specific 75nt reads from a cDNA fragment library of mean length 150bp, standard deviation 20bp, with an estimated base-calling accuracy of 99.9% per base (Q30 bases). He analyzed his data with kallisto. Based on his results, Moriarty says that the Arc1-Arc10 transcript abundances (in TPM) are actually more like:
and based on that, he says you underestimated the abundance of Arc1 by about 3- to 4-fold and Arc6 by about 2-fold – so much so, that he challenges one of your main conclusions. His analysis says that Arc1 and Arc6 aren’t as rare as you say, and that Arc9 is expressed about the same level as them.
There are no obvious differences in how Moriarty did his RNA-seq experiment - the only difference seems to be that he used kallisto to do his quantitation, whereas you used your own independent mapping and EM quantitation code.
You decide to reproduce Moriarty’s analysis, and to use a positive control simulation to test the accuracy of kallisto’s quantitation.
1. install kallisto
Download and install kallisto:
kallisto runs on Mac OS/X, Linux, and other unix-like systems. To get kallisto running on Windows 10, you’ll want to install Ubuntu bash.
2. reproduce Moriarty’s result
Use kallisto to analyze these data and reproduce Moriarty’s result.
3. simulate an Arc transcriptome and RNA-seq reads
Write Python code to simulate an Arc locus, an Arc transcriptome, and 100,000 reads from an RNA-seq experiment on the Arc transcriptome. Your script generates two simulated data files that kallisto will take as its input:
- a FASTA format file of the transcriptome
- a FASTQ format file of the reads
Your simulation will be controlled by several parameters. It’ll be useful to leave them as settable parameters that you can play with in different simulations. For example, here’s a chunk from mine:
# Set up the Arc locus # S = 10 # Number of segments in the Arc locus (A..J) T = S # Number of different transcripts (the same, one starting on each segment, 1..10) N = 100000 # total number of observed reads we generate alpha = 0.999 # base calling accuracy (Q30 bases, typical of current Illumina) len_S = 1000 # length of each segment (nucleotides) len_Arc = len_S * S # total length of the Arc locus (nucleotides) len_R = 75 # read length mean_frag = 150 # fragment size: mean (of a truncated Gaussian) sd_frag = 20 # fragment size: stdev
Set up your simulation so that it can either use the structure of the Arc locus shown above (i.e. those specific lengths and those specific abundances ), or it can generate new random choices for lengths and abundances .
To generate an Arc locus DNA sequence:
- assume S=10 segments, A-J
- assume each segment is 1000 bp long
- make random DNA of uniform base composition, 25% each base (ACGT)
- your total DNA sequence will be 10,000 bp
To generate the Arc1..Arc10 mRNA transcripts:
- extract them from the Arc locus by their coordinates (being careful to handle the circular permutation!)
- write them in FASTA format to a
kallisto models the cDNA library fragment length distribution (so that it can calculate an “effective length” of each mRNA, correcting for the fact that library fragmentation and size selection selects against small cDNAs). So to generate each read, first have your simulation generate a random fragment, then generate a read from one of its ends:
sample a transcript according to its nucleotide abundance ;
pick a random fragment length at least as long as one read from a truncated Gaussian of mean length 150, standard deviation 20 (truncated because: no shorter than a read, and no longer than the whole transcript), i.e. something like:
while True: fraglen = int(np.random.normal(mean_frag, sd_frag)) if fraglen >= len_R: break if fraglen > L[i]: fraglen = L[i]
pick a random fragment of that length from transcript ;
pick a 75nt read by choosing a random orientation, taking the read from one of the two ends of the fragment. If on the top strand, your read is the first 75nt of the fragment. If on the bottom strand, your read is the last 75nt of the fragment, reverse complemented.
generate a simulated read sequence by adding simulated base calling errors to the 75nt read: given base calling accuracy , at each base, with probability , change it to one of the other 3 bases.
output the read in FASTQ format to your read file. Repeat for all 100K reads.
4. test kallisto
Use your simulator to generate dataset(s) using the lengths for the Arc locus as shown above, and the abundances that you think are true (first table, above). Analyze the simulated data with kallisto.
Does kallisto get the correct answer?
5. “debug” kallisto
Maybe you just found that kallisto isn’t actually getting its estimation quite right.
Suggest a plausible reason that kallisto might be messing up the Arc analysis. What’s most unusual about Arc, that might violate kallisto’s assumptions somehow? Design at least one experiment that uses your simulator to test your hypothesis – i.e., identify a modification that gives simulated data that kallisto works fine on.
turning in your work
Email a Jupyter notebook page (a
.ipynb file) that implements the
five steps and your observations to
firstname.lastname@example.org. Please name your
<LastName><FirstName>_MCB112_<psetnumber>.ipynb; for example,
mine would be
Conversion between “nucleotide abundance” units (in Li and Dewey’s notation) versus “transcript abundance” units in TPM () is a pesky detail you have to keep straight. Recall:
and multiply by to convert a fraction to “transcripts per million” (TPM). You can check for yourself that the
est_countsper transcript in a kallisto
abundance.tsvfile are proportional to : if you divide by
eff_length, renormalize, and multiply by you get kallisto’s TPM estimate.
As far as I know, kallisto doesn’t pay any attention to the quality values (QV’s) in the FASTQ file. The example data that comes with kallisto uses a QV code of ‘I’ for every base; ‘I’ is ASCII character 73, 73-33=40, so that’s a Q40 base, which means an error probability of . (Why yes, I do have an ASCII code table pinned above my computer. Don’t you?)