MCB112: Biological Data Analysis (Fall 2017)

week 01:

the importance of doing arithmetic

Let’s start with an introduction to genomes, genes, mRNA transcription, and RNA-seq experiments. This is mostly intended for those of you with little biology background. It’s also intended to be telegraphic. You should refer to molecular biology textbooks and RNA-seq protocol reviews for details. For now, we just need enough information to make sure everyone knows what various terms and numbers mean.

Even if you’re a biologist, though, you should find this intro useful. Throughout the course, we’ll emphasize quantitative intuition: thinking about biological systems in mathematical and physical terms. I don’t mean anything sophisticated by that. I mean thinking in terms of ballpark numbers, simple quantitative relationships, and common sense. You should have a feel, even just within an order of magnitude, of when experimental results make sense. If an RNA-seq experiment tells you your favorite gene is expressed at one million RNAs per cell, you should know that a typical cell doesn’t even have that many mRNAs total.

Physicists sometimes call this kind of rough and ready numerical thinking Fermi estimation – a nod to Enrico Fermi’s legendary capability for doing rough calculations quickly. And like everything that’s important in science, there’s a classic xkcd about it.

what’s in our genome, by the numbers

On the board, I’ll go through:

(Some people are surprised that known, annotated protein-coding gene transcription units account for 40% of the genome, because there’s been so much written about how “surprising” it is that much of the human genome is transcribed. It’s easy enough to verify just by counting in the annotation of the current human genome assembly.)

what is a gene (an RNA perspective)

Again on the board, I’ll sketch the standard canon of how we understand genes, especially from the perspective of RNA transcription (neglecting protein translation, at least for now, because the course is going to use RNA expression so much as an example of data analysis):

RNA content of a cell

To interpret an RNA-seq experiment we want to have some intuition for what we’re counting. How many RNAs are there in a cell, from each gene? Here, I’m going to use specific numbers from Jackson, Pombo, and Iborra, The balance sheet for transcription: an analysis of nuclear RNA metabolism in mammalian cells, though there’s many other relevant references as well.

A HeLa cell (a cultured human cancer cell line) divides every 22 hours, and it contains about 3.5M (3.5 million) ribosomes. This means the cell has to make 3.5M new ribosomes every 22h.

This presents an engineering challenge for the cell. It’s hard to produce ribosomes that fast. A major component of the ribosome are the SSU (small subunit) and LSU (large subunit) ribosomal RNAs, which add up to about 7Kb of mature rRNA, processed out of a 13Kb primary transcription unit. So the cell has to be synthesizing 13Kb x 3.5M / 22h / 3600 sec/hr = 600K RNA nucleotides per second, to keep up the pace.

Ribosomal RNA genes are produced by RNA polymerase I, which is specialized for rRNA synthesis (mRNAs are produced by polII), but it runs at about the same speed as polII (maybe a titch faster) – around 40nt/s. So we need about 15,000 active polI polymerases to polymerize 600K nucleotides/sec of rRNA.

We just can’t do it with just two rRNA loci. Physically, the polI machinery occupies about 50-100nt of DNA; even if we load them as fast and as tightly as possible, we could get about one polI complex every 100nt or so, which is indeed this is what’s observed in HeLa: about 100-120 active polI polymerases on each rRNA transcription unit. What to do?

The cell does it by parallelization. It amplifies the number of rRNA genes. The haploid human genome contains about 180 repeats of the ribosomal rDNA transcription unit, embedded in a 45kb repeat unit, in tandem arrays on chromosomes 13,14,15,21, and 22. About 8 Mb of our 3200 Mb genome is devoted to these tandemly arrayed rRNA units. (Not too far from the scattered 35Mb that codes all 20,000 annotated protein-coding genes.) According to Jackson et al. 2000, about 120-150 of the rRNA repeats appear to be transcriptionally active at any one time. (I don’t know what’s up with the rest; that’s an interesting discrepancy. I would guess they’re actually all active.) Now the factory’s balance sheet works: 120-150 active units times 100-120 polI polymerases on each times 40nt/s = 480K-720K nucleotides/sec.

Jackson et al. 2000 now switches to using some numbers from mouse L cells, but for our Fermi-estimation-purposes right now, all mammalian cells are roughly the same. In total, a growing mammalian cell is synthesizing about 3000K (3M) nucleotides/sec; 39% of that active synthesis is polI on pre-rRNA, 58% is polII on pre-mRNAs, and 3% is polIII on small RNA genes. (Here there’s a 2x discrepancy in Jackson’s numbers for pre-rRNA; 39% of 3000K/sec works out to 1200K/sec, while we had 600K/sec above. Close enough for Fermi! Could be a HeLa/L cell difference.) Assuming that polI, polII, and polIII all move at around 40-50nt/sec, then we have something like 30K active polI polymerases, 60K polII, and 3K polIII.

Those deduced numbers roughly agree with other independent measurements in HeLa cells, which have estimated there’s 15,000 active polI, 70,000 active polII, and 7,000 active polIII polymerases. Only a minority of RNA polymerases are actively transcribing genes at any given time; there’s an estimated total of 320K polII polymerases in a HeLa cell.

There are about 20,000 protein-coding genes, so if we have 60K active polII polymerases (and if we ignore other noncoding polII transcription, such as lncRNAs), we have about 3 RNA polymerases on each protein-coding gene. Only about half (10K) of genes are on in a given cell type, so call it an average of ~6 polymerases per active gene. Those polymerases are pretty lonely – the average polII transcription unit is around 40kb long, so polII polymerases are spaced about 10kb apart on typical active genes, in contrast to the jam-packed, hard-working polI rRNA transcription units.

The RNA population of big, intron-containing pre-mRNA transcripts is called heterogeneous nuclear RNA (hnRNA). Introns are spliced, typically rapidly, and splicing is happening on most transcripts even as they’re still being transcribed. Only 2-3% of hnRNA (i.e. pre-mRNA) reaches the cytoplasm as mature mRNAs, after splicing and processing. Introns are big.

In terms of total RNA at steady state (as opposed to synthesis rates), 75% of HeLa RNA is rRNA, 10% is tRNA, and 2.5% is mRNA (and I think the missing fraction in these numbers is a combination of mitochondrial RNA, nuclear pre-mRNA, and small noncoding RNAs). Assuming that the 75% rRNA consists of 3.5M copies of 7Kb of SSU+LSU, and assuming that mRNAs are ~2Kb and tRNAs are ~100nt, we can estimate there are about 30M tRNAs per cell (about 10 tRNAs per ribosome), and about 400K mRNAs (about one for every ten ribosomes). That turns out to be about right!

Not all genes are expressed in every cell type; typically, about half (10K) are expressed in any given cell type. So, ballpark, with 400K mRNAs/cell and 10K genes, we expect a mean of around 40 mRNAs/gene. (And again, we expect a skewed distribution, with a lower median, because some genes will be highly expressed.)

Figure 1c from [Hebenstreit et al 2011] showing that mRNA expression levels per gene are bimodal-ish, sort of as if there's an "off" and an "on" distribution. Notice that the x-axis is a logarithmic scale. Fitting a normal distribution to the _logarithm_ of the expression level means a so-called _lognormal_ distribution. If we plotted it just as expression level, we'd we'd see a skewed distribution with a fat right tail: most genes are expressed at moderate levels, and a few at high levels.

When genes are “off” they’re probably still being transcribed at some low rate. Turning genes “on” probably up-regulates them by about 100x (with a wide range, depending on the gene). A result that I think is important comes from Hebenstreit et al., RNA sequencing reveals two major classes of gene expression levels in metazoan cells, Molecular Systems Biology, 2011. Their key result is shown in the figure to the right. If you plot a histogram of RNA-seq expression levels in a single cell type, you tend to get a bimodal-ish distribution, as if there is an “off” peak at around 0.01-0.1 RPKM (very roughly, 0.01-0.1 mRNA/cell), and an “on” peak around 20 mRNA/cell.

This is so eerily close to what we expect from numerology that it’s part of the working model that I currently have in my head when I read RNA-seq papers:

Some things about RNA-seq experiments start to make a surprising amount of sense in light of these ballpark numbers. For example, many people are doing cell type specific transcriptomics these days (including my lab). We expect to see 100-1000x “enrichment” for many genes in cross cell type comparisons, but instead, in many protocols, it’s more typical to see 2-10x enrichment for genes that are “on” versus “off”. This probably reflects the fact that a cell type specific purification protocol isn’t perfect, and is carrying along substantial background contamination.

mRNA synthesis, decay, and steady state

Figure 2a and 2b from [Schwanhausser et al 2013] showing distribution of mammalian mRNA half-lifes (in hr), with a median around 9h, and mRNA steady-state levels (in mRNA/cell), with a median around 17.

Let’s suppose, as an approximation, that mRNA levels in a cell are at steady state. Let’s also assume that there are only two steps controlling mRNA levels: mRNAs are synthesized at rate (in mRNAs/hr/cell), and mRNAs decay at rate (in hr^-1). Steady state is when the number of new mRNAs synthesized equals the number that decay:

\begin{equation} k = \lambda \mathrm{[mRNA]}, \end{equation}

so at steady state,

\begin{equation} \mathrm{[mRNA]} = \frac{k}{\lambda}. \end{equation}

So if the synthesis rate is 2 mRNAs/hr/cell, and the mRNA decay rate is 0.1 hr^-1, then there are 20 mRNAs/cell at steady state.

mRNA decay is more usually expressed by RNA biochemists in terms of the half life (in hours), not as the decay rate. The half life is ; and .

Figure 3b from [Schwanhausser et al 2013] showing distribution of inferred mRNA synthesis rates, in mRNA/hr, with a median of around 2.

One reference for typical mammalian mRNA synthesis and decay rates is [Schwanhausser et al., 2013, Global quantification of mammalian gene expression control]. Some of their key figure panels are reproduced here to the right. In mammalian cells, typical synthesis rates are on the order of 2 mRNAs/hr; typical half lives are on the order of 9h; and steady-state mRNA copy numbers are on the order of 20.

If polII initiates at 2 mRNA/hr, on a ~30kb pre-mRNA transcription unit, as we saw above, the picture is very different from rRNA synthesis. At 40nt/s, it takes ~10m for polII to transcribe a typical locus, so at any given time there’s on the order of one polII complex somewhere on a typical gene - a protein-coding gene is a lonely back road, compared to the thick traffic on ribosomal RNA superhighways.

Notice that I’m sneakily slipping back and forth between “mean” and “typical” (i.e. median). If we were talking about normal (Gaussian) distributions of things, mean and median are the same thing. But the distributions we’re talking about are skewed: notice that the plots from the Schwanhausser paper are roughly lognormal. So I’m not too bothered by the discrepancy between one polII complex per “typical” gene, versus six polII’s per gene on average; the lognormal distribution shifts the mean upwards from the median, the same way that average household income in the US looks pretty good until you look at the median income.

With small numbers like 20 mRNA/cell, you can run into problems with continuous approximations like the steady-state equations above. This is why there’s a lot of interest in stochastic dynamics of small numbers of things per cell. For example, many promoters don’t fire at a steady rate; single cell experiments have shown that they “burst”, with bursts of initiation followed by more quiescent states. We’re neglecting that in these ballpark numbers.

This steady state approximation neglects that dilution effect of cell growth and division: if there’s 20 mRNAs/cell, then the cell has to make 20 new mRNAs per cell division to keep things constant. In a HeLa cell doubling every 22hr, that’s not an insignificant concern.

an RNA-seq experiment

Conceptually, an RNA-seq experiment is straightforward, but the details can matter quite a bit in the interpretation.

In principle, all we’re going to do is purify RNA transcripts from a biological sample, and measure the relative level of each transcript in that sample – , in units of transcripts per million transcripts (TPM) – essentially by counting a random sample of short RNA fragments.

In practice, we need to think about what’s going on at each step:

Nobody should be calling an RNA-seq experiment “unbiased”. A good description of an RNA-seq experiment should describe all the steps that may have biased the sequences that ended up being collected.

RNA-seq analysis: mapped read counts to TPM

Each gene expresses some number of mRNA transcripts per cell; let’s call that number . (We’re already making a strong simplifying assumption – that there’s only one mRNA splicing isoform per gene – but let’s go with that for now.) In an RNA-seq experiment, we’ve usually taken our RNA sample from an unknown number of cells, so we’re generally going to have to think in terms of relative proportions, not absolute numbers. In a large sample, the proportion of mRNA transcripts from gene is . These unknown expression levels are what we want to infer from our observed counts of mapped reads.

In RNA-seq data processing, we map short sequence reads to annotated transcript sequences, which gives us a number of mapped reads per transcript: . These counts are our observed data. The proportion of mapped reads that mapped to transcript is . We can also consider the normalized to be observed data.

The RNA-seq experiment doesn’t measure directly. We made a fragment library and obtained sequence reads from the ends of each fragment, so long mRNAs are more likely to be sampled than short ones. A basic assumption of short-read RNA-seq analysis is that we sample fragments, and therefore reads, from mRNA transcripts with a probability proportional to , where is the transcript length. (Imagine simulating the experimental process with a Python script.) So:

\begin{equation} \nu_i = \frac{\tau_i \ell_i}{\sum_j \tau_j \ell_j} \end{equation}

and therefore:

\begin{equation} \tau_i = \frac{\nu_i}{\ell_i} \left( \sum_j \frac{\nu_j}{\ell_j} \right)^{-1} \end{equation}

That is: first we normalize each individually by the mRNA length ; then we normalize by the total sum of over all genes; and this gives us our estimates of .

It’s convenient to scale by a constant that’s on the same order as the number of mRNAs per cell: one million. (Fermi estimation again.) Thus values are reported in units of “transcripts per million transcripts” (TPM), which you can think of as being on the order of mRNAs per cell (very roughly, because different cells have more or fewer than mRNAs in them).

Look back over that and you’ll see several assumptions that we could try to improve on. We don’t have to assume that each gene gives rise to only a single mRNA isoform, but because isoforms overlap (share exon sequences) we wouldn’t be able to assume that there was an observable 1:1 relationship between a mapped read and a particular mRNA ; we’d have to make a statistical model that treats the mapping of a read to a specific isoform as something we’d need infer. We also don’t have to assume that all nucleotide positions are uniformly sampled by short reads, because of biases in the RNA-seq library generation and sequencing procedure; we could model that bias. We’ll learn some techniques for such statistical modeling as we go forward in the course.


The mathematical model above was explained in a seminal paper from Colin Dewey’s group [Li et al., Bioinformatics 2010], and I’ve followed their notation. But it is not the procedure that was first used in RNA-seq, when RNA-seq was first introduced by Barbara Wold’s group [Mortazavi et al., Nature Methods 2008]. In Mortazavi et al., a different procedure was used, and gene expression estimates were obtained in units of “reads per kilobase per million mapped reads” (RPKM). (When people started using paired-end sequencing more often, it became more reasonable to talk in terms of mapping library fragments instead of independent reads, thus fragments per kilobase per million mapped reads (FPKM), but RPKM and FPKM are the same thing for our purposes.)

Converting mapped read counts to RPKM gives us . ( because per kilobase per million.) This is proportional to (within a given sample) but it’s unnormalized. We would need to normalize by if we want to convert RPKM to TPM: that is, by the abundance-weighted mean mRNA transcript length. If the abundance-weighted mean mRNA transcript length is 1kb, TPM and RPKM are the same thing.

So do we care? A problem with RPKM arises when we start trying to compare across samples, because different samples don’t necessarily have the same abundance-weighted mean mRNA transcript length. Maybe in one sample some long mRNAs went up, and some short mRNAs went down. Now we could see that an mRNA transcript , present at exactly the same proportion in the mRNA populations of the two samples, would show different RPKM measures just because some other genes shifted their expression and altered the abundance-weighted mean mRNA transcript length.

Don’t dis RPKMs too much though, even if it’s amusing to see Lior Pachter flagellate himself over it, because you can get a related normalization artifact from TPMs too, and it’s an artifact that’s probably even more common. The relative abundance of of transcript is obviously going to be affected by the absolute expression level of every other mRNA in the cell, so it is a mistake to assume that is proportional to the mRNA synthesis rate for gene $i$. If you turn a bunch of some genes up, the relative proportion of other genes must go down even if their absolute transcription rate remains unchanged. One common way this can happen is if you alter the growth rate, which changes the expression level of a large battery of genes having to do with making ribosomes (among other things).

There’s been an amazing amount of wailing and gnashing of teeth over RPKM vs. TPM. Much of it seems confusing and/or wrong to me. I think it’s a great example of people passing along lore and sounding authoritative without having a quantitative understanding of what they’re talking about. The Li et al. 2010 paper is clear and correct.