by Verena Volf and Kevin Mizes (2018)

You can download these notes as a Jupyter notebook.

by Verena Volf (2018)

RNA-seq is the technology of choice for genome-wide differential gene expression (DGE) experiments.

The starting point for an RNA-Seq experiment is a set of $n$ RNA samples, usually associated with different treatment conditions. Each sample is sequenced and the resulting short reads are mapped to an appropriate reference genome. The number of reads mapped to each genomic feature (such as gene, exon or transcript) of interest is counted. The number of reads of a sample $i$ that are mapped to gene $g$ are denoted $y_{gi}$.

Popular software tools for RNA seq analysis include baySeq, cuffdiff, DEGSeq, DESeq, DESeq2, EBSeq, edgeR (exact and glm modes), limma, NOISeq, PoissonSeq, and SAMSeq.
Out of those, **edgeR** and DESEq 2 are the most widely used tool due to their superior TP identification rate and well-controlled FDR at lower fold changes.

For this class we use the software edgeR. This section provides a short edgeR tutorial which can be run by copying the code below into R. The code is divided into 6 sections which are described below. Running the code requires that you downloaded R, BioConductor and edgeR for which instructions are provided in the description of homework 13.

In [ ]:

```
# RNA-seq analysis script
library(edgeR)
### 1. Download and clean data
# download data set
fname <- "http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz"
download.file(fname, destfile = "GSE49712_HTSeq.txt.gz")
data_raw <- read.table("GSE49712_HTSeq.txt.gz", header = TRUE)
#explore
# Each each row corresponds to a gene, and each column corresponds to a sample.
dim(data_raw)
head(data_raw)
tail(data_raw)
#Notice that the last five lines contain summary statistics and thus need to be removed prior to testing.
data_clean <- data_raw[1:(nrow(data_raw) - 5), ]
### 2. Remove genes with low expression
#create two groups
group <- substr(colnames(data_clean), 1, 1)
group
d <- DGEList(counts = data_clean, group = group)
dim(d) # 21711 x 10
head(cpm(d))
apply(d$counts, 2, sum) # total gene counts per sample
keep <- rowSums(cpm(d)>100) >= 2
d <- d[keep,]
dim(d) # 3605 x 10.
#reset the library sizes after filtering.
d$samples$lib.size <- colSums(d$counts)
d$samples
### 3. Normalization
d <- calcNormFactors(d)
d
### 4. Data exploration
cpm_log <- cpm(d, log = TRUE)
#heatmap(cor(cpm_log))
#MDS plot
plotMDS(d, method="bcv", col=as.numeric(d$samples$group))
### 5. estimate the dispersion
d <- estimateDisp(d)
sqrt(d$common.dispersion) # biological coefficient of variation, 0.0822604
plotBCV(d)
### 6. exact test
et <- exactTest(d)
dim(et)
results_edgeR <- topTags(et, n = nrow(data_clean), sort.by = "none")
head(results_edgeR$table)
#How many genes are differentially expressed at an FDR of less than 10%?
sum(results_edgeR$table$FDR < .1)
plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .1])
abline(h = c(-2, 2), col = "blue")
```

edgeR works on a table of integer read counts, with rows corresponding to genes and columns to independent libraries (RNA samples). The counts represent the total number of reads that map to each gene.

Each function in edgeR has its own online help page. For example, a description of the exactTest function can be obtained by typing ?exactTest or help(exactTest) at the R prompt.

We are going to use the data set generated by the Sequencing Quality Control (SEQC) project. The data set consists of two groups of data:

**Group A** is five technical replicates of the **Stratagene Universal Human Reference RNA**, which is a pool of ten human cell lines.

**Group B** is five technical replicates of **Ambion's Human Brain Reference RNA**, which is RNA samples pooled from multiple donors from different brain regions.

First, we get rid of genes which are not expressed or expressed at very low levels. Here I use a cutoff of 100 counts per million and only keep genes that have a cpm of 100 or greater for at least two samples. Counts per million are calculated with the edgeR function cpm() in R. At this step, we create a vector called "DataGroups" which labels each of the column. We then use this vector and the gene counts to create a DGEList, the object that edgeR uses for storing data from a differential gene expression experiment.

There are alternative ways to filter genes. Another simple method would be to choose a cutoff based on the median log2-transformed counts per gene per million mapped reads (cpm) and remove all genes with a median log2 cpm below the cutoff.

When we take a look at the dimension of our data table (dim(d)), we see that this filtering step reduces the dataset from around 21711 genes to about 3605. For the filtered genes, there is very little power to detect differential expression, so little information is lost by filtering.

We reset the library size (lib.size) after filtering. EdgeR will use the lib.size at later steps to calculate the effective library size by multiplying the lib.size and the norm.factors together.

The calcNormFactors() function of edgeR normalizes for RNA composition by finding a set of scaling factors for the library sizes that minimize the log-fold changes between the samples for most genes. Since it is expected that only a fraction of the genes is differentially expressed, this normalization step allows for better prediction of differentially expressed genes. The default method for computing these scale factors with edge R uses a trimmed mean of M-values (TMM) between each pair of samples. After scaling our original library we obtain an effective library size. The effective library size replaces the original library size in all downsteam analyses. Without this step, the default value of the edge R norm factor is 1 for all samples.

Before we continue our DGE analysis, we generate a plot that shows the sample relations based on multidimensional scaling (MDS). MDS is a means of visualizing the level of similarity of samples. The basic premise is that we make a plot so samples which are similar are near to each other in the plot while samples that are dissimilar are far from each other. Here is an example.

As expected, we see that samples in group A and B separate out, since they come from very different cell populations.

The first major step in the analysis of DGE data is to estimate the dispersion parameter for each gene, a measure of the degree of inter-library variation for that tag. Estimating the common dispersion gives an idea of overall variability across the genome for this dataset. Here we assume that everything has the same common dispersion.

EdgeR refers to the dispersion estimate as the "biological coefficient of variation" (technical biases are also included in this estimate).

plotBCV() plots the tagwise biological coefficient of variation (square root of dispersions) against log2-CPM.

The biological coefficient of variation is lower than normally seen in human studies (~0.4) because the samples are technical replicates.

After estimating the dispersion, we use the function exactTest() to evaluate whether genes are expressed differentially or not. The function tests for differential expression between two classes using a method similar in idea to the Fisher's Exact Test. The test results for the n most significant tags can be displayed with the topTags() function. Benjamini and Hochberg's algorithm is used to control the false discovery rate (FDR).

The MA plot plots the log2 fold change on the y-axis versus the average log2 counts-per-million on the x-axis. Differentially expressed genes with an FDR less than 10% are shown in red, and the blue lines represent a four-fold change in expression.

As we would expect from prior knowledge about our samples and from the Multidimensional scaling results, a large fraction of the genes are differentially expressed between the two groups.

Schurch, N. J., Schofield, P., Gierliński, M., Cole, C., Sherstnev, A., Singh, V., … Barton, G. J. (2016). How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA, 22(6), 839–851. https://doi.org/10.1261/rna.053959.115 (table)

https://gist.github.com/jdblischak/11384914 (edgeR intro)

https://web.stanford.edu/class/bios221/labs/rnaseq/lab_4_rnaseq.html (edgeR intro)

by Kevin Mizes (2018), edited by Gloria Ha (2019)

As Sean has gone through in his lectures, we need some way to account for errors in our tests. With the **false discovery rate (FDR)**, we can take into account the fact that a certain number of our tests will be incorrectly discovered. By definition, the FDR is defined as the proportion (not rate) of false discoveries among the total number of discoveries, where a discovery is a test that has rejected the null hypothesis.

We collect the data below (reproduced from lecture)

Condition 1 - Null hypothesis true | Condition 2 - Alternative hypothesis true | Total | |
---|---|---|---|

Test is significant (discovery) | $x$ | $r-x$ | $r$ |

Test is not significant (no discovery) | $n-x$ | $m-(r-x)$ | $m+n-r$ |

Total | $n$ | $m$ | $m+n$ |

By definition, we have:

$$\mathrm{p\,value} = \frac{x}{n}$$(probability of rejecting the null hypothesis if it is true). We also have $$\mathrm{FDR} = \frac{x}{r}$$

Recall that these varaibles can also go under the names true positive (TP), false positive (FP), true negative (TP), false negative (FN).

Condition 1 - Null hypothesis true | Condition 2 - Alternative hypothesis true | |
---|---|---|

Test is significant (discovery) | FP | TP |

Test is not significant (no discovery) | TN | FN |

Which then tells us $\mathrm{FDR} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TP}}$

For example, let's say we have some heart rate data, and want to detect if someone has Atrial fibrillation (Afib). Let's say our test is 95% accurate, but lifetime risk of Afib is ~25%. We test have data from 1000 heartbeats and our table looks something like:

Healthy heartbeat | Afib | Total | |
---|---|---|---|

Test detects Afib | 38 | 225 | 263 |

Test detects healthy | 712 | 25 | 737 |

Total | 750 | 250 | 1000 |

The FDR is 38/263 ~= 0.144

In the Bayesian interpretation, we simply add a conditional dependence: what is the posterior probability that the null hypothesis is true ($H_0=1$), given the test returns a discovery? We just use Bayes' rule:

$$P(H_0=1 | T=+) = \frac{P(T=+ | H_0=1) P(H_0=1)}{P(T=+)} = \frac{p_i \; P(H_0=1)}{p_i \; P(H_0=1) + (1-p_i) (1-P(H_0=1))}$$By definition, our p-value is the probability our test is positive given the null hypothesis is true. Let's plug in numbers for heart disease.

$$P(\mathrm{Healthy} \; |\; \mathrm{Test} = \mathrm{Afib}) = $$$$\frac{P(\mathrm{Test} = \mathrm{Afib} \;|\; \mathrm{Healthy}) P(\mathrm{Healthy})}{P(\mathrm{Test} = \mathrm{Afib} \;|\; \mathrm{Healthy}) P(\mathrm{Healthy}) + P(\mathrm{Test} = \mathrm{Afib} \;|\; \mathrm{Not\, Healthy}) P(\mathrm{Not\, Healthy})}$$$$= \frac{38/750 \cdot 750/1000}{38/750 \cdot 750/1000 + 225/250 \cdot 250/1000} = 0.144$$Benjamini and Hochberg formalized their false discovery rate as the expected value of $x/R$, the proportion of false discoveries in total discoveries. However, they noted there was some problem if there were no discoveries, and so more formally, they expand to the conditional probability (or expected value here).

$$ \mathrm{FDR} = E\left[\frac{x}{R} \; \bigg| \; R > 0\right] P(R > 0) $$However, this is distinct from what another researcher, John Storey defines as the positive FDR

$$ \mathrm{pFDR} = E\left[\frac{x}{R} \; \bigg| \; R > 0\right]$$The first equation is the proportion which false discoveries occur, and the second is the proportion that discoveries are false.

For example, you can set an FDR = 0.1, so that only 10% of your discoveries that occur are false, but if the probability of even having a discover is 0.5, then your pFDR threshold is 0.2 (the proportion of false discoveries).

With most of the data in genomics, $P(R>0)\approx 1$ so both estimates are approximately equal, so I don't think I'm going to discuss it more than this.

An issue arises when we have multiple hypotheses and a number of p-values. What if we tested for 1000 different heart diseases with this heart rate data? Some p-values (by definition) should be less than our significance threshold, so we need to control for the fact that we will see a number of false positives (sometimes called the look-elsewhere effect, which can lead to misleading news stories about how chocolate is good for you).

For multiple tests, we'll have a familywise error rate (FWER). For some threshold of significance $\alpha$ (Fisher set to 0.05). To correct for $m$ tests, we must accept or reject at a significance level $\frac{\alpha}{m}$. Or, our adjusted p-value becomes:

$$\widetilde{p} = m \cdot p$$For example, if we test 20 heart rates, what is the probability that due to chance, we'll observe at least one case of Afib from healthy data? We've calibrated our test so that it will return a false positive 5% of the time, (95% accurate).

$$ \begin{aligned} P(\mathrm{discovery} \geq 1 \;|\; P(\mathrm{healthy}_{1:20}) &= 1-P(\mathrm{no\,discovery} < 1 \;|\; \mathrm{healthy}_{1:20})\\ &= 1 - (.95)^{20} = .641 \end{aligned} $$With the Bonferroni correction, we must recalibrate our test so that the false positive only occurs 5%/20 = .25% of the time. (99.75% accurate!)

$$ \begin{aligned} P(\mathrm{discovery} \geq 1 \;|\; \mathrm{healthy}_{1:20}) &= 1-P(\mathrm{no\,discovery} < 1 \;|\; \mathrm{healthy}_{1:20})\\ &= 1 - (.9975)^{20} = .0488 \end{aligned} $$We can look at this more generally too. If we want to control for the fact that out of $n$ tests, there is only a 5% chance that there will be an accidental false discovery, we would write down:

$$ \begin{aligned} P(\mathrm{discovery} \geq 1 \;|\; \mathrm{healthy}_{1:n}) &= 1-P(P(\mathrm{no\,discovery} < 1 \;|\; \mathrm{healthy}_{1:n})\\ &= 1 - (1 - P(\mathrm{discovery} < 1 \;|\; \mathrm{healthy}))^n \\ &= 1 - \left(1 - \frac{0.05}{n}\right)^n \end{aligned} $$Which we can plot below.

In [1]:

```
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# display plots in the notebook
%matplotlib inline
# enable high res graphics inline
%config InlineBackend.figure_formats = {'png', 'retina'}
```

In [2]:

```
# set up n, alpha, P, Bonferroni P
n = np.arange(1,10)
alpha = 0.05
P = 1-(1-alpha)**n
PBonferroni = 1 - (1-alpha/n)**n
# plot Bonferroni P and uncorrected P
plt.plot(n,P,'r',label='no correction');
plt.plot(n,PBonferroni, 'b', label='Bonferroni');
plt.xlabel('n');
plt.ylabel('Prob');
plt.title('Prob of at least 1 discovery among healthy patients');
plt.legend();
plt.show();
```

The Bonferroni correction, while protects rejecting the null hypothesis when it really is true, is conservative, and does lead us to accept the null hypothesis more when it really is false. In Holm's method, instead of scaling our significance level by number of tests $m$, we scale by number of tests where the p-value is greater than our current p-value. Thus, for $j=1:m$, $\tilde{p_j} = (m-j+1) * p_j$

What makes this a step down proceudre is how we reject the null hypotheses. Starting at $j=0$, we adjust the p-value, and then make a comparison to our significance value $\alpha$. If $\tilde{p_j} \leq \alpha$, we continue, otherwise we stop, and reject ever null hypotheses $\leq j$ and accept those $ > j$

Let's say we're our significance threshold is 0.05, and we run four tests, giving p-values of 0.0121, 0.0142, 0.0391, and 0.1986. Using Holm's method, we first test $p_1$, $0.0121 < 0.05/4 = 0.0125$ and reject the null hypothesis. Since we have no rejected the hypothesis for observation 1, we only want to consider the remaining 3 hypotheses, and for our next comparison, we check $0.0142 < 0.05/3 = 0.0167$ which is also rejected. Next we compare $0.0391 < .05/2 = 0.025$. After this comparison we stop, and reject the null for our first two tests, but not for the second two.

This is another correction very similar to Holm's method, which basically runs the same process in the other direction.

From $j=m:1$, $\tilde{p_j} = (m-j+1) * p_j$. If $\tilde{p_j} < \alpha$, reject hypotheses $1:j$, else continue.

Going from this direction, Hochberg's will reject more null hypotheses than Holm's.

In [3]:

```
# let's code up the Holm's step up and Hochberg step down
# initialize a list of p-values in ascending order, its length, and set alpha = 0.05
p = [0.0021, 0.0044, 0.0093, 0.0134, 0.0157, 0.0389, 0.0758]
n = len(p)
alpha = 0.05
# reject null hypotheses with Holm's method
j = 0
while p[j] <= alpha/(n-j):
print('Holm\'s reject null hypothesis {} with adjusted pvalue {}'.format(j+1, p[j]*(n-j)))
j += 1
if j >= n:
break
# accept null hypotheses with Hochberg's method
j = n
while p[j-1] > alpha/(n-j+1):
print('Hochner\'s accept null hypothesis {} with adjusted pvalue {}'.format(j, p[j-1]*(n-j+1)))
j -= 1
if j == 0:
break
# print all adjusted p-values
padj = [p[j] * (n-j) for j in range(n)]
print('Adjusted p-values: {}'.format(padj))
```

We see that out of 7 p-values, Holm's rejected 3 null hypotheses, while Hochberg's method accepted 2 (rejected 5).

In 1995, work began on using the FDR as a more useful metric for significance. This results in fewer false positives, and we are looking at a fundamentally different number. P-value is given that the null hypothesis is true, and FDR is given that the test is significant.

Sean explains how we can estimate the expected value of the FDR in lecture 6:

```
Suppose you tested a million samples to get your 80,000 positives, at a per-test p-value threshold of <0.05. By the definition of the p-value you expected up to 50,000 false positives: maybe all the samples are in fact from the null hypothesis, and at a significance threshold α=0.05, you expect 5% false positives. So if you trust your numbers, the remaining 30,000 samples are expected to be true positives. You could say that the expected fraction of false positives in your 80,000 positives is 50000/80000 = 62.5%.
```

In this example we get a table like this, with a lot of unknowns

Null hypothesis | Alternative hypothesis | Total | |
---|---|---|---|

Positive test | Expect 50000 | 80000 | |

Negative test | |||

Total | 1000000 |

In the controlling procedure, we make a similar estimate to Holm's and Hochberg's procedure above. Instead of, like Bonferonni, scaling by the total number of tests, we scale by the number of tests less than our $jth$ p-value, $p_j$. This is because if we choose $p_j$ to be our significance cutoff, we'd expect $(m+n)\cdot p_j$ false positives from the total number of tests, and anything less than $p_j$ is just the number of positive tests $r$. Thus $$\mathrm{FDR} = x/r = p_j \cdot (m+n) / j$$

To make the a true step up procedure, we stop and reject when our adjusted FDR values fall below some threshold $\alpha$.

In [4]:

```
# use Rscript to run Wiggins' R script using mydata.tbl file in hw13
!Rscript analyze_W.r
```

Seems like that worked! Let's take a look at the results.

In [5]:

```
# load results in dataframe and display
data=pd.read_table('myresult.out',sep='\s+')
data.head(10)
```

Out[5]:

As a sanity check, let's see what the expected FDR would be if we arbitrarily set the p-value for gene `tomato`

as a threshold, and compare it to the FDR value for `tomato`

in the edgeR results file.

In [6]:

```
# store p-values in array, length of array
pvals = data['PValue'].values
n = len(pvals)
# set p-value threshold to the p-value of tomato
pr = data.loc['tomato'].PValue
FDR_edgeR = data.loc['tomato'].FDR
# calculate how many p-values are below the threshold
r = np.where(pvals < pr)
r = len(r[0])
# calculate FDR
FDR = n*data.loc['tomato'].PValue / r
print('Expected FDR for p-value threshold of {0:.3f}: {1:.3f}'.format(pr, FDR))
print('Compare to edgeR result: {0:.3f}'.format(FDR_edgeR))
```

We get an expected FDR of 58.7% if we set the p-value threshold to 0.105, which is very close to what edgeR produced.

Let's take a look at how many p-values would be considered significant by the Benjamini-Hochberg cutoff, keeping the expected FDR at 5%. In order to do this, we first calculate the number of positive tests ($r$) if we set the significance cutoff at each p-value, and then plot the BH significance cutoff.

In [7]:

```
# calculate r (number of positive tests) if we set the cutoff at each p-value
pvals = data['PValue'].values
rvals = []
for pr in pvals:
r = np.where(pvals < pr)
rvals.append(len(r[0]))
# plot P_r vs r
fig, ax = plt.subplots(figsize=(10,8))
ax.plot(rvals,pvals, 'ro',label='pvalues');
ax.set_ylabel('p-value');
ax.set_xlabel('r');
# plot fit line for our FDR cutoff
alpha = 0.05
n = len(pvals)
x2 = n
y2 = alpha
ax.plot([0, x2], [0, y2], color='k', linestyle='-', linewidth=2,label='BH significance cutoff');
ax.legend();
plt.show();
```

We see that very few genes are considered significant (certainly much fewer than 1905), and that the FDR cutoff is much stricter than a simple $p<0.05$.

The above method makes the assumption that everything comes from the null hypothesis (thus, we say $x$, false positives is the *total* tests times p-value). However, we might be pretty sure our alternative hypothesis is true in some cases. Thus, to find the $\mathrm{FDR} = x/r$, we need to find $x = p \cdot n$ (instead of $m+n$), where x is number of misdiscoveries, p is p-value, and n is number of tests from the null distribution.

How do we estimate $n$? One cool trick is to use the distribution of all our p-values. Under the null hypothesis we'd expect all of our p-values to be uniformly distributed. Under the alternative hypothesis, we should have a skewed distribution. Together we get a mixture that might look something like the graph below (taken from Storey and Tibshirani 2003).

Here, the dashed line is the density histogram of the uniform distribution if all the genes were from the null distribution. The dotted line is the estimated height of uniform distribution of the null p-values. The blue line ($\lambda=0.5$) is an example cutoff of where to begin estimating the truly null distribution.

For small p-values, we can assume that the mixture comes more strongly from the alternative hypothesis, but for larger p-values we can assume it comes only from the null. To estimate $n$, the number of tests from the null distribution, we simply count tests in the null distribution, high p-value regime and normalize.

$$ \hat{n} = \frac{\sum_j I(p_j > \lambda)}{(1-\lambda)} $$Where $p_j$ is the $jth$ p-value, $I()$ is the indicator function (1 if True), and $\lambda$ is a threshold we set to assume p-values are coming from only the null hypothesis. This is often represented as the proportion of null p-values, $\hat{\pi_0} = n / (n+m) = \sum_j I(p_j > \lambda) / ((n+m)(1-\lambda))$.

Then, for some threshold $\alpha$,

$$ \hat{FDR_i}(\alpha) = \frac{\hat{\pi_0} (m+n) \alpha}{\sum_j I(p_j < \alpha)} = \frac{\hat{n} \alpha}{\sum_j I(p_j < \alpha)} $$If we pick $\alpha = p_i$ for each of our $i$ p-values, and ordered $p_1 \leq p_2 \leq ... \leq p_{m+n}$, this is the same as above, just replacing our estimate $\hat{n}$ for $m+n$.

Storey defines the estimated q-value now as

$$ \hat{q(p_j)} = \mathrm{min}_{\alpha \geq p_j} \hat{FDR(\alpha)} $$Where we tune our $\alpha$ threshold to set the best FDR for each point. Calculating the q-value in a step-up procedure is easy. For sorted p-values $p_1 ... p_m$, $m$ total tests, $n$ tests that are from the null distribution, and threshold $\alpha$:

$$ \hat{q}(p_m) = \mathrm{min}_{\alpha \geq p_m} \frac{\hat{n} \alpha}{\sum_j I(p_j < \alpha)} = \frac{\hat{n} p_m}{m} $$And for stepping up from $i=(m-1):1$,

$$ \hat{q}(p_i) = \mathrm{min}_{\alpha \geq p_i} \frac{\hat{n} \alpha}{\sum_j I(p_j < \alpha)} = \mathrm{min}\big( \frac{\hat{n} p_i}{i}, \hat{q} (p_{i+1})\big) $$