RNA-seq is the technology of choice for genome-wide differential gene expression (DGE) experiments.
The starting point for an RNA-Seq experiment is a set of n RNA samples, usually associated with different treatment conditions. Each sample is sequenced and the resulting short reads are mapped to an appropriate reference genome. The number of reads mapped to each genomic feature (such as gene, exon or transcript) of interest is counted. The number of reads of a sample i that are mapped to gene g are denoted ygi.
Popular software tools for RNA seq analysis include baySeq, cuffdiff, DEGSeq, DESeq, DESeq2, EBSeq, edgeR (exact and glm modes), limma, NOISeq, PoissonSeq, and SAMSeq. Out of those, edgeR and DESEq 2 are the most widely used tool due to their superior TP identification rate and well-controlled FDR at lower fold changes.
For this class we use the software edgeR. This section provides a short edgeR tutorial which can be run by copying the code below into R. The code is divided into .. parts, which are described in the section notes. Running the code requires that you downloaded R, BioConductor and edgeR for which instructions are provided in the description of homework 8.
# RNA-seq analysis script library(edgeR) ### 1. Downdload and clean data # download data set fname <- "http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE49712&format=file&file=GSE49712_HTSeq.txt.gz" download.file(fname, destfile = "GSE49712_HTSeq.txt.gz") data_raw <- read.table("GSE49712_HTSeq.txt.gz", header = TRUE) #explore # Each each row corresponds to a gene, and each column corresponds to a sample. dim(data_raw) head(data_raw) tail(data_raw) #Notice that the last five lines contain summary statistics and thus need to be removed prior to testing. data_clean <- data_raw[1:(nrow(data_raw) - 5), ] ### 2. Remove genes with low expression #create two groups group <- substr(colnames(data_clean), 1, 1) group d <- DGEList(counts = data_clean, group = group) dim(d) # 21711 x 10 head(cpm(d)) apply(d$counts, 2, sum) # total gene counts per sample keep <- rowSums(cpm(d)>100) >= 2 d <- d[keep,] dim(d) # 3605 x 10. #reset the library sizes after filtering. d$samples$lib.size <- colSums(d$counts) d$samples ### 3. Normalization d <- calcNormFactors(d) d ### 4. Data exploration cpm_log <- cpm(d, log = TRUE) #heatmap(cor(cpm_log)) #MDS plot plotMDS(d, method="bcv", col=as.numeric(d$samples$group)) ### 5. estimate the dispersion d <- estimateDisp(d) sqrt(d$common.dispersion) # biological coefficient of variation, 0.0822604 plotBCV(d) ### 6. exact test et <- exactTest(d) dim(et) results_edgeR <- topTags(et, n = nrow(data_clean), sort.by = "none") head(results_edgeR$table) #How many genes are differentially expressed at an FDR of less than 10%? sum(results_edgeR$table$FDR < .1) plotSmear(et, de.tags = rownames(results_edgeR)[results_edgeR$table$FDR < .1]) abline(h = c(-2, 2), col = "blue")
edgeR works on a table of integer read counts, with rows corresponding to genes and columns to independent libraries (RNA samples). The counts represent the total number of reads that map to each gene.
Each function in edgeR has its own online help page. For example, a description of the exactTest function can be obtained by typing ?exactTest or help(exactTest) at the R prompt.
We are going to use the data set generated by the Sequencing Quality Control (SEQC) project. The data set consists of two groups of data:
Group A is five technical replicates of the Stratagene Universal Human Reference RNA, which is a pool of ten human cell lines.
Group B is five technical replicates of Ambion's Human Brain Reference RNA, which is RNA samples pooled from multiple donors from different brain regions.
First, we get rid of genes which are not expressed or expressed at very low levels. Here I use a cutoff of 100 counts per million and only keep genes that have a cpm of 100 or greater for at least two samples. Counts per million are calculated with the edgeR function cpm() in R. At this step, we create a vector called "DataGroups" which labels each of the column. We then use this vector and the gene counts to create a DGEList, the object that edgeR uses for storing data from a differential gene expression experiment.
There are alternative ways to filter genes. Another simple method would be to choose a cutoff based on the median log log~2~-transformed counts per gene per million mapped reads (cpm) and remove all genes with a median log~2~ cpm below the cutoff.
When we take a look at the dimension of our data table (dim(d)), we see that this filtering step reduces the dataset from around 21711 genes to about 3605. For the filtered genes, there is very little power to detect differential expression, so little information is lost by filtering.
We reset the library size (lib.size) after filtering. EdgeR will use the lib.size at later steps to calculate the effective library size by multiplying the lib.size and the norm.factors together.
The calcNormFactors() function of edgeR normalizes for RNA composition by finding a set of scaling factors for the library sizes that minimize the log-fold changes between the samples for most genes. Since it is expected that only a fraction of the genes is differentially expressed, this normalization step allows for better prediction of differentially expressed genes. The default method for computing these scale factors with edge R uses a trimmed mean of M-values (TMM) between each pair of samples. After scaling our original library we obtain an effective library size. The effective library size replaces the original library size in all downsteam analyses. Without this step, the default value of the edge R norm factor is 1 for all samples.
Before we continue our DGE analysis, we generate a plot that shows the sample relations based on multidimensional scaling (MDS). MDS is a means of visualizing the level of similarity of samples. The basic premise is that we make a plot so samples which are similar are near to each other in the plot while samples that are dissimilar are far from each other. Here is an example.
As expected, we see that samples in group A and B separate out, since they come from very different cell populations.
The first major step in the analysis of DGE data is to estimate the dispersion parameter for each gene, a measure of the degree of inter-library variation for that tag. Estimating the common dispersion gives an idea of overall variability across the genome for this dataset. Here we assume that everything has the same common dispersion.
EdgeR refers to the dispersion estimate as the "biological coefficient of variation" (technical biases are also included in this estimate).
plotBCV() plots the tagwise biological coefficient of variation (square root of dispersions) against log2-CPM.
The biological coefficient of variation is lower than normally seen in human studies (~0.4) because the samples are technical replicates.
After estimating the dispersion, we use the function exactTest() to evaluate whether genes are expressed differentially or not. The function tests for differential expression between two classes using a method similar in idea to the Fisher's Exact Test. The test results for the n most significant tags can be displayed with the topTags() function. Benjamini and Hochberg's algorithm is used to control the false discovery rate (FDR).
The MA plot plots the log~2~ fold change on the y-axis versus the average log~2~ counts-per-million on the x-axis. Differentially expressed genes with an FDR less than 10% are shown in red, and the blue lines represent a four-fold change in expression.
As we would expect from prior knowledge about our samples and from the Multidimensional scaling results, a large fraction of the genes are differentially expressed between the two groups.
Schurch, N. J., Schofield, P., Gierliński, M., Cole, C., Sherstnev, A., Singh, V., … Barton, G. J. (2016). How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA, 22(6), 839–851. https://doi.org/10.1261/rna.053959.115 (table)
https://gist.github.com/jdblischak/11384914 (edgeR intro)