MCB112: Biological Data Analysis (Fall 2017)

homework 08:

Lestrade’s lost labels

You’re still sorting out the mess that Lestrade left behind when he quit the lab. In another set of experiments he was doing, it looks like he had isolated a mutant sand mouse in a screen, and characterized the mutation’s effect on overall gene expression in RNA-seq experiments. He did two sets of three replicates on wild type control mice, and one set of three replicates on his mutant sand mice. But unfortunately, he mislabeled the files and lost track of which was which. He’s still not responding to your emails, so you’re going to have to do some detective work.

Lestrade’s data and scripts

You’ve found Lestrade’s three data files, each of which contains mapped count data for three replicates each: w08-data.1, w08-data.2, and w08-data.3. Two of these files are from wild type samples, and one is from the mutant samples.

You’ve also found the R script he apparently used to do his analysis: analyze_GL.r. His analysis pipeline used the well-regarded edgeR package for differential gene expression analysis. His R script takes an input file (called mydata.tbl, imaginatively) with 6 samples, three wild type and three mutant. He must’ve put two of the data files together to do a differential analysis of 3 wt vs. 3 mutant samples, but his merged input file has been lost.

His notebook says he identified 2105 differentially expressed genes significant at P<0.05.

Your task is to reproduce his work, figure out what the missing labels on those files are, and see if you believe his result of 2105 significant differentially expressed genes.

install R, BioConductor, and edgeR

Differential gene expression analysis methods are almost invariably written in R and made available as BioConductor packages, including all methods considered to be the current best: edgeR, DESeq2, limma, EBSeq, and Sleuth. Of these, I chose edgeR semi-arbitrarily as our example to look deeper into.

This is an example of how it’s useful to be able to do biological data analysis in both Python and in R. A goal of this homework is to install and use some R code, to see it in action, though without really learning it.

First, install R, if you don’t have it on your system already. (Linux systems typically do. Type the command R to check.) If you are on OS/X, I recommend downloading either R-3.4.2.pkg if you’ve got OS/X 10.11 (El Capitan) or later, and R-3.3.3.pkg if you have OS/X 10.9 (Mavericks) or later. I’m running OS/X 10.10 Yosemite myself, so I got R-3.3.3.pkg.

Once you’ve done that, typing R will start the R command line environment; the command q() will quit, after it asks if you want to save your current R workspace:

% R

R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> q()
Save workspace image? [y/n/c]: n


Start R. At the R command line, use BioConductor to install its base BioBase packages, plus the edger package:

% R
> source("")
> biocLite("Biobase")
> biocLite("edgeR")
> q()

You can find more information at BioConductor’s pages for the edgeR package, and for BioBase.

You should be all set, but if not, see the hints section for more information about installing R and edgeR.

but we’re using Python

Part of the point of this homework is to learn a poor hacker’s way of working in two different languages, such as Python and R: communicating through files. We can have our Python script write out an R script and its inputs as files, run the R script (for example using Jupyter Notebook to run Rscript as an external command) to produce output file(s), then parse the output datafile back into our Python flow.

(Alternatively… we could just do our work in R! But Python’s plenty for us to tackle this semester.)

1. write a python function to run an external edgeR analysis

Write a python function that takes the name of an input counts file as an argument (and any other arguments you need), and returns the results of an edgeR analysis: gene names, log fold changes, log CPM, P-values, and FDRs (and any other data you find you want to return).

You can use the poor hacker’s style: have your python script write an R script to a file, run the R script externally with Rscript (see hints), and parse the resulting output file.

Use the same edgeR analysis steps that Lestrade’s analyze_GL.r script used.

2. reproduce Lestrade’s data, assign the missing labels

There are three possible combinations of Lestrade’s data files: (1,2), (1,3), or (2,3). Using your Python function for running his edgeR analysis, run all three analyses.

Which combination did he run to obtain his result of 2105 differentially expressed genes significant at P<0.05?

Which of the three files corresponds to the mutant sand mouse samples? Why?

3. Lestrade doesn’t understand p-values

Do you agree with Lestrade’s conclusion that 2105 genes are differentially expressed in that wt vs. mutant comparison? What did he fail to do?

Give a different conclusion of your own – what is a more appropriate statistical cutoff, and how many genes are called differentially expressed at your threshold?

4. Lestrade missed something else too

Lestrade’s analysis has a subtler problem. It’s missing an important step in the edgeR analysis pipeline, and it just happens (!) that this example exercises exactly the problem that that edgeR step is designed to deal with. Find it and fix it – you’ll need to add one step in the R analysis script – and rerun the analysis. Now how many genes do you think are differentially expressed? What was the problem? (see hints)

turning in your work

Email a Jupyter notebook page (a .ipynb file) to Please name your file <LastName><FirstName>_MCB112_<psetnumber>.ipynb; for example, mine would be EddySean_MCB112_08.ipynb.


   % brew install r

For other alternatives, see the R Project.

Make sure you have R installed and working before you install any R packages. You should be able to type R at the command line prompt and get into the R environment’s prompt; q() quits.

If you do try to install R and other R packages with anaconda and it doesn’t work (which is what happened to me), you can uninstall with:

  % conda uninstall r-base
   Rscript analyze_GL.r

and it will create an output file myresult.out.

This is also a good way to make sure that you have R and edgeR installed and working correctly.

   join -t $'\t' w08-data.1 w08-data.2 > merged.12

is an incantation that takes the two three-sample files w08-data.1 and w08-data.2 as input, and outputs one merged line for each corresponding line pair in the two files: the first field (the gene name, the join field), followed by the data columns of file1, followed by the data columns of file2. The -t $'\t' option says to output the merged data as tab-delimited data; the funny-looking $'\t' is a trick called ANSI quoting.

Of course if that’s too weird looking for you, you can always whip up some Python to read the two input files and output the merged file you want.