MCB112: Biological Data Analysis (Fall 2016)


homework 01:

the case of the dead sand mouse

A new paper from Watson et al. describes the “thanatotranscriptome” of the sand mouse. They find that a large number of genes in the prefrontal cortex of the sand mouse are differentially upregulated… after the mouse dies.

the experiment

The experiment they did was to instantly kill the mouse at t=0, and then at t=0, 12, 24, 48, and 96 hours, dissect out the prefrontal cortex, prepare poly-A+ mRNA, and do RNA-seq. An example of the results is shown in Figure 1 from Watson et al.: carrot goes down after death, KDELR3 comes up at 24hrs then goes down, ARVCF peaks at 48h, and APOBR is still coming up at 96h:

Figure1

In their discussion, the authors say that their data provide evidence for an ancient program of cortical gene expression that causes the sand mouse’s life to flash before its eyes (very slowly).

You suspect an artifact of some kind, and you decide to look into the result. After carefully reading the paper, it looks like there aren’t any obvious problems with the RNA-seq experiment itself. Each data point is an average over several dead sand mice, and the variation seems small, so it’s not that this is just experimental noise. The dissections seem carefully done, so it’s not that they mistakenly collected anatomically different chunks of brain (which could easily give you gene expression differences). They also did some controls to make sure that the relative proportions of different cell types in the dissected tissue hadn’t changed much (a common problem in differential RNA-seq analysis of tissues composed of mixtures of cell types: you measure a population average, so if cell type composition changes, it can look like gene expression changes.) Everything suggests to you that the expression level (TPM) calls are reproducible and robust, so if there’s a problem it’s in the interpretation of the expression levels.

the data

The expression data for the paper are available in their Supplementary Data Table 1: Watson_SuppTable1.

It happens that you’ve also just been reading another paper, from Moriarty et al., who have systematically measured mRNA synthesis rates (in mRNA/hr) and mRNA halflife (in hr), also in the prefrontal cortex, for every gene in the sand mouse. Those data are available in their Supplementary Data Table 2: Moriarty_SuppTable2.

Download both data files, which you’ll need for the exercises below.

1. check that the gene names match

In principle, the gene names in the two data files ought to match, but (sigh) you know that all kinds of things can go wrong in practice: people use different names for the same gene, spell them with different capitalization or punctuation or added spaces, and so on.

Write a Python script to compare the gene names in the two data files. Output the names that appear in Watson_SuppTable1 but not Moriarty_SuppTable2, if any.

You’re especially suspicious of the Watson et al. data table because they mention in their paper that they exported the supplementary methods tables from Microsoft Excel, and you’ve recently run across people on Twitter discuss not just one but two terrifying papers about how Excel has systematically corrupted the supplementary data files in about 20% of published articles.

If there’s a difference - why?

2. explore the data

Write Python code to:

(And then you might want to explore the data in other ways too, on your own, as you’re doing the next part.)

3. figure out what happened

This is partly an exercise in manipulating data files line-by-line in Python, but it’s also designed to give you a Very Big Clue as to what happened in the dead sand mouse experiment… if you haven’t already guessed.

Write a Python script that merges the two data files, line by line, merging them on gene name. That is, for each line in file 1 for gene X, find the corresponding line for gene X in file 2; we’re going to write a single output file with one line per gene. The genes are in different orders in the files, so this merge isn’t entirely trivial. For any gene name X isn’t found in both files (cough cough Excel corruption cough cough) just skip it. For each gene name that is found in both files, output one whitespace-delimited, column-justified data line consisting of 7 fields per line:

Save your merged dataset to a file.

Merging data sets, even crudely like this, is often useful in exploring data and looking for problems, outliers, and correlations. Explore the data, however you want, for example by looking at the genes with the highest expression ratio t=96/t=0. What do you think is the real explanation for what happened in the dead sand mouse experiment?

turning in your work

Email Tim a Jupyter notebook page (a .ipynb file) that implements the three scripting steps, with your conclusions.


notes


hints

    # Remove comment lines from a data file:
    % grep -v "^#" Moriarty_SuppTable2
   
    # Sort a data file on the numbers in field #3, ignoring comments:
    % sort -n -k3 Moriarty_SuppTable2
   
    # Combine the two: ignore comments and sort the rest:
    % grep -v "^#" Moriarty_SuppTable2 | sort -n -k3
   
    # Sort non-comment lines in alphabetical order
    % grep -v "^#" Moriarty_SuppTable2 | sort 

    # Look at the first 10 lines of a data file:
    % head -10 Moriarty_SuppTable2
   
    # ...or the last 10:
    % tail -10 Moriarty_SuppTable2
   
    # Permute the order of the lines of a file:
    % shuf Moriarty_SuppTable2
   
    # Take a random subset of 10 lines from a file:
    % shuf Moriarty_SuppTable2 | head -10
   
    # Get a sorted list of names from each file into two tmp files;
    # look for names unique to file1 or file2. (i.e. solve problem 1)
    % grep -v "^#" Watson_SuppTable1   | awk '{print $1}' | sort > foo
    % grep -v "^#" Moriarty_SuppTable2 | awk '{print $1}' | sort > baz
    % comm -3 foo baz
   
    # Output five genes with highest mRNA synthesis rate (solve problem 2a)
    % grep -v "^#" Moriarty_SuppTable2 | sort -n -r -k2 | head -5
   
    # Output five genes with lowest mRNA halflife (solve problem 2b)
    % grep -v "^#" Moriarty_SuppTable2 | sort -n -k3 | head -5

    # Output the highest five tpm[t=96]/tpm[t=0] ratios
    % grep -v "^#" Watson_SuppTable1   | awk '{print $1, $6/$2}' | sort -nr -k2 | head -5

Examples of frequently used UNIX power tools on the command line:

On OS/X, shuf may not be installed for you, and other programs may exist but be somewhat braindead. Install GNU coreutils. For example, brew install coreutils, if you use Homebrew as a package manager.

To get brief help on using any of these, use man: i.e., man sort.