# the adventure of the missing phenotype

The sand mouse gene Coriander is supposed to encode an important transcription factor. However, in a new paper from Lestrade et al. in the Sand Mouse Journal, an RNA-seq analysis of wild-type versus Coriander mutant sand mice shows that loss of Coriander function results in no significant effects on the mean expression of any sand mouse gene transcript (other than Coriander itself, of course).

For example, Figure 2 from Lestrade's paper shows mean expression levels and standard deviations for four sand mouse genes previously believed to be direct targets of the Coriander transcription factor (Arugula, Cayenne, Juniper, and Yam), showing no expression differences. Lestrade also showed that the Coriander mutant is indeed a knockout, null for RNA expression. The figure is based on RNA-seq measurements on ten biological replicates each (wow!), in wild type versus mutant sand mice.

You decide to dig a little deeper into Lestrade's data. Fortunately, Lestrade's lab, though somewhat bumbling, does seem to practice reproducible computational science, because they've made their data available in their Supplementary Data Table 1: w03-data.tbl.

Download the data file, and have a look. It's a column-justified, whitespace-delimited file. Looks like the table starts with some comment lines and commented column headers, followed by one line of RNA-seq expression levels (in units of TPM, transcripts per million) per gene, with 21 fields per line: the gene name, 10 replicates for wild type, and 10 replicates for mutant.
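If you want to pull the table into Pandas, something like the sketch below may work. It assumes comment lines start with `#`, and the column names (`wt_1` ... `mut_10`) are made up for illustration; the real headers are in the file's commented header line. Malformed lines may make `read_csv` raise an error or silently fill in missing values, so this is probably safest on a file you have already cleaned.

```python
import io
import pandas as pd

def read_table(source):
    """Read a whitespace-delimited expression table into a DataFrame."""
    # Hypothetical column names: gene, then 10 wild type, then 10 mutant replicates.
    names = (["gene"]
             + [f"wt_{i}" for i in range(1, 11)]
             + [f"mut_{i}" for i in range(1, 11)])
    # sep=r"\s+" splits on runs of whitespace; comment="#" skips commented lines
    # (assumption: comments in the file start with '#').
    return pd.read_csv(source, sep=r"\s+", comment="#", header=None, names=names)

# Tiny in-memory stand-in for the real file, just to show the call.
rows = ["# a comment line",
        "# gene  wt ...  mut ...",
        "Arugula  " + "  ".join(["1.5"] * 20),
        "Cayenne  " + "  ".join(["2.5"] * 20)]
df = read_table(io.StringIO("\n".join(rows) + "\n"))
print(df.shape)
```

For the real data you would pass the path instead, e.g. `read_table("w03-data.tbl")`.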

### 1. downsample the data set by reservoir sampling

Write a script that takes a random sample of 10 data (non-comment) lines, using the reservoir sampling algorithm.

(Obviously Pandas can take a random sample easily, with DataFrame.sample(). You can use Pandas to read the table if you want, but implement your own reservoir sampling algorithm; don't use DataFrame.sample().)
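As a starting point, here is a minimal sketch of the classic reservoir sampling scheme (keep the first k items, then replace a random slot with decreasing probability). The function names and the assumption that comment lines start with `#` are mine, not part of the data file's specification.

```python
import random

def data_lines(lines):
    """Yield non-blank lines that are not comments (assumed to start with '#')."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield line.rstrip("\n")

def reservoir_sample(items, k=10, rng=random):
    """Return k items drawn uniformly from an iterable of unknown length."""
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)       # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)     # uniform integer in 0..i, inclusive
            if j < k:
                sample[j] = item      # item i is kept with probability k/(i+1)
    return sample
```

With the real file, usage might look like `reservoir_sample(data_lines(open("w03-data.tbl")), k=10)`. The point of the algorithm is that it makes one pass and never needs to know how many lines there are in advance.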

### 2. look at outliers; validate the formatting

Write a script that checks over the whole file and:

• finds and prints the data line that contains the maximum expression level;
• finds and prints the data line that contains the minimum expression level;
• finds lines that don't have the right format. ("Right" is deliberately left vague. Validation is a creative art!)
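One way to approach the three checks in a single pass is sketched below. It treats "right format" as exactly 21 whitespace-separated fields with 20 numeric values, which is only one possible definition; the function names are hypothetical, and you may well want extra checks of your own.

```python
def parse_line(line, n_fields=21):
    """Split one data line into (gene, values); raise ValueError if malformed."""
    fields = line.split()
    if len(fields) != n_fields:
        raise ValueError(f"expected {n_fields} fields, found {len(fields)}")
    values = [float(x) for x in fields[1:]]   # float() raises ValueError on non-numbers
    return fields[0], values

def scan(lines):
    """Return (max_line, min_line, bad) where bad is a list of (line, reason)."""
    max_line = min_line = None
    max_val, min_val = float("-inf"), float("inf")
    bad = []
    for line in lines:
        try:
            _, values = parse_line(line)
        except ValueError as err:
            bad.append((line, str(err)))       # remember why the line was rejected
            continue
        if max(values) > max_val:
            max_val, max_line = max(values), line
        if min(values) < min_val:
            min_val, min_line = min(values), line
    return max_line, min_line, bad
```

Lines that fail to parse are skipped when looking for the extremes, so a single bad value doesn't hide the real maximum or minimum.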

### 3. clean the data

Write a script that removes the data lines that have the problems you just detected.

(Hint: there are two problems in the data file, affecting 59 data lines. Your cleaning should leave you with 19,972 data lines, one per sand mouse gene, not counting comment/header lines.)
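A cleaning script can be little more than a filter around whatever validity test you settled on. The sketch below uses an example test (21 fields, numeric, non-negative TPM values); those criteria are assumptions for illustration and are not necessarily the two problems in this file. Comment lines are assumed to start with `#`.

```python
def is_valid(line, n_fields=21):
    """Example criterion: right field count, numeric values, no negative TPMs."""
    fields = line.split()
    if len(fields) != n_fields:
        return False
    try:
        values = [float(x) for x in fields[1:]]
    except ValueError:
        return False
    return all(v >= 0 for v in values)   # also rejects NaN, since nan >= 0 is False

def clean(lines, keep=is_valid):
    """Split lines into (kept, dropped); comment and blank lines are always kept."""
    kept, dropped = [], []
    for line in lines:
        if line.startswith("#") or not line.strip() or keep(line):
            kept.append(line)
        else:
            dropped.append(line)
    return kept, dropped
```

After writing `kept` to a new file with `writelines`, counting the non-comment lines in it gives you a quick check against the number in the hint.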

### 4. tidy the data

Write a script that converts the data table to a new file in tidy data format.

Both Pandas and Seaborn (which we're about to use next) are happiest with tidy data.

### 5. visualize the data

Write a script that reads your tidy data file and visualizes the distribution of the raw data points using beeswarm plots (also known as swarm plots) for any list of chosen sand mouse genes. That means 20 points per gene: 10 biological replicates each for wild type and mutant. Use it to display a plot of 10 randomly sampled genes. The Seaborn package has a good swarm plot function... which is why we're using Seaborn.

Also have a look at Seaborn's catplot function (i.e. a categorical plot, which they used to call a factor plot), which gives you the ability to plot lots of genes in a "facet grid" in the same figure.

Instead of a swarm plot, another way to do it (which I might actually prefer, now that I've been playing with it for a while) is a Seaborn strip plot with jitter=True.
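To make the swarm versus strip choice concrete, here is a rough sketch of a plotting helper. It assumes a long-format table with columns named `gene`, `genotype`, and `tpm`; those names, and the function itself, are hypothetical, so adapt them to whatever your tidy file uses.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_genes(tidy, genes, kind="swarm"):
    """Plot every replicate for the chosen genes, wild type beside mutant."""
    subset = tidy[tidy["gene"].isin(genes)]
    fig, ax = plt.subplots(figsize=(1.2 * len(genes) + 2, 4))
    if kind == "swarm":
        plot, extra = sns.swarmplot, {}
    else:
        plot, extra = sns.stripplot, {"jitter": True}   # the strip plot alternative
    # dodge=True puts the two genotypes side by side within each gene
    plot(data=subset, x="gene", y="tpm", hue="genotype",
         dodge=True, order=list(genes), ax=ax, **extra)
    ax.set_ylabel("expression (TPM)")
    return ax
```

In a notebook you would call something like `plot_genes(tidy, chosen_genes)` and then `plt.show()`; passing `kind="strip"` switches to the jittered strip plot.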

(Helpful side note: we're only introducing you to data plotting this week. Learning Pandas in some depth is the bigger goal. If you aren't familiar with Matplotlib or Seaborn for plotting, that's fine -- don't start reading their manuals or tutorials yet. Instead, skip to the "massive hint" below. All you're really going to have to do here is take a working Seaborn script and figure out how to get it to plot what you want.)

What's going on in the Coriander mutant?

Email a Jupyter notebook page (a .ipynb file) to mcb112.teachingfellow@gmail.com. Please name your file <LastName><FirstName>_MCB112_<psetnumber>.ipynb; for example, mine would be EddySean_MCB112_03.ipynb.

• There are two reasonable ways to make the 'tidy' data set, as we discussed in lecture, with different advantages and disadvantages. One form ('wide'), with each individual gene as a variable, is only mostly tidy, but makes more sense for many kinds of analysis; the other form ('long'), with gene_name as a variable, is fully tidy in Wickham's sense, and is easier to manipulate and plot in a Seaborn catplot, at least for this exercise. Hint: check out the Pandas melt method, which converts any number of columns (measured variables, such as 20,000 columns labeled with gene names) into a single variable in one column ("gene", for example).
• The script with the massive plotting hint also shows an example of using the Pandas melt method to convert from "wide" to "long" tidy format for the data.
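To see what melt does, here is a toy example going from the 'wide' form (one column per gene) to the 'long' form (one `gene` column). The column names `genotype`, `replicate`, and `tpm`, and the numbers, are invented for illustration.

```python
import pandas as pd

# Toy 'wide' table: one row per sample, one column per gene.
wide = pd.DataFrame({
    "genotype":  ["wt", "wt", "mut", "mut"],
    "replicate": [1, 2, 1, 2],
    "Arugula":   [10.0, 11.0, 12.0, 13.0],
    "Cayenne":   [20.0, 21.0, 22.0, 23.0],
})

# Every column not listed in id_vars is treated as a measured variable:
# its name goes into the 'gene' column and its value into the 'tpm' column.
long = wide.melt(id_vars=["genotype", "replicate"],
                 var_name="gene", value_name="tpm")
print(long)
```

The result has one row per (sample, gene) pair, so 4 samples by 2 genes gives 8 rows here; with the real data, 20 samples by roughly 20,000 genes gives a much longer table.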