MCB112: Biological Data Analysis (Fall 2017)
 reading for this week
 on data science
 a data paranoia toolkit
 reservoir sampling algorithm
 displaying data in graphs and tables
 tidy data
 practicalities: a lightning intro to Pandas
week 03:
data exploration and visualization
reading for this week

David Donoho, 50 years of data science, unpublished manuscript (2015)

Hadley Wickham, Tidy data, Journal of Statistical Software (2014)

Tracey Weissgerber et al., Beyond bar and line graphs: time for a new data presentation paradigm, PLOS Biology (2015)
on data science
The term “data science” appears to have been introduced in its current sense by W. S. Cleveland in 2001 (Data science: an action plan for expanding the technical areas of the field of statistics, International Statistical Review, 2001) David Donoho’s 50 years of data science is an excellent summary of what “data science” means, especially from the perspective of how academic disciplines are organized, and how they change over time.
At first glance (and subsequent glances too) the term is a bit silly, a tautology. All sciences do data analysis. Even if you’re talking ‘big data’, Pierre Laplace was doing large scale data analysis on Paris and London census records in 1781 (which we’ll discuss next week). As the statistician Karl Broman has asked, when a physicist does a mathematical analysis, does she say she’s doing “number science”?
Donoho makes a pretty good case that “data science” is a rebranding exercise, and his piece gives interesting insight into disciplinary fad and fashion. But since we biologists are still in in the midst of “systems biology”, “integrative biology”, and “quantitative biology” (and a whole lot of omes) we can’t be throwing too many stones in any glass houses of branding “new” subdisciplines.
Beyond the politics of it, there must be something interesting and real about “data science”, or it wouldn’t be so high in our collective consciousness. Donoho makes a case that data science is a rebranding of statistics by outsiders, and that statistics departments opened the door to this schism by downplaying and avoiding the difficult, messy practicalities of realworld data analyses. Donoho repeats a famous quote that 80% of the effort in data analysis is just being able to deal with the messiness of data at all: reformatting, exploring, cleaning, visualizing. Computer scientists, in particular, have taken up those practical challenges by necessity and opportunity, and reinvigorated an old area. Even Donoho turns up his nose a bit at computational technologies that seem to him like mere “coping skills for dealing with organizational artifacts of largescale cluster computing”, but being able to compute effectively on terabytes or petabytes of data does present serious, expensive, and interesting practical challenges.
In biology, that adage about 80% of the effort in data analysis might even be an understatement. Biology is particularly ready for more emphasis on the key ideas of data science: the practicalities of data exploration, data cleaning, and data visualization, and the coding skills to be able to do it. This week, we’re going to focus on some of those practicalities.
a data paranoia toolkit
The first thing you do with a new data file is look at it. But, you say, come on, my file is too big to read the whole thing. Here’s a litany of basic techniques for dealing with big data files, aiming to get big unmanageable data sets down to little data sets that human, intuitive biological superpowers can work on:
 slice: look at the start and the end
 select: look at known cases
 sample: look at a random sample
 sort: look for outliers
 summarize: look for more outliers
 validate: the whole file, once you think you understand it.
We’ll focus here on simple lineoriented data files, but the same ideas apply to more complex formats with multiple lines per record.
look at the start, end
The best place to start is at the beginning. The start of the file tends to include helpful information like column headers, comments, and other documentation. If you’re going to parse the file, this might help – plus your parser might have to know to skip or specially deal with header material.
The easiest way to look at the start of a text file is on the command line:
% head 10 w03data.tbl
shows you the first 10 lines of the file. (Obviously, you can use this to get however many you want to read.)
Worth checking the tail end of the file too. It might end in a special kind of line that you’d need to parse. A corrupted file (in a truncated download) might end abruptly in an obvious way. Some files don’t end in a newline character, and that might screw up your parser if you aren’t expecting it. Again, the command line way is:
% tail 20 w03data.tbl
which shows you the last 20 lines.
Of course, if the file is small enough that you could just look at the
whole thing, for example using more
on the command line:
% more w03data.tbl
select known cases
The superpower of a biologist looking at a fresh new data set: ‘let’s look at my favorite gene… wait, that’s not right, oh good lord.’
Look at a few cases where you know the right answer. For instance, in a cell type specific RNAseq data set, look at a transcript that’s supposed to be on in that cell type, and look at one that’s supposed to be off. If you’re going to worry about crosscontamination in the data, look at highly expressed genes in other nearby cell types (the most likely crosscontaminants). If you’re going to worry about sensitivity of detecting rarer transcripts, find a lowexpressed transcript to look at.
For pulling specific lines out of a file, grep
is a goto. This
will print all lines in the file w03data.tbl
that contain the
string coriander
:
% grep coriander w03data.tbl
take a random sample
Many data files are arranged in a nonrandom order, which means that just looking at the start and the end might not reveal some horrific systematic problem. Conversely, looking at the head and tail can make it look like the data file’s worse than it actually is: genome sequence files, ordered by chromosome coordinate, can look terrible when they start and end with great swaths of N’s, placeholding for the difficulttosequence telomere repeats.
For small to medium sized files, if you don’t care about efficiency,
there are plenty of ways to randomly sample N lines, One quick
incantation (when you have a modern GNUish sort
program installed)
is to sort the file randomly (the R
option) and take the first N
lines:
% sort R w03data.tbl  head 10
For large files, where efficiency matters, we’re going to introduce the reservoir sampling algorithm in just a moment.
There’s a great Feynman story about the power of random sampling. Feynman was called in to consult on the design of a big nuclear plant, looking over engineering blueprints that he didn’t understand at all, surrounded by a crowd of actual engineers. Just trying to start somewhere and understand something small, he points randomly to a symbol and says “what happens if that valve gets stuck?”. Not even sure that he’s pointing to a valve! The engineers look at it, and then at each other, and say “well, if that valve gets stuck… uh… you’re absolutely right, sir” and they pick up the blueprints and walk out, thinking he’s a genius.
sort to find outliers
In a tabular file, you can sort on each column to look for weird
values and outliers (and in other data formats, you can script to
generate tabular summaries that you can study this way too). For
example, this will sort the data table on the values in the third
column (k3
), numerically (n
), and print the first ten:
% sort n k3 w03data.tbl  head
r
reverses the order of the sort, to look at the last ten.
sort
has many other useful options too.
Bad values tend to sort to the edges, making them easier to spot.
summarize to find more outliers
You can also find ways to put all the values in a column through some
sort of summarization. Here your goto is the venerable awk
, which
makes it easy to pull single fields out of the lines of tabular files:
awk '{print $3}' myfile
, the world’s simplest script, prints the
third field on every one of myfile
.
I have a little Perl script of my own called avg
that calculates a
whole bunch of statistics (mean, std.dev., range…) on any input that
consists of one number per line, and I’ll do something like:
% awk '{print $3}' w03data.tbl  avg
You don’t have my avg
script, but you can write your own in Python.
If the values in a column are categorical (in a defined vocabulary), then you can summarize all the values that are present in the column:
% awk '{print $2}' myfile  sort  uniq c
where the uniq
utility collapses identical adjacent lines into one,
and the c
option` says to also print the count, how many times each
value was seen.
For something sequential, like DNA or protein sequences, you can whip up a quick script to count character composition; you may be surprised by the nonDNA characters that many DNA sequence files contain.
Another trick is to count the length of things in a data record (like sequence lengths), sort or summarize on that, and look for unreasonable outliers.
validate
By now you have a good mental picture of what’s supposed to be in the file. The truly compulsive, which of course includes all careful scientists among us, might now write a validation script: a script that looks automatically over the whole file, making sure that each field is what it’s supposed to be (number, string; length, range; whatever).
Of course, you’re not going to do all of these paranoid steps on every data file you encounter. The point is that you can; it only takes basic scripting and command line skills to do some very powerful and systematic things at scale, enabling you (as a biologist) to wade straight into a large dataset and wrap your mind around it.
reservoir sampling algorithm
Suppose you’re trying to take a random sample of a really large data set. Suppose the data set is so large that it doesn’t fit in your computer’s memory. Suppose you don’t even know how large it is; you’re going to read one element at a time in a very long stream, until you run out of elements. Is it possible to sample elements from total elements, where $n$ is unknown (until the end), such that each element is sampled uniformly with probability ?
Algorithm: reservoir sampling

allocate a reservoir to hold up to elements, with “slots” .

put the first elements in reservoir slots .

for each subsequent element , starting with , up to the last one :
 choose a random number
 if , swap element into reservoir slot , discarding the element that was there. (That is, each element k initially gets put in the reservoir with probability ).
Our first complete algorithm in the course, I think! One of the things we want you to learn is how to read a pseudocode algorithm and implement it in your own Python code. You’ll find that one detail you always need to keep track of (obsessively) is that pseudocode algorithms often count with indexes starting from 1, but in Python, lists are indexed starting from 0: “offbyone” errors are easy to make.
Reservoir sampling is efficient in both time and memory. It becomes the preferred sampling algorithm when files get large. Writing your own script gives you the ability to ignore nondata parts of the file (here, comment lines) and to parse the file into data records any way you need to (not just one line at a time). Knowing how to write a quick reservoir sampling script gives you the superpower of being able to randomly downsample any data file, no matter how large the file, and no matter the format of the data.
proof that it works
We prove it in two cases. First consider the case of element , which gets into the original sample automatically. At each iteration to , you have a chance to knock it out of the reservoir, and the probability that it survives at iteration is . It has to survive every round, so:
which is what we want: the probability that i is sampled is . Obviously the key thing here is how every numerator neatly cancels its preceding denominator.
Now consider the case of an element . The probability that it gets into the reservoir when its turn comes is . Then it has to survive getting knocked out by any subsequent element , so
The same cancellation, but from a different initial condition.
Two points to make about this proof:

if you don’t remember anything about the proof, you’ll probably have to look up the reservoir sampling algorithm every time you implement it, because it’s easy to get muddled up on offbyone issues. But if you even remember the outline of the proof, you’ll remember that the final sample has to be sampled with probability , and that will remind you that the recursive rule is to sample each element with probability .

I looked this up on Google to double check that I had the proof right, and the first sample pages I looked at were complicated, confusing, and/or wrong.
computational complexity  big O notation
How many steps does the reservoir sampling algorithm take? It does a couple of operations for each element (sample a random number; compare it to ; maybe swap element into the reservoir). If we sample from a data set that’s 10x larger, it will take us 10x more operations. We say that the algorithm is time, “order of n in time”.
How much memory does it take? We only need to hold the reservoir’s elements in memory, plus one current element . We say the algorithm is O(m) memory, “order of m in memory”.
We saw that another way to take a random sample is to sort the elements randomly and take the first of them. Inmemory sorting algorithms are typically time and memory. ( terms often arise in algorithms that are doing some sort of binary divideandconquer, like recursively sorting elements to the left or right of some middle point.) Which algorithm should we prefer?
The bigO of these algorithms tells us that at some point, for large enough , reservoir sampling is the preferred algorithm in both time and memory, because is better than time, and is better than memory.
But: BigO notation is only about relative scaling as the problem size increases; it ignores constant factors. If algorithm A happens to require 1000000 operations per element and algorithm B does 1 operation per element, algorithm A is actually faster in practice until n gets really big.
It can be difficult to predict how long an individual operation takes, especially on a modern computer where CPU operations are blazingly fast and random memory accesses are like sending a pigeon to the moon. Algorithms can be analyzed theoretically to determine their bigO, but implementations must be analyzed empirically to determine how long they actually take, on a given computer.
BigO notation has a more formal definition in computer science. The most important thing to know about the formalism is that bigO reflects worst case performance, an asymptotic upper bound: “there exists a constant c such that the time required is for all n”. There are also ways to talk about best case behavior and average case behavior . Some algorithms, including reservoir sampling always take the same number of steps. A good example where it makes a difference is a popular sorting algorithm called quicksort, which almost always takes time but is worst case . Nonetheless, you will see people (including me) using notation for average case behavior sometimes. Technically this is sloppy and colloquial; they’re literally meaning “this calculation is going to take on the order of time”, rather than meaning to state a formal upper bound.
displaying data in graphs and tables
switching gears…
You’re reading the Weissgerber et al. paper this week because it makes important points about data presentation in graphs and tables. In particular:
Summary statistics are only meaningful when there are enough data to summarize.
An exemplary graph: Figure 1b from Ding et al. (Nature 536:331, 2016) on courtship song variation in Drosophila species, showing raw data points in addition to summary statistics. They summarize the data by mean and standard deviation, not standard error, because the first thing you want to know is how much variability you're seeing in samples, not how much variability is expected in the estimate of the mean. ('Open image in new tab' to embiggen.)
It is extremely common to see bar graphs in biology, displaying means and standard errors. But we’ve often only done a small number of replicates (maybe n = 310), and we could just as easily just show the raw data points: why don’t we?
Plotting mean and standard error implicitly assumes that the data can be summarized with just those two numbers. You’re basically assuming that your data were normally distributed, and maybe they weren’t. Looking at the raw data points shows you the actual distribution  maybe it was multimodal, skewed, had outliers, or whatever.
The other thing is the business of plotting the standard error (i.e. the expected error of the mean), not the sample standard deviation. If we’re visualizing the data, we’re first primarily interested in the variability of the data: if we do the experiment again, if we take another independent sample, where is it likely to be? That’s the sample variance, and the square root of that is the standard deviation. Maybe once we’re digging into our analysis a little deeper – we’re convinced that the data are distributed normally, and we want to parameterize where the mean is likely to be – then we might start worrying about how good our estimated mean is likely to be (the standard error) but that’s usually not what someone’s graph of their samples is aiming to represent, or implying that it represents.
swarm plots and jittered strip plots
An admirable, exemplary data figure is shown to the right: Figure 1B from [Ding et al, Nature 2016]. The raw data points themselves are plotted, alongside a summary of the mean and standard deviation (not standard error).
This business of plotting your actual raw data has become
trendy. There’s even a Twitter hashtag, #barbarplots. But packages for
graphing data are still catching up. (I ran across a member of the
Microsoft Excel development team replying to a community request for
this by saying “huh, why would anyone want to do that?”) A buzzword is
a swarm plot or a beeswarm plot. We’ll use the python Seaborn
module in this week’s homework because it has the ability to make
swarm plots. You can also do an almost identical thing with a strip
plot by adding random jitter (jitter=True
, in Seaborn). And you
can emulate all this easily enough in a scatter plot by plotting
your data points along one axis while adding a small amount of random
jitter in the other.
two other things we all learn in high school and forget

Significant figures: state your numbers to a nonobscene precision. Just because a computer outputs a floating point real number as ‘2.745970945707204’ doesn’t mean you need to report it that way; in biological data we’ve often only got numbers nailed down to a digit or two. ‘2.7’ is easier to think about anyway, and remember, Fermi estimation.

Always label your axes: seriously, always label your axes. Few people will stop you in a talk and ask you “um, what’s on the xaxis?” because it’s hard to ask that without it sounding like “um, do you even know how to make graphs?”. But that’s what they’ll be thinking.
tidy data
Hadley Wickham’s influential paper Tidy data is a systematic way of organizing tabular data. Wickham distinguishes ‘tidy data’ from everything else (‘messy data’). He observes that tabular data consist of values, variables, and observations.
A variable is something that can take on different values, either because we set it as a condition of our experiment (a fixed variable), or because we’re measuring it (a measured variable). For example, a fixed variable “gene” (which takes on 20,000 different values for different gene names), or a fixed variable “sex” (‘M’ or ‘F’), or a measured variable “TPM” (a number for an estimated expression level). A value is the actual number, string, or whatever that’s assigned to a variable.
An observation is one measurement or experiment, done under some certain conditions that we might be varying, resulting in observed values for one or more things we measured in that same experiment.
In a tidy data set,
 each variable forms a column
 each observation forms a row
 each type of observational unit forms a table
Wickham says it’s “usually easy to figure out what are observations and what are variables”, but at least for RNAseq data I’m not so sure I agree. I don’t think it’s entirely clear whether it’s better to treat genes as observations or variables. That is, it seems to me that this is tidy:
gene  mouse  cell_type  sex  TPM 

carrot  1  neuron  M  10.5 
carrot  2  glia  F  9.8 
kohlrabi  1  neuron  M  53.4 
kohlrabi  2  glia  F  62.1 
cabbage  1  neuron  M  0.4 
cabbage  2  glia  F  0.3 
but so is this:
mouse  cell_type  sex  carrot  kohlrabi  cabbage 

1  neuron  M  10.5  53.4  0.4 
2  glia  F  9.8  62.1  0.3 
Wickham says:
it is easier to describe functional relationships between variables (e.g., z is a linear combination of x and y, density is the ratio of weight to volume) than between rows, and it is easier to make comparisons between groups of observations (e.g., average of group a vs. average of group b) than between groups of columns.
which seems to argue most strongly for the second version above, with genes as columns. An “observation” is an RNAseq experiment on one sample. I get measurements for 20,000 genes in that experiment, and they’re linked. I might want to study how one or more genes covary with others across samples (either for real, or because of some artifact like a batch effect). But that’s a really wide table, 20,000 columns, and you’ll have trouble looking at it; many text editors will even choke on lines that long.
We’ll see in the homework that it’ll be easier to deal with using Seaborn to visualize some example data if we use the first version instead. We can transform the second form into the first by melting the data table, and Pandas gives us a method to automate that.
Clearly what I don’t want in a tidy RNAseq data set, though, is a bunch of fixedvariable values, representing my experimental conditions, masquerading as column headers, like the ‘wt’ vs ‘mut’ and ‘M’ vs. ‘F’ columns in this week’s homework. That’s #1 in Wickham’s list of most common problems in messy data:
 column headers are values, not variable names
 multiple variables in one column
 variables stored in both rows and columns
 multiple types of observational units in same table
 single observational unit, multiple tables
and he names the procedures for fixing these problems:
 melt: turn columns to rows, convert column header values to values of variables
 split: convert a column to 2+ columns
 casting: merge rows and create new columns
 normalization: getting one observational unit per table
The paper gives nice examples of each.
practicalities: a lightning intro to Pandas
One of our goals this week is to introduce you to Pandas (if you haven’t already been using it). Among other things, Pandas gives you powerful abilities to manipulate 2D tables of data.
See also:
 online Pandas documentation
 The book Python for Data Analysis, by Wes McKinney, the author of Pandas. The 1st edition is for Python 2. The 2nd edition, for Python 3, is on preorder, and will come out in October.
The usual import aliases Pandas to pd
:
import pandas as pd
creating a pandas DataFrame
A Pandas DataFrame is a 2D table with rows and columns. The rows and columns can have names, or they can just be numbered , .
You can create a DataFrame from a 2D array (a Python list of lists, or a numpy array), with default numbering of rows and columns, like so:
D = [[ 12.0, 16.0, 4.0, 8.0 ],
[ 7.0, 21.0, 14.0, 28.0 ],
[ 5.0, 25.0, 20.0, 10.0 ]]
df = pd.DataFrame(D)
Or you can create one from a dict of lists, where the dict keys are
the column names, and the list for each key is a column. Here I’m also
naming the rows sample1
etc.; Pandas calls the row names an index:
D = { 'tamarind': [ 12.0, 16.0, 4.0, 8.0 ],
'caraway': [ 7.0, 21.0, 14.0, 28.0],
'kohlrabi': [ 5.0, 25.0, 20.0, 10.0] }
df = pd.DataFrame(D, index=['sample1', 'sample2', 'sample3', 'sample4'])
Try these out in a Jupyter notebook page  type df
on its own and
Jupyter will show you the contents of the object you’ve created.
reading tabular data from files
The pd.read_table()
function parses data in a file:
df = pd.read_table('file.tbl')
By default, read_table()
assumes tabdelimited fields. You will also
encounter commadelimited and whitespacedelimited files commonly. You
specify an alternative singlecharacter field separator with a
sep='<c>'
argument, and you can specific whitespacedelimited with a
delim_whitespace
argument.
df = pd.read_table('file.tbl', sep=',') # commadelimited file. Same as pd.read_csv()
df = pd.read_table('file.tbl', delim_whitespace) # whitespacedelimited file
Column names: by default, the first row is assumed to be a header containing the column names, but you can customize:
df = pd.read_table('file.tbl') # First row is parsed as column names
df = pd.read_table('file.tbl', header=None) # There's no header. Don't set column names, just number them
df = pd.read_table('file.tbl', header=5) # The sixth row contains the column names (see: skiprows option)
df = pd.read_table('file.tbl', name=['col1', 'col2']) # Assign your own list of names
Row names: by default, assumes there are no row names (there are
fields per line, for column names), but if the first field
is a name for each row (i.e. there are $n+1$ fields on each data line,
one name and named columns), use index_col=0
:
df = pd.read_table('file.tbl', index_col=0) # First field is the row name
To get Pandas to skip comments in the file, use the comment=
option to
tell it what the comment character is:
df = pd.read_table('file.tbl', comment='#') # Skip everything after #'s
peeking at a DataFrame
(nrows, ncols) = df.shape() # Returns 'shape' of table: row x col dimensions
df.head(5) # Returns first 5 rows
df.sample(20) # Returns 20 randomly sampled rows
df.sample(20, axis=1) # Returns 20 randomly sampled columns
indexing rows, columns, cells in a DataFrame
Best way to learn is to play with example DataFrames in a Jupyter notebook page. I find Pandas to be inconsistent and confusing (sometimes you access a DataFrame as if it’s columnmajor, and sometimes like it’s rowmajor, depending on what you’re doing) but it’s powerful once you get used to it.
To extract single columns, or a list of several columns, you can index into the DataFrame as if it’s in columnmajor form:
df['caraway'] # Gets 'caraway' column, as a pandas Series
df[['caraway','kohlrabi']] # Gets two columns, as a smaller DataFrame
As a convenience, you can also use slicing operators to get a subset of rows; now you index into the DataFrame as if it’s in rowmajor form:
df[2:4] # gets rows 2..3; i.e. a halfopen slice interval [2,4)
df[2:3] # gets just row 2
df[2] # DOESN'T WORK: it will try to get a *column* named '2'
To access data by row/column labels, use the .loc[]
method.
.loc
takes two arguments, for which rows and columns you want to
access, in that order (now we’re rowmajor again). These arguments are
the row/column labels: either a single label, or a list. Note
that .loc
takes arguments in square brackets, not parentheses! Also
note that when you use slicing syntax with .loc
, it takes a closed
interval, in contrast to everywhere else in Python where you’d expect
a halfopen interval.
## When rows are labeled:
df.loc['sample1'] # Extracts row as a Series
df.loc[['sample1','sample2']] # Extracts two rows
## When rows are numbered, not labeled:
df.loc[2:4] # Extracts two rows when rows are numbered, not labeled. Note, closed interval!
df.loc[2:4, ['genotype', 'sex', 'A1CF']] # Extracts rows 2..4, and these three columns
df.loc[2,'A2M'] # Get a single value: row 2, column 'A2M'
To access data by row/column positions, use the .iloc[]
method. .iloc[]
has square brackets around its row/col arguments
just like .loc[]
but the arguments are an integer, a list of
integers, or a range like 2:5. Ranges in .iloc[]
are halfopen
intervals; compare:
df.loc[2:4] # By label, closed interval: gets rows 2..4
df.iloc[2:4] # By index, halfopen interval: gets rows 2..3
(I don’t want to hear anyone ever telling me about how welldesigned Python and its modules are. Useful, sure. Welldesigned… um.)
getting, or iterating over row and column labels
df.index # Gets row labels, as an index object
df.columns # Gets column labels, as an index object
list(df) # Gets column labels as a list
[c for c in df] # Example of iterating over column labels (as a list comprehension)
[r for r in df.index] # ...or over rows
sorting data tables
The sort_values()
method gives you a wide range of sorting abilities.
The by=
option tells it what field to sort on (which column label,
usually). The argument to by=
is a single column label (a string),
or a list of them. If it’s a list, it sorts first by the first one,
then second by the second, etc.
By default, it sorts rows, by some column’s value. If you choose ‘axis=1’, it sorts columns, by some row’s value.
By default, it sorts in ascending order; specify ascending=False
to
get descending order.
df_sorted = df.sort_values(by='coriander', ascending=False)
plotting example
To scatter plot two columns against each other (i.e. each row is a point on the scatter plot):
ax = df.plot(kind='scatter', x='caraway', y='coriander')