MCB112: Biological Data Analysis (Fall 2017)

week 12:

doing computational work

Compared with working at the bench as an experimental biologist, you’ve probably found that computational work moves with a different rhythm, and requires a different style of organizing your work and keeping your notes.

With wet experiments, it takes longer to do the work than to plan the work. When I was doing experiments, I often had a comfortable rhythm: evenings at my desk writing down a protocol for the next day; working the next day at the bench, making occasional notes on my protocol where I improvised or something went wrong; then back at my desk to write up the day’s data and observations in my permanent notebook.

I find that computational work is more improvisational, with a rapid trial-and-error cycle. I have a general plan, but I make up the detailed execution as I go. The moment I think of something, I can type the command or write a line of code and see what happens. Every command I type is a little experiment. Code and analysis protocols develop by rapid evolution. It takes longer to plan and document my work than to just keep doing it. But working like this, after a while, I can’t remember what I did, I can’t find when and where I did it, and I can’t reproduce it.

The volume of the data and the code also becomes an issue. Even if I could print it out and paste it into my notebook, it’s dead there. I need my data and code to live in the computer, searchable and executable.

It was easier to learn how to do experimental work by watching people. Doing experiments, I was physically engaged with other people I could learn from, in the same lab, sharing the same equipment. I saw the style and habits of the people around me. Computational work is more insular and highly personal: me on my laptop. I rarely get to see the style and habits of other computational scientists, even in my own lab.

Here I’ll tell you about how I’ve learned to do my own computational work. I’m still evolving my system, and always have been. Don’t take this as gospel, only as ideas. I’m exposing my style and habits to you. Geeks love to talk too much about their tools, so I’ll try to restrain myself to only the important bits.

The main spine of what I’m going to tell you about:

a paper notebook

Your mileage may vary, but one of the most important things to me is getting away from the computer and having a peaceful hour with pen and paper to think about what I’m doing. I still keep a paper notebook, comparable to what I did when I was an experimentalist, but I use it in a different way now. My notebook keeps the high level summary of what I was doing and why, and summarizes the results. All the reproducible details are in the computer.

A typical page from a research work day has a date and short title, a high-level summary of what I was doing and why, and a crossreference to the corresponding electronic notebook directory.

The purpose of my paper notebook is to keep track of things at a high level. All the data and code are in the computer. I can look at the notebook page to get the gist of what I’ve done, and I can refer into the computer if I need details.

daily rhythm

At the beginning of a good work day, I sit with a cup of coffee, look back over previous work, and write down what I want to get done that day. At the end of a work day, I sit with a favorite beverage, think about whether I actually got anything done, and summarize it and write it down.

I often find that when I’m thinking about what I did, I didn’t quite finish it properly, because I was working in a hurry and playing around with different ways to do it. Summarizing it in my paper notebook at the end of the day, trying to pull things together into a coherent summary almost like a draft of a mini-paper, forces me to go back and make things coherent. I realize that I forgot to record an important version number, or I didn’t finish making an important plot that I wanted. So unlike my rhythm for experimental work, I iterate back to the actual work, cleaning it up in the computer as I’m summarizing it on paper.

I put myself in the mindset of “I am going to be looking at this a year from now, and I will not remember what I did”. I write consciously for my future self. Similarly my software code is littered with comments like “you’re going to want to rewrite this because it looks wrong but don’t, because it’s right, and here’s why…” When I was getting started in science I thought I was going to remember everything I did (and for the most part, I did!). I thought the purpose of my notebook was for my advisor to look at what I did. But I have never once had any advisor look at my notebooks, whereas after 30+ years in science, I’ve gone back innumerable times looking for something I know I knew once (and often failed). Now I realize my notebook system isn’t for anyone else. It’s an extension of my own memory.

one notebook to rule them all

I only keep one notebook at a time, for everything: research, notes, ideas, to-do lists, planning trips I have to take, angry interventional diatribes against myself, and so on. I used to keep multiple notebooks for different things, but this was crazy. I never had the right notebook with me when I wanted it. My notebook is a chronological diary. I remember things better that way, in the context of everything I was doing at the time.

I keep a table of contents, with one descriptive line per page. I can scan quickly through years of work through these tables of contents. Because it’s all chronological and in context of everything else I’m doing, and because I do this kind of scan so often, it ends up being a kind of memory palace trick. I end up burning the structure of my notebooks into my memory.

For about the last fifteen years or so, I’ve gotten my notebooks from Eureka Notebooks. They’re bound black notebooks, 152 numbered pages, with front pages for a table of contents, and they fluff up enough to accommodate a bunch of cut-and-pasted additions. I go through about one or two notebooks a year. I number them (my St. Louis notebooks are STL1-11; Janelia are J1-15; I’m on H4 now here at Harvard) and I keep the old ones on my shelf. When a notebook is done, I have it scanned to a PDF (many copy/print stores can do this for you), and I keep all the PDFs on my laptop, so I always have access to all my notes even when I’m travelling. The PDFs are also my long-term archive; I’m careful to keep them backed up in multiple places. I also keep and archive the scanned PDFs of other people’s notebooks from my laboratory, when they leave, and they keep the originals.

I’m compulsive about everything, down to brands of pens and pencils, but I’ll spare you. The important thing is to use permanent ink, not pencil, for your notes. Your notes need to last decades, and they need to survive disasters like coffee spills.

The notebook is mostly for me to remember my own work, but a laboratory notebook is also a legal record. If there were ever a dispute over intellectual property – when you had an idea, when you invented something – your notebook could become evidence in court. This is extremely rare, but it does happen. My notebooks have never had to appear in court (I had one close call a few years ago - long story), but I have seen other people’s notebooks get the legal treatment. I’ve served as an expert witness, and it’s brutal watching lawyers go into a scientist’s notebook looking for ways to impugn it. So even though my notebook is mainly for my own use, I’m also conscious of doing things by the book when it’s not too inconvenient research-wise. For example, if I ever make a change or additional notes on an old page, I usually do it in a different color pen, and I date and cross-reference the change. This is more for my own good (forward- and backward-tracking errors) than for the legal record, but it would also help legally if it ever came to that.

People will tell you you’re not supposed to carry the laboratory notebook outside the laboratory, to minimize the risk of losing or damaging it. This is a rule I routinely violate, mostly because I work everywhere with my laptop. I carry my notebook and laptop everywhere – airplanes, coffee shops, conferences – so I’m assuming those risks.

A downside of using paper is that it’s not as searchable as an electronic record. I rely very heavily on my tables of contents, and my memory. One thing I haven’t done yet that I probably will get to someday – OCR’ing my scanned notebook PDFs, to make them full text searchable. Technology appears to be almost up to this now.

an electronic notebook (or rather: filing system)

It’s important to distinguish the needs of creative, day-to-day research from the needs of a production pipeline. If you’re building a robust production pipeline you’ll get pulled in the direction of lab management software and electronic notebooks: with the digital advantages of high capacity, searchability, electronic archiving, but with disadvantages of additional overhead and inflexibility (working within a specific GUI or electronic notebook system).

For creative exploratory work, the premium is on flexibility, and therefore on simplicity. For my work I’ve never found it convenient to use an “electronic lab notebook” (for one thing, I like to draw diagrams freehand - another reason I like my paper notebook), and every system I’ve explored, I find too constraining. Instead I organize my “notebook” on the computer very simply, using the unix filesystem and some simple naming conventions. I want two things out of it:

I have a directory in my home directory called notebook. In there, each computer-y work thing that I do (that is, anything associated with computer files) gets an “electronic notebook page”: a directory, named by date-title, something like 1122-hmmer4-finished. In each directory, I have a file called 00README (or lately, formatted in Markdown) that contains my top-level notes and crossreferences. Any other files or directories I want to keep from the work, I put in the directory.

By using the date-title naming scheme, the unix filesystem automatically sorts listings chronologically when I do an ls. Similarly, by naming my main notes file 00README, ls will sort that file to the top of listings of contents. The brief title (the hmmer4-finished part) is useful if I’m trying to find something quickly by scanning the notebook listing.
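A toy sketch of what this looks like on disk (all the directory and file names here are made up for illustration):

```shell
# hypothetical notebook entries; MMDD- date prefixes sort chronologically
mkdir -p notebook/0103-rnaseq-qc notebook/0214-hmmer4-benchmarks notebook/1122-hmmer4-finished
touch notebook/1122-hmmer4-finished/00README
touch notebook/1122-hmmer4-finished/analysis.py
ls notebook                         # chronological listing, for free
ls notebook/1122-hmmer4-finished    # 00README sorts to the top
```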

Because I keep both my paper and electronic notes chronologically, I generally don’t keep directories organized by project (some people do). Instead, different projects I’m working on are interleaved chronologically. Like my paper notebook, almost everything I do electronically goes into the notebook, in one chronological sequence: research, grant and manuscript reviews, MCB112 pset drafts, and so on. My notebook directories are accessible by default to the other people in my lab (so if I send someone a crossreference like 1122-hmmer-finished, they know where to look), and I chmod 700 any directories that are confidential for some reason.

By the end of a year, I have something like 100-200 directories in my notebook. Then I create an archive directory (Archive2016, say) and move them all into it. (In a couple of years or so, when I’m no longer accessing them much, these archive directories get moved out of my current notebook and into a directory structure that I use for longer-term archiving, along with other stuff I’m archiving.)
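The year-end sweep is nearly a one-liner. A sketch, assuming the MMDD-title naming above (so every entry starts with a digit); the entry names are hypothetical:

```shell
# build a couple of toy notebook entries, then archive the whole year
mkdir -p notebook/0103-rnaseq-qc notebook/1122-hmmer4-finished
cd notebook
mkdir Archive2016
mv [01]*-* Archive2016/   # every MMDD-title directory moves into the archive
ls                        # now just Archive2016
```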

The 00README crossreferences to the corresponding paper notebook page summary (xref H2/15); the paper notebook crossreferences to the electronic notes (1122-hmmer4-finished). I can search either the paper table of contents or the directory listing in the electronic notebook chronologically by short title/description to find previous work.

daily work pattern: what goes in the 00README

The 00README is just a plain text file (I use emacs as my text editor), or perhaps Markdown, a lightweight markup language. Its primary purpose is to receive a lot of back-and-forth cut-and-pasting as I work.

I’m mostly working at the command line in a terminal window. When a command works and it’s on the main stream of what I’m trying to get done, I cut and paste it to the 00README. Some electronic notebook systems simply log everything you type. This is insanity. Only a small fraction of my commands end up being on-point for the work.

At the same time, I’m organizing the 00README into little blocks of code, massaging them into what are essentially short bash scripts, and annotating them with what I’m trying to do. Then when I iterate back to redo something, or improve on something, I can cut-and-paste an existing “script”, either straight into the terminal, or elsewhere in the 00README where I make modifications to it.

As I’m working, the 00README starts to become an organized document that sort of looks like a Jupyter Notebook page, but in plaintext, with a mix of text documentation and bash commands. Any scripts I need (perl, python, R, bash) are in .pl, .py, .r, .sh files in the directory, and I make note of their reason and existence in the 00README at the appropriate place.
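As a concrete illustration, here’s the kind of annotated block a 00README might accumulate (the data, filenames, and commands are entirely hypothetical toy examples, not from my real notes):

```shell
# --- filter the hits above threshold and count them ---
# (toy data stands in for real results)
printf 'hit1 0.9\nhit2 0.2\nhit3 0.7\n' > scores.txt
awk '$2 > 0.5' scores.txt > top_hits.txt
wc -l < top_hits.txt
```

Redoing the analysis later with a different threshold just means cut-and-pasting this block and editing the awk condition.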

I don’t like working in something like Jupyter Notebook itself, just because it comes with added inflexibility and overhead (I stick close to plaintext and the command line to keep things simple and flexible). Also because after all these years, emacs is hardwired into my neurons, and I can’t stand working with text in a system that doesn’t have emacs power at its disposal.

“provenance” section

One of the easiest things to lose track of in computational research is the exact versions of software and data that you used. I often add a “provenance” section at the top of the 00README that lists where I got various data and code from.

I will also often use symlinks in the notebook directory to link to exact versions of databases that I use, for example:

   ln -s /n/eddy_lab/data/uniprot/uniprot-62.3/uniprot-swissprot.fasta swissprot.fa

creates swissprot.fa in my working directory as a link to a specific permanent database on our lab disks. (We have a system in the lab for maintaining archival copies of old data that keeps such symlinks alive for a long time, and such that the directory names include version numbering; thus even when we clean up and delete old data, the now-dead symlink contains the relevant versioning info just in its name alone.) The symlink-creating ln -s command goes into my provenance section. We keep our installed software in a similar way, so symlinks to installed software will automatically convey version info.

When I’m using my own development code, I’ll typically record the git commit version in the provenance section.
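For example (a self-contained toy; in real use I’d run git rev-parse inside my actual development repo and paste the result into the 00README’s provenance section):

```shell
# make a throwaway repo, just to have a commit to point at
mkdir demo && cd demo
git init -q
echo "v1" > tool.py
git add tool.py
git -c user.email=me@example.com -c user.name=me commit -qm "first version"
git rev-parse --short HEAD   # the hash to record in the provenance section
```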

Because I generally work chronologically rather than by projects, a particular project evolves over time across multiple directories. When I do a new version of an older experiment, I make copies of scripts or other things I need, and note those cp commands in my provenance section. Each notebook directory is a standalone archive of what I did on a particular day (or perhaps over a few days). Like my paper notebook, it’s unusual that I go back and edit old notebook directories. This means I have a lot of copies of copies, which in general in software development and computational science is supposed to be a Bad Thing that you’re supposed to use version control for. Arguably, I could achieve the same ends in a version control system that snapshots an entire directory (such as git), but I’ve found that I prefer to be able to immediately look back over the chaotic chronology of a project’s directories, rather than organizing it as a single git project and pretending that it’s less chaotic than it actually is.
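Those provenance cp lines might look like this (all directory and script names are hypothetical):

```shell
# set up a toy previous version, then carry its script forward into v2;
# the cp line is what gets pasted into the new 00README's provenance section
mkdir -p 1010-projectx-figure4-v1 1122-projectx-figure4-v2
echo 'print("plot")' > 1010-projectx-figure4-v1/plot_results.py
cd 1122-projectx-figure4-v2
cp ../1010-projectx-figure4-v1/plot_results.py .
```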

working toward a manuscript

As time goes on, various experiments are working their way toward being figures and tables of a manuscript, as I redo them (over and over and over again). Notebook directories are starting to get titles like 1122-projectx-figure4-v2. The amount of copying increases (copying v1 of an experiment to v2 to improve and redo it), and the provenance tracking gets more careful. The 00README gets tighter, evolving toward something that’s starting to look more like supplementary material than disorganized chaos.

Meanwhile I’m writing the paper in a separate directory organization, under something like writing/manuscripts/projectx. In there I have something like a Figures subdirectory, in which I symlink directly to the current versions of the figures and tables that are in my notebook.
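The symlinking might look like this (a self-contained sketch; every name is hypothetical):

```shell
# a notebook directory holding the current version of a figure, and a
# manuscript Figures directory that links to it
mkdir -p notebook/1122-projectx-figure4-v2 writing/manuscripts/projectx/Figures
echo "fig" > notebook/1122-projectx-figure4-v2/figure4.pdf
ln -s "$PWD/notebook/1122-projectx-figure4-v2/figure4.pdf" \
      writing/manuscripts/projectx/Figures/figure4.pdf
```

When the figure gets redone in a v3 directory, only the symlink needs updating; the manuscript always sees the current version.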

By the time the manuscript is ready, the methods and supplementary data are essentially already written in my notebook: the relevant 00README’s for each figure or table contain the exact commands needed to reproduce the work. I aim to be able to cut-and-paste whole blocks of commands from the 00README and nigh-automagically reproduce the figures and tables exactly. We bundle all this stuff up, turn symlinks into copies, organize it and clean it a little more, make sure the commands all still work (it’s surprisingly easy to mess up a trivial detail in the cleaning-up phase, breaking everything) and I release it as a tarball accompanying the paper.
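One way to do the symlinks-into-copies step at bundling time (a sketch, not necessarily how we do it, using tar’s -h flag, which dereferences symbolic links; all names hypothetical):

```shell
# build a toy supplement containing a symlink, then bundle it with -h so
# the tarball contains a real copy of the file, not the link
mkdir -p supplement
echo "figure data" > real-figure4.dat
ln -s ../real-figure4.dat supplement/figure4.dat
tar -chzf supplement.tar.gz supplement
mkdir extracted && tar -xzf supplement.tar.gz -C extracted
```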

code, version control, and git

Software and code that I’m using often and across projects eventually get adopted into project-oriented directories under a top-level directory called src (for “source code”) for software, and scripts for scripts. New scripts and programs are born in notebook directories, and get promoted to src and scripts. Once there, they come under version control, and currently I’m using git as my version control system.

It’s easy to think of git and GitHub as the same thing. GitHub is an online open source code sharing system, and it’s great. A bunch of stuff from our lab is hosted at GitHub, for our largest and most widely distributed software projects. MCB112 itself is hosted in a private GitHub repository.

git, though, is a command line program that you can run easily on your laptop without ever dealing with GitHub. I have innumerable git repositories on my laptop, for different personal projects. I use git as part of my normal daily workflow, whereas GitHub is part of our lab’s publication and dissemination strategy. I use git not just for code and software, but for pretty much anything where I generally want to see only one current, up-to-date version, but where I also want to keep old versions and log info about my changes.

lightning git intro

If you haven’t used git or version control before, it’s important to know how easy it is to get started. If you don’t have git installed, follow the instructions at git-scm.com. Suppose you have a new project with a couple of files, like so:

   mkdir myproj
   cd myproj
   echo "This is file1" > file1
   echo "This is file2" > file2

then, sitting in that directory, you initialize a new git repository like so:

  git init

and boom, you have a git repository: it’s stored in .git, a hidden directory in your project.

To tell git which files you want it to take care of (“track”), you use git add:

   git add file1 file2

When you have your project in a state that you want to save (“commit”), you use git commit to take a versioned snapshot of it:

   git commit

It’ll ask you to write a log message, where you make notes about what you’ve done so far.

When you make changes, bringing the project to a new state, you git add the new and changed files, and git commit a new snapshot of the project:

   echo "This is a new line on file2" >> file2
   git add file2
   git commit

To look back at your version history, you type git log.

Specific git commits are identified by a horrendous long string like 6f85a71cb764a82e33d5402ac052a5c52260cd75 (the “SHA-1 hash”, in the jargon) but are traditionally referred to by just a brief prefix of the hash; any (unique) prefix will actually do, but git pros seem to default to 7, as in 6f85a71.

To revert the project to an earlier state, you use git checkout. For example:

   git checkout 6f85a71

if your git log shows that you have a commit prefixed 6f85a71. The entire state of the project (the files that git is tracking, anyway) is immediately timewarped back to the way it was in commit 6f85a71. The main line of your commits is called master by default, so to get back to your current state:

   git checkout master

That’s all you really need to know to get started using git productively. Good git tutorials online include one from Atlassian, and the Scott Chacon and Ben Straub book Pro Git.

Further reading on computational work

Two good articles about organizing computational work:

Further reading on doing science

Two great essays on doing science: