### MCB112: Biological Data Analysis (Fall 2016)

Instructors:

• Prof. Sean Eddy (office hours: Fri 2-3pm, Biolabs 1008)
• Dr. Tim Dunn (office hours: Mon 5:30-7pm, Biolabs 1008)
• Laura Bagamery (office hours: Mon 10-11am, Northwest 463)
• Chris Nam (office hours: Tues 10:30-11:30, Biolabs 1008)
• Kaia Mattioli (office hours: Weds 4:30-6pm, Bauer 304)
• Jack Nicoludis (office hours: Thurs 2:30-3:30, Northwest 330)
• Marco Catipovic (office hours: Sun 4:30-5:30pm, Science Center 113)

Lectures: Tue/Thu 1-2:30pm, Biolabs 1080 lecture hall

Section: Fri 5:30-7pm, Biolabs 1080 lecture hall

### Description

Biology has become a computational science. New technologies are generating increasingly large and complex data sets, especially in genomics and imaging. This course teaches computational methods for biological data analysis, using an empirical and experimental framework suited to the complexities of biological data, emphasizing computational control experiments. The course is primarily aimed at biologists learning computational methods, but is also suitable for computational and statistical scientists learning about biological data.

### Aims and objectives

The aim is to teach fundamental principles of biological data analysis by example. The course is built around roughly 12 weekly data analysis problems. The data are simulated from a fictitious organism, the sand mouse Mus silicum - i.e. a simulated in silico creature. Almost all the problems focus on genomics and gene expression analysis (RNA-seq, etc.), but this is not an RNA-seq course per se.

In the course of solving the analysis problems, you will learn practical skills in how to write scripts to analyze data, and how to use simulations to do computational positive and negative control experiments. The course is taught in Python, using Python-based data science tools including Pandas and Jupyter Notebook. You will learn how to do computational science: how to understand computational analysis methods, how to design computational experiments, how to think critically, how to develop an organized work pattern, and how to communicate results reproducibly.
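To make the idea of computational control experiments concrete, here is a minimal sketch, not taken from the course materials. It assumes a toy two-group comparison with Gaussian noise, and every name in it (`simulate_groups`, `permutation_pvalue`, `detection_rate`, the effect size, and the sample sizes) is a hypothetical choice made for illustration. The negative control simulates data with no real difference, so the analysis should rarely report one; the positive control plants a known difference, so the analysis should usually find it.

```python
import random
import statistics


def simulate_groups(n, effect, rng):
    """Simulate values for two groups of n samples each.

    Group B is shifted by `effect` (in units of the noise standard
    deviation); effect=0 gives a negative control.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(effect, 1.0) for _ in range(n)]
    return a, b


def permutation_pvalue(a, b, rng, n_perm=200):
    """Two-sided permutation test on the difference of group means."""
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(statistics.fmean(pooled[:len(a)])
                   - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (hits + 1) / (n_perm + 1)


def detection_rate(effect, rng, n=20, n_trials=100, alpha=0.05):
    """Fraction of simulated data sets in which the test reports p < alpha."""
    calls = 0
    for _ in range(n_trials):
        a, b = simulate_groups(n, effect, rng)
        if permutation_pvalue(a, b, rng) < alpha:
            calls += 1
    return calls / n_trials


rng = random.Random(112)  # fixed seed so the experiment is reproducible
false_positive_rate = detection_rate(effect=0.0, rng=rng)  # negative control
power = detection_rate(effect=2.0, rng=rng)                # positive control
print(f"negative control: {false_positive_rate:.2f}, "
      f"positive control: {power:.2f}")
```

If the analysis is behaving, the negative-control rate should sit near the chosen `alpha` and the positive-control rate should be close to 1. A method that fails either check is not ready to be trusted on real data.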

You will also learn some fundamentals of probabilistic inference, statistics, computer science, and applied math – how to think about statistics from first principles, how to read and understand an algorithm well enough to implement it, and examples of why, yes, biologists do need to use calculus sometimes.

### Prerequisites and background

There are no formal prerequisites, because we expect students to come in from different backgrounds – a mix of biologists, computer scientists, and applied mathematicians – and with varying degrees of experience in writing code in Python. The course is designed to bring students up to speed in any area that they haven’t seen much of before.

Of course, the more background you have in the areas of molecular biology, Python coding, algorithms and data structures, probability and statistics, and multivariate calculus, the better off you’ll be. You’ll want to have course background in at least some of these areas, to leave time to come up to speed in the others.

Most of the work is outside of class on your own, working on the weekly data analysis problems.

The Tuesday lecture each week covers fundamental background you need to know for that week’s problem. We expect you to start thinking about your approach to the analysis problem after the Tuesday lecture.

The Thursday class time is more interactive and practical. We expect you to come with whatever questions you have from thinking about the problem so far, for discussion and review. We will walk you through approaches and resources you might want to tap; for example, especially in the early weeks of the course when people are coming up to speed, we will show Python code examples of related problems. (Because we expect some people will have never coded before, we’ll aim to bring you up to speed quickly and practically by providing working Python code examples that you can study and adapt.)

After that, you’re working on your own on the week’s problem. The instructors and teaching assistants are available for office hours and recitation sections for more individual discussion and questions.

Your solution to each week’s problem is due at the start of the next week’s Tuesday lecture (1pm). You submit your solution by email as a Jupyter notebook page.

We generally won’t accept late work. We may consider rare extenuating circumstances on a case-by-case basis, and generally only if you’ve discussed the circumstances with us in advance. (Like, if you know you have to miss a week because you have to be out of town for something important, work that out with us beforehand.)

The grade is based entirely on the weekly data analysis problems. There are no exams or finals. Grades are not curved. We expect that everyone in the course will be able to solve every analysis problem proficiently, or at least competently – we will consider our work on the course to be a failure otherwise. Each problem will be graded on a scale of 1-5 in 0.5 increments, where lower is better: 1=proficient, 2=competent, 3=needs work, 4=insufficient effort, 5=zero effort. Evidence of hard work alone, even with an unsuccessful solution, guarantees a 3 or better – most of the battle in learning data analysis is in investing time and thought in it, even if it comes slowly at first for some. The final letter grade is based on the unweighted average of the weekly problem grades, with A$\leq 1.33$, A-$\leq 1.67$, B+$\leq 2.0$, B$\leq 2.33$, etc.
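As a worked illustration of the arithmetic, the following sketch averages a list of weekly scores and looks up the letter grade. The function name `letter_grade` and the table name `LISTED_CUTOFFS` are hypothetical. Only the four cutoffs stated above are encoded; averages beyond the B cutoff fall under the syllabus's "etc." and are deliberately left unmapped rather than guessed.

```python
# Cutoffs stated in the syllabus (lower averages are better).
# Grades past B are covered only by "etc." there, so they are not listed.
LISTED_CUTOFFS = [(1.33, "A"), (1.67, "A-"), (2.0, "B+"), (2.33, "B")]


def letter_grade(scores):
    """Return (average, letter) for a list of weekly problem scores.

    Each score must lie on the 1-5 scale in 0.5 increments. The letter is
    None when the average falls past the last cutoff listed in the syllabus.
    """
    if not scores:
        raise ValueError("need at least one score")
    for s in scores:
        if not 1 <= s <= 5 or (2 * s) % 1 != 0:
            raise ValueError(f"invalid score: {s}")
    average = sum(scores) / len(scores)  # unweighted average
    for cutoff, letter in LISTED_CUTOFFS:
        if average <= cutoff:
            return average, letter
    return average, None


average, letter = letter_grade([1, 1.5, 1, 1.5])
print(f"{average:.2f} -> {letter}")  # prints "1.25 -> A"
```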

Regular attendance is expected.

You can use laptops and mobile devices to take notes in class. We will split the lecture hall into two sides, devices-allowed and devices-free, so you can choose to sit in the devices-free side to reduce keyboard clacking distractions.

### Materials and access

There is no required textbook for the course. Readings will be available online as PDFs.

You need access to a computer (laptop or otherwise) on which you can install a Python scientific data analysis environment. If you do not have one, the course has a limited number of Mac laptops available for lending. We recommend the free Anaconda distribution from Continuum Analytics; you’ll be ahead of the game if you install it before the course starts. You also need Internet access; among other things, you will be submitting your work each week electronically as a Jupyter notebook page.

There are many online resources for learning Python, but for books, we recommend:

• Mark Lutz, Learning Python. O’Reilly and Associates, 2013.
• Wes McKinney, Python for Data Analysis. O’Reilly and Associates, 2012.

For an excellent (albeit formal/mathematical, and physics-oriented rather than biology-oriented) introduction to the fundamentals of data analysis, we recommend:

• D.S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial. Oxford, 2006.