MCB112 Biological Data Analysis

Lectures: Mon/Weds 3:00-4:15pm, Fong Auditorium (Boylston Hall), starting Weds 31 Aug
Section: Fri 3:00-4:15pm, Fong Auditorium (Boylston Hall)

We are currently recruiting the teaching team for the fall 2022 semester. If you are a Harvard graduate student with experience in quantitative biology, statistics, and Python programming, and you're interested in teaching in MCB112 this fall, please contact the instructor, Sean Eddy.

Office hours will begin on Weds 31 August. All times Eastern. You can also contact TFs and CA directly for help and to make appointments.

Description

Biology has become a computational science. New technologies are generating larger and more complex data sets, especially in genomics and imaging. This course teaches computational, statistical, and mathematical methods for biological data analysis, using an empirical and experimental framework suited to the complexities of biological data, emphasizing computational control experiments. The course is primarily aimed at biologists learning the fundamentals of data analysis methods, but it is also suitable for computational, mathematical, and statistical scientists learning about biological data.

Aims and objectives

MCB112 teaches fundamental principles of biological data analysis by example. The course is built around 12 weekly data analysis problems. These problems typically use synthetic simulated data sets from a fictitious in silico creature, the sand mouse Mus silicum. Most problems focus on gene expression analysis with RNA-seq, just to have one coherent experimental data type to focus several different data analysis problems on; this is not an RNA-seq course per se.

In the course of solving analysis problems, you will learn practical skills in how to write scripts to analyze data, and how to use simulations to do computational positive and negative control experiments. The course is taught in Python, using Python-based data science tools including NumPy, SciPy, Pandas, and Jupyter Notebook. You will learn how to do computational science: how to understand computational methods, how to design computational experiments, how to think critically, how to develop an organized work pattern, and how to communicate results reproducibly.

You will also learn some fundamentals of probabilistic inference, statistics, computer science, and applied math -- how to think about statistics from first principles, and how to read and understand an algorithm well enough to implement it. The course aims to motivate biologists to learn more mathematics, computer science, and statistics, by showing how these skills are relevant to biological data analysis.

Prerequisites and background

There are no formal prerequisites. We expect students to come from different backgrounds -- a mix of biologists, computer scientists, and applied mathematicians -- and to have varying degrees of experience in writing code in Python. The course is designed to bring students up to speed in any area that they haven't seen much of before.

MCB112 is designed to be a course that could come before other rigorous coursework in biology, programming, statistics, and applied math, even though we do a mix of biology, programming, stats, and math in the course. Underlying the course's design is a philosophy that a biologist (indeed anyone) is perfectly capable of learning enough math, programming, and statistics to do sophisticated data analyses, but many of us have trouble building up abstract skills without first knowing why we're doing it. MCB112 emphasizes practical data analysis problems, and though at times you may feel like you've been dropped into the deep end of the pool, you will learn by example why math/programming/stats skills grant you mutant superpowers for modern biology research. In part, we judge the success of MCB112 by how many of our students go on to take coursework in fields they wouldn't have dreamt of studying before.

However, it would be tough to come into the course with no background at all. We expect you to have course background in either the molecular biology side or the stats/math/programming/CS side. We do molecular biology at the level of LS1; Python programming around the level of AM10, CS109, or CS50; statistics around the level of STAT110 and STAT111; multivariate calculus and linear algebra around the level of MA21 or AM21; and a wee taste of data structures and algorithms. The more of these things you've taken already, the easier MCB112 will be.

Flow of the course

You'll probably find that most of the work is outside of class on your own, working on the weekly data analysis problems.

The Monday lecture each week covers fundamental background you need to know for that week's problem. We expect you to start thinking about your approach to the analysis problem after the Monday lecture.

The Wednesday lecture will start to get more practical. I'll cover any computational or mathematical techniques that you need for the pset. We expect you to come with whatever questions you have from thinking about the problem so far, for discussion.

The Friday class is run by the teaching fellows, and is even more practical. They will walk through the problem set, give you more of a framework for how to approach it, and point you to suggested approaches and resources. For example, especially in the early weeks of the course when people are coming up to speed, we will show Python code examples of related problems. Because we expect some people will never have coded before, we'll aim to bring you up to speed quickly and practically by providing working Python code examples that you can study and adapt.

Problem sets come out each Monday. Your solution to each week's problem is due the following Wednesday at 1pm (in 9 days' time). You can always turn in a pset early if you despise the overlap, but we figure that giving you a couple of days after the weekend is more civilized than making the pset due on Monday. It also gives you some time to ask us more questions on Monday/Tuesday. You submit your solution by email as a Jupyter notebook page.

Grading

The grade is based entirely on the weekly data analysis problems. There are no exams or finals.

Each weekly problem is graded on a scale of 1-10:

10 proficient
9 correct but minimal comments or explanation
8 very minor errors or missing comments or explanation
7 minor errors or partially incorrect answers
6 major errors or incorrect answers
5 pseudocode with good comments, but no implementation (i.e. good pseudocode gets at least half credit)
4 minimal pseudocode
3 no pseudocode, but minimal idea sketching
2 any attempt
1 no answer

We will give extra credit of 0.5 or 1.0 for particularly excellent answers: because you went above and beyond in writing a clear description, because your work or your code shows exceptional creativity, or because you did additional analyses or visualizations.

The final letter grade is an unweighted mean of the 12 weekly grades:

≥ 9.3 A
[9.0..9.3) A-
[8.7..9.0) B+
[8.3..8.7) B
[8.0..8.3) B-
[7.7..8.0) C+
[7.3..7.7) C
[7.0..7.3) C-
[6.7..7.0) D+
[6.3..6.7) D
< 6.3 F
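
To see how the cutoffs work in practice, here is a minimal Python sketch that maps a list of weekly scores to a letter. The names `CUTOFFS` and `letter_grade` are made up for illustration, and the sketch assumes the mean is compared against the cutoffs without any rounding; it is not the teaching team's actual gradebook code.

```python
from statistics import mean

# Lower bound of each letter band, highest first, copied from the table above.
CUTOFFS = [
    (9.3, "A"), (9.0, "A-"), (8.7, "B+"), (8.3, "B"), (8.0, "B-"),
    (7.7, "C+"), (7.3, "C"), (7.0, "C-"), (6.7, "D+"), (6.3, "D"),
]


def letter_grade(weekly_scores):
    """Map the unweighted mean of the weekly scores to a letter grade.

    Assumption: no rounding is applied before comparing to the cutoffs.
    """
    average = mean(weekly_scores)
    for lower_bound, letter in CUTOFFS:
        if average >= lower_bound:
            return letter
    return "F"  # anything below 6.3
```

For example, six 8s and six 9s average 8.5, which falls in the [8.3..8.7) band, so `letter_grade([8] * 6 + [9] * 6)` returns `"B"`.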

We will accept late psets up to two days late (1pm Friday) for 75% credit (i.e. your score will be multiplied by 0.75). We may consider other arrangements for late psets due to rare extenuating circumstances on a case-by-case basis, and generally only if you have discussed the circumstances with us in advance. Like, maybe you have to miss a week because of something else important that you're doing. Work that out with us in advance!

We will drop your lowest pset grade and replace it with your highest grade from a subsequent pset. This is different from just dropping your lowest grade. The aim is to incentivize you to keep trying to improve throughout the course, right up to the final pset.
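
A small Python sketch may make the replacement rule concrete. The function name `apply_replacement` is hypothetical, and two details are assumptions of this sketch that the policy above does not spell out: ties for the lowest score go to the earliest pset, and nothing changes if your lowest score is on the final pset.

```python
def apply_replacement(scores):
    """Return a copy of scores with the lowest one replaced by the best later score.

    Assumptions (not stated in the syllabus): a tie for lowest resolves to the
    earliest pset, and a lowest score on the final pset is left unchanged.
    """
    adjusted = list(scores)
    if not adjusted:
        return adjusted
    lowest_index = adjusted.index(min(adjusted))  # earliest occurrence of the minimum
    later_scores = adjusted[lowest_index + 1:]    # only subsequent psets count
    if later_scores:
        adjusted[lowest_index] = max(later_scores)
    return adjusted
```

With scores of 5, 8, 9, 7, the 5 is replaced by the later 9, for a mean of 8.25. Simply dropping the 5 would give a mean of 8.0, so a strong pset late in the course counts for more under this rule.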

We expect that people taking the course have a range of different backgrounds. Some people have extensive Python programming experience, and some people have never programmed in any language before. The course is meant to be accessible to both sorts of people. If you don't have much programming experience, we're going to take that into account, and cut you slack especially early in the course. That'll make the grading a little subjective. We're not aiming to use the grade as some sort of objective professional-competency ruler. We're not gatekeepers; we're personal trainers. We want you to work hard and stretch yourself. We want your grade to reflect what you got out of the course, not what you brought into it.

Materials and access

There is no required textbook for the course. All lecture notes will be available online here on the mcb112.org web site. By Monday of each week, we'll post that week's lecture notes and the pset. On weeks that a pset is due, by about Friday we'll post answers. Links to all these appear in the navigation sidebar, and in the online schedule.

Lectures will be recorded. They'll appear shortly after each lecture in the Panopto tab of the Canvas site.

There is a Piazza forum for asking questions (and getting answers, even!). The entire teaching team will be active on Piazza for a wide range of different hours. We have a private Piazza license for MCB112. Harvard has been unable to reach an institutional agreement with Piazza because of concerns over Piazza's marketing practice. Be a little wary of what personal info you provide to Piazza. Apparently they're monetizing it to provide you with "career services". If you run across something that concerns you, let me know.

You need to have access to a computer (laptop or otherwise) that you can install a Python scientific data analysis environment on. We recommend the free Anaconda distribution. You should install Anaconda ahead of time.

If you can't get Anaconda installed, you can use JupyterHub in a web browser instead. Go to the "FAS OnDemand" tab of Canvas and choose "Jupyter Notebook (General)". This launches Jupyter on the FAS Research Computing cluster, with a standard Python3 environment.

If you don't have a computer, the course has a limited number of Mac laptops available for lending. Don't hesitate or be embarrassed to ask for one. (When I was an undergraduate, there was absolutely no way I could afford a computer; one of my professors kindly lent me one.)

You'll turn in your pset each week as a single Jupyter Notebook .ipynb file, submitted to the "Assignments" tab of our Canvas page.

There are many on-line resources for learning Python, but for books, we recommend:

  • Mark Lutz, Learning Python. O'Reilly and Associates, 2013. (5th edition)
  • Wes McKinney, Python for Data Analysis. O'Reilly and Associates, 2017. (2nd edition)

The Harvard library has electronic copies of both books available for students in the course, and you can find links to them on our Canvas page.

Getting started with Python

If you haven't programmed before and you want to get a head start on learning Python, check out the Harvard College Python Boot Camp. This is a short, online (asynchronous) four-lesson introduction.

Academic integrity

You must do each week's data analysis project individually, on your own, rather than working collaboratively in groups. Your writing and your code must be your original work.

One goal of the course is to teach you how to understand and do biological data analysis yourself, without relying on interdisciplinary collaborations between people of disparate skills and interests. This is what the weekly data analysis projects are designed to push you to do.

As you're learning, though, you are free to talk with each other, and to consult any resource, and to study code from other sources. This is how we learn anything. It's how you'll learn at every step of your future. We trust you to know the difference between asking a friend for help versus copying their answer.

The principle is that although we encourage you to learn in any way that you prefer, each week you must reach the point where you can understand and execute your work independently and originally; this is what we'll be looking for.

We trust you. We expect you to act with honor and integrity. For example, it would not be hard to find previous versions of the course notes and problem sets online, but you should not go looking for them, and we trust you not to. We expect you are taking the course because you want to learn.

Auditing the course; pass/fail option

Auditing the course is not practical or meaningful. The course is entirely structured around the psets and various ways of having the teaching staff help people do them; we only have teaching staff sufficient for registered students. That said, anyone is welcome to follow along with the material posted here at mcb112.org, or to sit in on the course lectures (space permitting).

You can take the course pass/fail. It's no problem from our perspective; check with your concentration advisor to make sure it's ok from theirs. If you're taking MCB112 towards your concentration, you generally need to be taking it for a grade. There's a petition form from the registrar that I sign for you.

Simultaneous enrollment

If you want to take another course with lecture times that overlap with MCB112, the registrar may require you to get a signed "simultaneous enrollment" agreement. We cannot sign these. The registrar requires us to provide "hour-for-hour direct and personal compensatory instruction", which we are unable to do.

I don't recommend signing up for an overlapping course. You should plan to attend lectures and engage with MCB112 fully. But in my view, it's up to you; provided that you're able to enroll with the registrar, you can navigate through MCB112 as you see fit.

Accommodations for disabilities

Students needing accommodations because of a disability should present their Faculty Letter from the Accessible Education Office (AEO) and speak with Sean or Mary by the end of the second week of the term (10 September) for us to be able to respond in a timely manner.