MCB112 Biological Sequence Analysis

Lectures: Mon/Weds 3:00-4:15pm, Fri 3:00-3:50pm
Location TBD
Starting Weds 4 September

Teaching team      Office hours   Location
Prof. Sean Eddy    TBD            TBD
Liana Merk         TBD            TBD

Are you a PhD student in quantitative biology, looking for a fabulous TF opportunity? Look no further - we're recruiting for a teaching team of about 6-7 for fall 2024. Contact me at seaneddy@fas.harvard.edu.

Office hours will begin Weds 4 September too. You can also ask questions and get help on our Ed Discussion board for the course.

Description

Biology has become a computational science, requiring analysis of large data sets from genome sequencing and other technologies. This course teaches computational methods in biological sequence analysis, using an empirical and experimental framework suited to the complexities of biological data, emphasizing computational control experiments. The course is primarily aimed at biologists learning computational methods, but is also suited for computational and statistical scientists learning about biological sequence data.

Aims and objectives

MCB112 teaches fundamental principles of comparative genome sequence analysis (and more generally, of computational molecular biology) by using realistic but pint-sized research examples. The course is built around 12 weekly analysis problems using bacteriophage (bacterial virus) genome sequences collected by undergraduates in HHMI-sponsored SEA-PHAGES courses.

In the course of solving analysis problems, you will learn practical skills in how to write scripts to analyze biological data, and how to use simulations to do computational positive and negative control experiments. The course is taught in Python, using Python-based data science tools including NumPy, SciPy, Pandas, and Jupyter Notebook. You will learn how to do computational science: how to understand computational methods, how to design computational experiments, how to think critically, how to develop an organized work pattern, and how to communicate results reproducibly.

You will also learn some fundamentals of probabilistic inference, statistics, computer science, and applied math -- how to think about statistics from first principles, and how to read and understand an algorithm well enough to implement it. A not-so-hidden agenda is to motivate biologists to learn more mathematics, computer science, and statistics, by showing how these skills are relevant to biological data analysis.

Prerequisites and background

There are no formal prerequisites. We expect students to come from different backgrounds -- a mix of biologists, computer scientists, and applied mathematicians -- and to have varying degrees of experience in writing code in Python. The course is designed to bring students up to speed in any area that they haven't seen much of before.

MCB112 is designed to be a course that could come before other rigorous coursework in biology, programming, statistics, and applied math, even though we do a mix of biology, programming, stats, and math in the course. Underlying the course's design is a philosophy that a biologist (indeed anyone) is perfectly capable of learning enough math, programming, and statistics to do sophisticated data analyses, but many of us have trouble building up abstract skills without first knowing why we're doing it. MCB112 emphasizes practical data analysis problems, and though at times you may feel like you've been dropped into the deep end of the pool, you will learn by example why math/programming/stats skills grant you mutant superpowers for modern biology research. In part, we judge the success of MCB112 by how many of our students go on to take coursework in fields they wouldn't have dreamt of studying before.

However, all that said -- it would be tough to come into the course with no background at all. We expect you to have course background in either the molecular biology side or the stats/math/programming/CS side. We do molecular biology at the level of LS1; Python programming around the level of AM10, CS109, or CS50; statistics around the level of STAT110 and STAT111; multivariate calculus and linear algebra around the level of MA21 or AM21; and a wee taste of data structures and algorithms. The more of these things you've taken already, the easier MCB112 will be.

Flow of the course

You'll probably find that most of the work is outside of class on your own, working on the weekly data analysis problems.

The Monday lecture each week covers fundamental background you need to know for that week's problem. We expect you to start thinking about your approach to the analysis problem after the Monday lecture.

The Wednesday lecture will start to get more practical. I'll cover any computational or mathematical techniques that you need for the pset. We expect you to come with whatever questions you have from thinking about the problem so far, for discussion.

The Friday class is run by the teaching fellows, and is even more practical. They will walk through the problem set, give you more of a framework for how to approach it, and point you to suggested approaches and resources. For example, especially in the early weeks of the course when people are coming up to speed, we will show Python code examples of related problems. Because we expect that some people will never have coded before, we'll aim to bring you up to speed quickly and practically by providing working Python code examples that you can study and adapt.

Problem sets come out each Monday. Your solution to each week's problem is due on Wednesday of the following week at 1pm (nine days later). You can always turn in a pset early if you despise the overlap, but we figure that giving you a couple of days after the weekend is more civilized than making the pset due on Monday. It also gives you some time to ask us more questions on Monday/Tuesday. You submit your solution by uploading a Jupyter notebook to our Canvas site.

Grading

The grade is based entirely on the weekly data analysis problems. There are no exams or finals.

Each weekly problem is graded on a scale of 1-10:

10 proficient
9 correct but minimal comments or explanation
8 very minor errors or missing comments or explanation
7 minor errors or partially incorrect answers
6 major errors or incorrect answers
5 pseudocode with good comments, but no implementation (i.e. good pseudocode gets at least half credit)
4 minimal pseudocode
3 no pseudocode, but minimal idea sketching
2 any attempt
1 no answer

We will give extra credit of 0.5 or 1.0 for particularly excellent answers - because you went above and beyond in writing a clear description, because your work or your code shows exceptional creativity, or because you did additional analyses or visualizations.

The final letter grade is based on the unweighted mean of the 12 weekly grades:

≥ 9.3 A
[9.0..9.3) A-
[8.7..9.0) B+
[8.3..8.7) B
[8.0..8.3) B-
[7.7..8.0) C+
[7.3..7.7) C
[7.0..7.3) C-
[6.7..7.0) D+
[6.3..6.7) D
< 6.3 F
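If it helps to see the table as a lookup, here is a minimal Python sketch. It is only an illustration, not the code we actually grade with: the cutoffs are copied from the table above, and the function names and the example scores are made up for this sketch.

```python
# Lower bound of each letter-grade band, highest first (copied from the table).
CUTOFFS = [
    (9.3, "A"), (9.0, "A-"), (8.7, "B+"), (8.3, "B"), (8.0, "B-"),
    (7.7, "C+"), (7.3, "C"), (7.0, "C-"), (6.7, "D+"), (6.3, "D"),
]

def mean_score(weekly_scores):
    """Unweighted mean of the weekly pset scores."""
    return sum(weekly_scores) / len(weekly_scores)

def letter_grade(mean):
    """Return the letter for the first band whose lower bound the mean reaches."""
    for lower, letter in CUTOFFS:
        if mean >= lower:
            return letter
    return "F"

# Hypothetical set of 12 weekly scores (sum is 108, so the mean is 9.0).
example_scores = [10, 9, 8, 9, 10, 9, 7, 10, 9, 8, 9, 10]
print(letter_grade(mean_score(example_scores)))  # prints A-
```

Each band includes its lower bound and excludes its upper bound, which is why a mean of exactly 9.0 lands in A- rather than B+.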

We will accept late psets up to two days late (1pm Friday) for 75% credit (i.e. your score will be multiplied by 0.75). We may consider other arrangements for late psets due to rare extenuating circumstances on a case-by-case basis, and generally only if you have discussed the circumstances with us in advance. Like, maybe you have to miss a week because of something else important that you're doing. Work that out with us in advance!
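The late rule is simple arithmetic, sketched below in Python. Treat this as an illustration under assumptions: the function name and the example due date are hypothetical, and returning zero credit after the two-day window is our reading of the default case when no other arrangement has been made.

```python
from datetime import datetime, timedelta

LATE_WINDOW = timedelta(days=2)   # Wednesday 1pm -> Friday 1pm
LATE_MULTIPLIER = 0.75            # 75% credit inside the late window

def credit_multiplier(submitted, due):
    """Return the factor applied to a pset score, given submission and due times."""
    if submitted <= due:
        return 1.0
    if submitted <= due + LATE_WINDOW:
        return LATE_MULTIPLIER
    # Assumption: beyond the window, no credit unless arranged in advance.
    return 0.0

# Hypothetical due date: a Wednesday at 1pm.
due = datetime(2024, 9, 18, 13, 0)
one_day_late = due + timedelta(days=1)
print(8 * credit_multiplier(one_day_late, due))  # prints 6.0
```

So a pset that would have scored 8 on time scores 6 if it comes in during the late window.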

We will drop your lowest pset grade and replace it with your highest grade from a subsequent pset. This is different from just dropping your lowest grade. The aim is to incentivize you to keep trying to improve throughout the course, right up to the final pset.
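To make the difference from plain dropping concrete, here is a small Python sketch of the replacement rule. It is an illustration, not the actual gradebook calculation, and it makes two assumptions that the rule above does not spell out: if several psets tie for lowest, the earliest one is replaced, and if the lowest score is on the final pset (so there is no subsequent pset), nothing changes. The function names and scores are hypothetical.

```python
def apply_replacement(scores):
    """Return a copy of scores with the lowest one replaced by the best later score."""
    adjusted = list(scores)
    i = adjusted.index(min(adjusted))   # earliest lowest score (tie-break assumption)
    later = adjusted[i + 1:]            # scores from subsequent psets only
    if later:                           # assumption: no later pset means no change
        adjusted[i] = max(later)
    return adjusted

def mean(values):
    return sum(values) / len(values)

# Hypothetical scores: a rough first week followed by steady improvement.
scores = [6, 8, 9, 7, 10, 9, 8, 9, 10, 9, 9, 10]

replaced = apply_replacement(scores)                 # the 6 becomes a 10
dropped = sorted(scores)[1:]                         # plain "drop the lowest"

print("with replacement:", round(mean(replaced), 2))
print("with plain drop: ", round(mean(dropped), 2))
```

In this example the replacement rule gives a higher mean than simply dropping the 6, because the dropped week is filled in with a later 10 instead of being ignored. A low score on the last pset cannot be replaced under this reading, which is the incentive to keep working right up to the end.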

We expect that people taking the course have a range of different backgrounds. Some people have extensive Python programming experience, and some people have never programmed in any language before. The course is meant to be accessible to both sorts of people. If you don't have much programming experience, we will cut you slack, especially early in the course. That'll make the grading a little subjective. We don't see grades as some sort of objective professional-competency ruler; our grades are candy to arouse your competitive spirit. We're not gatekeepers; we're personal trainers. We want you to work hard and stretch yourself. We want your grade to reflect what you get out of the course, not what you brought into it.

Materials and access

There is no required textbook for the course. All lecture notes will be available online here on the mcb112.org web site. By Monday of each week, we'll post that week's lecture notes and the pset. On weeks that a pset is due, by about Friday we'll post answers. Links to all these appear in the navigation sidebar, and in the online schedule.

Lectures will be recorded. They'll appear shortly after each lecture in the Panopto tab of the Canvas site.

There is an Ed Discussion forum for asking questions (and getting answers, even!). The entire teaching team will be active on this forum for a wide range of different hours.

You need to have access to a computer (laptop or otherwise) that you can install a Python scientific data analysis environment on. We recommend the free Anaconda distribution. You should install Anaconda ahead of time. If you don't have a computer, the course has a limited number of Mac laptops available for lending. Do not hesitate or be embarrassed to ask me for one. (When I was in college, there was no way I could afford a computer; one of my professors kindly lent me one!)
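Once Anaconda is installed, a quick way to check that your environment is ready is to ask Python whether it can find the main packages. The sketch below is a suggestion rather than an official course script: the list of module names is an assumption based on the tools named in the course description, and `check_packages` is a name invented here.

```python
import importlib.util

# Top-level module names to look for. This list is an assumption; "notebook"
# is the module installed by the classic Jupyter Notebook package.
COURSE_MODULES = ("numpy", "scipy", "pandas", "notebook")

def check_packages(names=COURSE_MODULES):
    """Map each module name to True if Python can find it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name:10s} {'ok' if found else 'MISSING'}")
```

Run it with the Python that Anaconda installed. If anything shows up as MISSING, that is a good thing to bring to office hours or the discussion board before the first pset.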

You'll turn in your pset each week as a single Jupyter Notebook .ipynb file, submitted to the "Assignments" tab of our Canvas page.

There are many on-line resources for learning Python, but for books, we recommend:

  • Mark Lutz, Learning Python. O'Reilly Media, 2013. (5th edition)
  • Wes McKinney, Python for Data Analysis. O'Reilly Media, 2017. (2nd edition)

The Harvard library has electronic copies of both books available for students in the course. Our Canvas page links to them in the "Library Reserves" tab.

Getting started with Python

If you haven't programmed before and you want to get a head start on learning Python, I recommend the Harvard College Python Boot Camp. This is a short, online, four-lesson introduction. It's asynchronous, so you can do it any time that's convenient for you.

Academic integrity

You must do each week's sequence analysis project individually, on your own, rather than working collaboratively in groups. Your writing and your code must be your original work.

A goal of the course is to teach you how to understand and do biological data analysis yourself, without relying on interdisciplinary collaborations between people of disparate skills and interests. This is what the weekly sequence analysis projects are designed to push you to do.

As you're learning, though, you are free to talk with each other, to consult any resource, and to study code from other sources. This is how we learn anything. It's how you'll learn at every step of your future. We trust you to know the difference between asking a friend for help versus copying their answer.

The principle is that although we encourage you to learn in any way that you prefer, each week you must reach the point where you can understand and execute your work independently and originally; this independence and originality is what we'll be looking for.

We trust you. We expect you to act with honor and integrity. For example, it would not be hard to find previous versions of the course notes and problem sets online, but you should not go looking for them, and we trust you not to. We expect you are taking the course because you want to learn.

Auditing the course; pass/fail option

Auditing the course is not practical or meaningful. The course is entirely structured around the weekly psets and various ways of having the teaching staff help people do them. We only have teaching staff sufficient for registered students. That said, anyone is welcome to follow along with the material posted here at mcb112.org, or to sit in on the course lectures (space permitting).

You can take the course pass/fail. It's no problem from our perspective, but check with your concentration advisor to make sure it's ok from theirs. If you're taking MCB112 towards your concentration, you generally need to be taking it for a grade. There's a petition form from the registrar that I sign for you.

Simultaneous enrollment

If you want to take another course with lecture times that overlap with MCB112, the registrar may require you to get a signed "simultaneous enrollment" agreement. We cannot sign these, because signing would commit us to providing "hour-for-hour direct and personal compensatory instruction", which we are unable to do.

I really don't recommend signing up for an overlapping course. You should plan to attend lectures and engage with MCB112 fully. But in my view, it's up to you; provided that you're able to enroll with the registrar, you can navigate through MCB112 as you see fit.

Accommodations for disabilities

If you need additional accommodations because of a disability, please present your Faculty Letter from the Disability Access Office (DAO) and speak with Sean by the end of the second week of the term, so that we can help you in a timely manner.