MCB112 Biological Sequence Analysis

Lectures: Mon/Weds 3:00-4:15pm, Fri 3:00-3:50pm
Location: Biolabs 1080 lecture hall
Starting: Weds 4 September

teaching team      office hours             location
Prof. Sean Eddy    Thu 1-3p                 Biolabs 1033
Liana Merk         Mon 1-3p                 Biolabs 1033
Jenya Belousova    Tue 4-6p                 Biolabs 1033
Misha Gupta        Fri 11a-1p               Biolabs 1033
Jakob Heinz        Fri 1p-2p / Tue 3-4p     Biolabs 1033 (Fri) / HMS TMEC atrium (Tue)
Josh Park          Mon 4:15p-6:15p          Biolabs 1033

Office hours and locations are subject to change in the first few weeks as we settle in. Double-check this page before you go. You can also ask questions and get help on our Ed Discussion board for the course.

Description

Biology has become a computational science, requiring analysis of large data sets from genome sequencing and other technologies. This course teaches computational methods in biological sequence analysis, using an empirical and experimental framework suited to the complexities of biological data, emphasizing computational control experiments. The course is primarily aimed at biologists learning computational methods, but is also suited for computational and statistical scientists learning about biological sequence data.

Aims and objectives

MCB112 teaches fundamental principles of biological data analysis using realistic but pint-sized research examples. The course is built around 12 weekly data analysis problems. This year, our unifying biological thread is bacteriophage (bacterial virus) genome sequence analysis. Many of the phage genomes we will be using were collected by undergraduates in HHMI-sponsored SEA-PHAGES courses.

In the course of solving analysis problems, you will learn practical skills in how to write Python scripts to analyze biological data, and how to use simulations to do computational positive and negative control experiments. The course is taught in Python, using Python-based data science tools including NumPy, SciPy, Pandas, and Jupyter Notebook. You will learn how to do computational science: how to understand computational methods, how to design computational experiments, how to think critically, how to develop an organized work pattern, and how to communicate results reproducibly.

You will also learn some fundamentals of probabilistic inference, statistics, computer science, and applied math -- how to think about statistics from first principles, and how to read and understand an algorithm well enough to implement it. A not-so-hidden agenda is to motivate biologists to learn more mathematics, computer science, and statistics, by showing how these skills are relevant to biological data analysis.

Prerequisites and background

There are no formal prerequisites. We expect students to come from different backgrounds -- a mix of biologists, computer scientists, and applied mathematicians -- and to have varying degrees of experience in writing code in Python. The course is designed to bring students up to speed in any area that they haven't seen much of before.

MCB112 is designed to be a course that could come before other rigorous coursework in biology, programming, statistics, and applied math, even though we do a mix of biology, programming, stats, and math in the course. Underlying the course's design is a philosophy that a biologist (indeed anyone) is perfectly capable of learning enough math, programming, and statistics to do sophisticated data analyses, but many of us have trouble building up abstract skills without first knowing why we're doing it. MCB112 emphasizes practical data analysis problems, and though at times you may feel like you've been dropped into the deep end of the pool, you will learn by example why math/programming/stats skills grant you mutant superpowers for modern biology research. In part, we judge the success of MCB112 by how many of our students go on to take coursework in fields they wouldn't have dreamt of studying before.

However, all that said -- it would be tough to come into the course with no background at all. We expect you to have course background in either the molecular biology side or the stats/math/programming/CS side. We do molecular biology at the level of LS1; Python programming around the level of AM10, CS109, or CS50; statistics around the level of STAT110 and STAT111; multivariate calculus and linear algebra around the level of MA21 or AM21; and a wee taste of data structures and algorithms. The more of these things you've taken already, the easier MCB112 will be.

Flow of the course

Most of the work is outside of class on your own, working on the weekly data analysis problems.

The Monday lecture each week covers fundamental background you need to know for that week's problem. We expect you to start thinking about your approach to the analysis problem after the Monday lecture.

The Wednesday lecture will start to get more practical. I'll cover any computational or mathematical techniques that you need for the pset. We expect you to come with whatever questions you have from thinking about the problem so far, for discussion.

The Friday class is run by the teaching fellows, and is even more practical. They will typically walk through the problem set and give you more framework for how to approach it. They will walk you through suggested approaches and resources. For example, especially in the early weeks of the course when people are coming up to speed, we will show Python code examples of related problems. Because we expect some people will have never coded before, we'll aim to bring you up to speed quickly and practically by providing working Python code examples that you can study and adapt.

Problem sets come out each Monday. Your solution to each week's problem is due the following Wednesday at 1pm (9 days later). You can always turn in a pset early if you despise the overlap, but we figure that giving you a couple of days after the weekend is more civilized than making the pset due on Monday. It also gives you some time to ask us more questions on Monday/Tuesday. You submit your solution by uploading a Jupyter notebook page to our Canvas.

Grading

The grade is based entirely on the weekly data analysis problems. There are no exams or finals.

Each weekly problem is graded on a scale of 1-10:

10 proficient
9 correct but minimal comments or explanation
8 very minor errors or missing comments or explanation
7 minor errors or partially incorrect answers
6 major errors or incorrect answers
5 pseudocode with good comments, but no implementation (i.e. good pseudocode gets at least half credit)
4 minimal pseudocode
3 no pseudocode, but minimal idea sketching
2 any attempt
1 no answer

We will give extra credit of 0.5 or 1.0 for particularly excellent answers -- because you went above and beyond in writing a clear description, because your work or your code shows exceptional creativity, or because you did additional analyses or visualizations.

The final letter grade is based on the unweighted mean of the 12 weekly grades:

≥ 9.3 A
[9.0..9.3) A-
[8.7..9.0) B+
[8.3..8.7) B
[8.0..8.3) B-
[7.7..8.0) C+
[7.3..7.7) C
[7.0..7.3) C-
[6.7..7.0) D+
[6.3..6.7) D
< 6.3 F
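If you want to check where a given average lands, the cutoffs above can be sketched in a few lines of Python. This is only an illustration, not our actual gradebook code: the names CUTOFFS and letter_grade are made up for this example, and it assumes each interval is closed on the left and open on the right, as the bracket notation suggests.

```python
# Hypothetical helper, not the course's gradebook code.
# Each pair is (lowest mean that earns the letter, letter), highest first.
CUTOFFS = [
    (9.3, "A"), (9.0, "A-"),
    (8.7, "B+"), (8.3, "B"), (8.0, "B-"),
    (7.7, "C+"), (7.3, "C"), (7.0, "C-"),
    (6.7, "D+"), (6.3, "D"),
]


def letter_grade(mean_score):
    """Return the letter grade for a mean pset score."""
    # Walk down from the top cutoff; the first one we meet or exceed wins.
    for cutoff, letter in CUTOFFS:
        if mean_score >= cutoff:
            return letter
    return "F"  # anything below 6.3


print(letter_grade(9.3))   # A
print(letter_grade(8.65))  # B
print(letter_grade(6.0))   # F
```

One caveat: a mean computed in floating point can land a hair below a cutoff (9.299999... instead of 9.3), so a real implementation would probably round before comparing.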

We will accept late psets up to two days late (1pm Friday) for 75% credit (i.e. your score will be multiplied by 0.75).

We will drop your lowest pset grade and replace it with your highest grade from a subsequent pset. This is different from just dropping your lowest grade. The aim is to incentivize you to keep trying to improve throughout the course, right up to the final pset. (If your lowest pset grade is the last one, such that there's no subsequent one to replace it with, we'll just drop it and average your best 11 -- equivalent to replacing it with your average pset grade from the rest of the weeks.)
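The replacement rule is easier to see in code than in words. Here is a minimal sketch, assuming scores are listed in week order and that a tie for the lowest score resolves to the earliest week (the text above doesn't say how ties are handled, so that part is a guess). The function name adjusted_mean is made up for this example.

```python
# Hypothetical sketch of the drop-and-replace rule; not our gradebook code.
def adjusted_mean(scores):
    """Mean pset score after replacing the lowest score with the best later one.

    scores: weekly scores in week order (late penalties already applied).
    """
    scores = list(scores)
    if len(scores) < 2:
        raise ValueError("need at least two scores")

    lowest = min(scores)
    i = scores.index(lowest)      # assumption: ties go to the earliest week
    later = scores[i + 1:]        # psets that came after the lowest one

    if later:
        # Swap the lowest score for the highest subsequent score.
        adjusted = scores[:i] + [max(later)] + later
    else:
        # The lowest score was the last pset: drop it, average the rest.
        adjusted = scores[:-1]

    return sum(adjusted) / len(adjusted)


print(adjusted_mean([6, 8, 9, 10]))  # 9.25: the 6 becomes a 10
print(adjusted_mean([9, 10, 5]))     # 9.5: the final 5 is simply dropped
```

Notice that a weak early pset can be replaced by anything that comes after it, while a weak late pset has fewer later scores to draw on. That asymmetry is the point of the rule.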

We expect that people taking the course have a range of different backgrounds. Some people have extensive Python programming experience, and some people have never programmed in any language before. The course is meant to be accessible to both sorts of people. If you don't have much programming experience, we will cut you slack, especially early in the course. That means the grading is a little subjective. We don't see grades as some sort of objective professional-competency ruler; we use grades as candy to arouse your competitive spirit. We're not gatekeepers; we're personal trainers. We want you to work hard and stretch yourself. We want your grade to reflect what you get out of the course, not what you brought into it.

Absence policy

I expect you to attend classes in person, to ask questions, and to participate actively in discussion. You'll get way more out of the course. However, you can miss class if you need to. You don't need permission; we trust you to manage your own schedule.

Lectures are recorded. They appear shortly after each lecture in the Panopto tab of the Canvas site. If you have to miss a class, you can catch up using the lecture recording, combined with the on-line lecture notes.

You have 9 days for each pset, so we expect you to be able to plan accordingly if you know in advance that you need to miss some time during a week. However, if you need additional arrangements because of a longer absence or a sudden illness, we will consider such rare extenuating circumstances on a case-by-case basis. Our standard accommodation for an unusual absence is to grant a two-day extension for the pset to Friday 1pm. (Same as the standard late extension, but allowing 100% credit instead of 75%.) Because of the weekly rhythm of the course, it is impractical to grant longer extensions. In extreme situations, our next level of accommodation is to allow you to skip a pset altogether. I think we have only done this once ever in the course.

Materials and access

There is no required textbook for the course. All lecture notes will be available online here on the mcb112.org web site. By Monday of each week, we'll post that week's lecture notes and the pset. On weeks that a pset is due, by about Friday we'll post answers. Links to all these appear in the navigation sidebar, and in the online schedule.

There is an Ed Discussion forum for asking questions (and getting answers, even!). The entire teaching team will be active on this forum for a wide range of different hours.

You need to have access to a computer (laptop or otherwise) that you can install a Python scientific data analysis environment on. If you're starting out, we recommend the free Anaconda distribution. You should install Anaconda ahead of time. If you don't have a computer, the course has a limited number of Mac laptops available for lending. Do not hesitate or be embarrassed to ask me for one. (When I was in college, there was no way I could afford a computer; one of my professors kindly lent me one!) You can install your Python environment any other way you want too. (I use pip, myself.)
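Once you've installed something, a quick sanity check is to ask Python whether it can find the packages the course uses. The sketch below is one way to do that using only the standard library. The package list is an assumption based on the tools named in the course description; in particular, "notebook" is the import name we assume for Jupyter Notebook, and your setup may provide Jupyter some other way (JupyterLab, for instance).

```python
# Hypothetical environment check; the package list is our assumption.
import importlib.util
import sys

PACKAGES = ["numpy", "scipy", "pandas", "notebook"]


def check_environment(packages=PACKAGES):
    """Return {package name: True if Python can find it, else False}."""
    # find_spec looks a package up without actually importing it.
    return {name: importlib.util.find_spec(name) is not None
            for name in packages}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name:10s} {'ok' if found else 'MISSING'}")
```

If anything shows up as MISSING, install it with whatever tool you used to set up Python (conda or pip), then run the check again.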

You'll turn in your pset each week as a single Jupyter Notebook .ipynb file, submitted to the "Assignments" tab of our Canvas page.

Getting started with Python

Never written in Python before? Never done any computer programming before at all? The course is set up to give you an on-ramp of 3-4 weeks to come up to speed.

If you are good at self-teaching, start with the online tutorial from the Python team, and bookmark the official Python documentation.

If you want to take a free on-line course, one highly regarded one is:

and Harvard CS50 is also freely available on-line, and so is MIT's "Introduction to Computer Science and Programming in Python".

If you learn best from books, two books we recommend are:

  • Eric Matthes, Python Crash Course. No Starch Press, 2023. (3rd edition)
  • Wes McKinney, Python for Data Analysis. O'Reilly and Associates, 2022. (3rd edition)

We're aiming to have electronic copies of both books available through the Harvard Library. (The library is processing our request now.) The course Canvas page links to them in the "Library Reserves" tab.

Academic integrity

You must do each week's sequence analysis project individually, on your own, rather than working collaboratively in groups. Your writing and your code must be your original work.

A goal of the course is to teach you how to understand and do biological data analysis yourself, without relying on interdisciplinary collaborations between people of disparate skills and interests. This is what the weekly sequence analysis projects are designed to push you to do.

As you're learning, though, you are free to talk with each other, to consult any resource, and to study code from any other sources. This is how we learn anything. It's how you'll learn at every step of your future. We trust you to know the difference between asking a friend for help versus copying their answer.

Your "friends" include ChatGPT and other AIs. We encourage the use of AIs in this course. Learn to use them. Learn their strengths and weaknesses. They make great suggestions and help you to learn things you didn't know you could do. They will also make horrific and stupid mistakes. You can't trust the code they generate. You don't want to copy it, because it will likely be wrong. Study what the AI does, learn from it, then put it aside and write your own code.

The principle is that although we encourage you to learn in any way that you prefer, each week you must reach the point where you can understand and execute your work independently and originally; this independence and originality is what we'll be looking for.

We trust you. We expect you to act with honor and integrity. For example, it may not be hard to find previous versions of the course notes and problem sets online, but you should not go looking for them, and we trust you not to. We expect you are taking the course because you want to learn.

Auditing the course; pass/fail option

Auditing the course is not practical or meaningful. The course is entirely structured around the weekly psets and various ways of having the teaching staff help people do them. We only have teaching staff sufficient for registered students. That said, anyone is welcome to follow along with the material posted here at mcb112.org, or to sit in on the course lectures (space permitting).

You can take the course pass/fail. It's no problem from our perspective, but check with your concentration advisor to make sure it's ok from theirs. If you're taking MCB112 towards your concentration, you generally need to be taking it for a grade. There's a petition form from the registrar that I sign for you.

Simultaneous enrollment

If you want to take another course with lecture times that overlap with MCB112, the registrar may require you to get a signed "simultaneous enrollment" agreement. We cannot sign these. The registrar requires us to provide "hour-for-hour direct and personal compensatory instruction", which we are unable to do.

I really don't recommend signing up for an overlapping course. You should plan to attend lectures and engage with MCB112 fully. But in my view, it's up to you; provided that you're able to enroll with the registrar, you can navigate through MCB112 as you see fit.

Accommodations for disabilities

Students needing additional accommodations because of a disability should present their Faculty Letter from the Disability Access Office (DAO) and speak with Sean by the end of the second week of the term, so that we can help in a timely manner.