pset 07: the adventure of the elusive motif

G. Lestrade, the well meaning but bumbling postdoc in your lab, has been trying to develop a method for automatically identifying a PWM (position-specific probabilistic weight matrix) for a consensus ribosome binding site, given a set of sequences just upstream of the annotated translation initiation codon of each coding region in a phage genome.

He has been using three phage as his examples: phage T4 (169kb, 35% GC), and two SEA-PHAGE, AluminumJesus (62kb, 67% GC) and HangryHippo (132kb, 35% GC). He has extracted a set of upstream sequences from each phage, 15 to 40nt upstream of the annotated start codons for every annotated coding gene. You can download his files here:

Based on what we know of bacterial ribosome binding sites (RBS), we expect the RBS to be about 4-8nt long, and we expect an approximate probabilistic consensus of something like GGAGG.

Lestrade has been trying to use k-mer abundance to detect an enriched sequence motif in these upstream regions. However, he is struggling with highly biased genomes like T4 (which is AT-rich) and AluminumJesus (which is GC-rich), where simple k-mer abundance is being confounded by abundant AT-rich kmers in T4 and GC-rich k-mers in AluminumJesus. For example, in the T4 dataset, the most abundant 5-mers are TATAA, AGGAA, GAGGA, and ATAAA; he suspects the AGGAA and GAGGA might correspond to RBS's, but the TATAA and ATAAA seem unrelated and more likely to be just AT-rich background. And of course, exact-match k-mers aren't the most powerful way to go anyway, since we're expecting a probabilistic consensus instead of an exact one.

You tell Lestrade that probabilistic inference using an expectation-maximization algorithm might be the way to go, and you set out to show him how to do it.

1. Implement a expectation-maximization motif finding algorithm

Following the lecture notes from this week, implement an expectation-maximization algorithm for identifying a sequence motif.

Have your implementation take the width of the motif W as an input argument, so you can vary this easily in your analyses.

The pieces of your implementation will likely include:

  • a routine to calculate a posterior distribution for the inferred motif position on an input sequence

  • a routine to collect expected counts at each position of a new position-specific motif matrix, given those posterior distributions for each input sequence

  • a routine to take a position-specific motif matrix containing these expected counts, and estimate probabilities from those counts

The first two routines are the "expectation" step (infer motif positions and expected counts, given the current motif model). The last routine is the "maximization" step (infer a new motif model, given expected counts).

To start the algorithm, you will also need a routine to initialize a new random motif matrix, using one of the options discussed in class and in the lecture notes.

To end the algorithm, you will need to implement a convergence criterion of some sort. Even a simple one of running a fixed number of iterations will suffice here, if you choose a large enough number.

Have your implementation calculate and report the relative entropy $\sum_{k=1}^W \sum_a p_k(a) \log_2 \frac{p_k(a)}{f(a)}$ of the identified PWM $p_k(a)$ (positions $k$, residues $a$) compared to the background residue frequencies $f(a)$. This relative entropy is a suitable "score" to report for the identified motif.

Remember that EM is a local optimizer. You want to run it at least a few different times from different random starting points, and choose the best solution. One good definition of "best" would be the highest scoring (highest relative entropy) motif. Another definition would be the model with the highest $\log P(\mathrm{data})$.

2. Do a synthetic positive control

Create a known motif PWM of some length W (4-8nt, say). To make a synthetic positive sequence, embed one sampled motif from the PWM at a random position in a larger random (i.i.d.) sequence of length L of a given background residue composition. Generate a synthetic positive control dataset of N sequences of length L, for N and L comparable to the size of Lestrade's datasets.

Analyze your synthetic positive control dataset with your EM algorithm, choosing W appropriately. Do you identify the motif that you embedded in the sequences?

3. Do a synthetic negative control

Generate a synthetic negative control dataset with similar characteristics as the dataset above, but without embedding any motif (i.e. all N sequences are just iid random sequences of length L, of some given background composition).

Analyze your synthetic positive control dataset with your EM algorithm, for an appropriate choice of W that you can compare to your results above. What is the score of the best identified motif in negative control data, compared to positive control data?

4. Identify the ribosome binding sites in Lestrade's phage

Analyze Lestrade's three datasets, for an appropriate choice of W. What is your best inference for the ribosome binding site consensus motif in each phage?

hints

For my positive controls, I generated about N=200 sequences of length L=50. Other choices are reasonable too, so long as they're in the ballpark of the real datasets. For my synthetic motif PWM, I started from a consensus sequence, and made a PWM with probability $p$ for the consensus residue at each position of the motif, and $\frac{1-p}/3$ for the other residues. For my background probabilities, I tried different ones using the GC% composition of Lestrade's three phage.

For choosing W, I suggest exploring different choices from 4-8. If you choose a W that's smaller than the conserved motif, you will tend to find overlapping subsets of the motif in different EM runs. For example, if the consensus is GATTACA and you choose W=5, different runs may identify GATTA, ATTAC, and TTACA. If you choose a W that's larger than the conserved motif, the motif you infer will have some random-ish extra positions on either end. For example, if the true consensus is GATTACA and you choose W=8, different runs may identify nGATTACA and GATTACAn.

allowed modules

Our list of allowed modules now includes NumPy, Matplotlib, Pandas, Seaborn, and SciPy's stats and special modules.

For this one, you won't need Pandas, and you only need to plot if you really want to.

You may want SciPy's special.logsumexp() for summing probabilities that you're storing as log probabilities.

import numpy as np
import matplotlib.pyplot as plt     # optional
import scipy.stats as stats         # probably don't need
import scipy.special as special     # maybe, for logsumexp()
import seaborn as sns               # optional
import pandas as pd                 # optional

%matplotlib inline

turning in your work

Submit your Jupyter notebook page (a .ipynb file) to the course Canvas page under the Assignments tab. Name your file <LastName><FirstName>_<psetnumber>.ipynb; for example, mine would be EddySean_07.ipynb.

We grade your psets as if they're lab reports. The quality of your text explanations of what you're doing (and why) are just as important as getting your code to work. We want to see you discuss both your analysis code and your biological interpretations.

Don't change the names of any of the input files you download, because the TFs (who are grading your notebook files) have those files in their working directory. Your notebook won't run if you change the input file names.

Remember to include a watermark command at the end of your notebook, so we know what versions of modules you used. For example,

%load_ext watermark
%watermark -v -m -p jupyter,numpy,matplotlib