MCB112: Biological Data Analysis (Fall 2017)

week 09:

maximizing your expectations

reading for this week

probabilistic graphical models

Figure 1 from Li et al. (2010) is a probabilistic graphical model, also known as a Bayesian network. A lot can be said about probabilistic graphical models – one good doorstopping textbook is from Koller and Friedman – but we only need to know some basic essentials right now, and the basic essentials are straightforward.

The directed probabilistic graphical model from Li et al. (2010). "Directed" means there are arrows representing dependencies as conditional probabilities, as opposed to "undirected" graphs, which only show dependencies as connections (undirected edges).

We use a graphical model as an intuitively powerful way to state our assumptions about what the random variables are, and where the statistical dependencies are (and aren't) in an inference problem. Specifically, in a so-called directed graphical model, we use nodes (random variables) and arrows (directed edges, representing conditional probability dependencies) to think about how we're going to start with an (often nastily complex) joint probability distribution over all our random variables, and factor that down into a product of conditional probability distributions that we can work with.
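As a generic illustration of that factoring step (not yet the Li et al. model, just the chain rule of probability for three arbitrary random variables):

$$
P(A, B, C) = P(C \mid A, B)\, P(B \mid A)\, P(A) = P(A \mid B, C)\, P(B \mid C)\, P(C).
$$

Any such ordering is exact; a directed graphical model corresponds to picking one ordering and then deleting some of the conditioning variables from each term, and those deletions are exactly our independence assumptions.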

the biological problem: ambiguous read mapping

Let’s start with the specific example of Li et al. (2010). First let’s state the RNA-seq analysis problem they’re trying to deal with, in terms of how we imagine that the observed data are generated. That is, let’s state our generative probability model for the RNA-seq data.

The problem that they're trying to tackle is that just because I observe a read sequence $R_n$, I don't necessarily know which transcript isoform to assign that read to. Transcripts can share identical stretches of sequence because of overlapping isoforms, or because of repetitive elements. Multimapped reads are reads that the read mapping software can't place uniquely, instead mapping them to two or more possible sources. Li and Dewey's framing distinguishes gene multireads, which map to different genes (because of repetitive elements), from isoform multireads, which map to different isoforms of the same locus (because of overlapping exons).

Making the problem worse, DNA sequencing isn’t 100% accurate. We have to worry about similar matches, not just identical matches. Read mappers have to tolerate a small number of mismatches, typically one or two. Now a read might map to similar sequences, not just identical ones; and now we also have useful probabilistic information about the source. Depending on a sequencing error model that we could parameterize, we could guess that the read’s true source is more likely to be more similar (or identical), compared to a less similar potential source. It’s even possible, with enough sequencing error, that we could mismap a read only to false sources, and miss the true one.

the generative model

Li et al. state their generative model for one observed read (read $n$ of $N$ total) as follows:

- choose a transcript $G_n = i$ from the $M$ transcripts in the transcriptome, with probability $\theta_i$;
- choose a start position $S_n$ on that transcript;
- choose an orientation $O_n$ (which strand the read is from);
- generate the observed read sequence $R_n$ from the transcript sequence, given that start position and orientation, allowing for some sequencing error.

the random variables

Thus we have the following random variables:

- $R_n$: the observed read sequence, for each read $n = 1..N$;
- $G_n$: the (unknown) transcript that read $n$ came from;
- $S_n$: the (unknown) start position of read $n$ on that transcript;
- $O_n$: the (unknown) orientation of read $n$;
- $\theta$: the transcript abundances $\theta_i$, for each transcript $i = 1..M$.

In any modeling problem, some of our variables are observed data, some of our variables are unknown model parameters, some are fixed model parameters, and some are hidden (or latent) variables that represent the process by which the model generates observed data, but are not directly observed themselves.

Here, the read sequences $R_n$ are observed data. The transcript abundances $\theta$ are the unknown model parameters. The process that gets us to observed data requires specifying three hidden variables per read: the transcript source $G_n$, the start position $S_n$, and the orientation $O_n$.

The transcript lengths $\ell_i$ are an example of fixed model parameters that we assume are known. We could keep them in our conditional probability expressions below, to be very explicit, but for conciseness, we're not going to.

the likelihood of the model

The likelihood of the model is the probability that the model generates the observed data $R$, given the model parameters $\theta$:

$$
P(R \mid \theta),
$$

but with our hidden variables in play, what we actually know how to write down is

$$
P(R, G, S, O \mid \theta),
$$

the joint probability of the data and the unknown hidden variables. To get the likelihood, we have to marginalize (sum out) the unknown hidden variables:

$$
P(R \mid \theta) = \sum_{G} \sum_{S} \sum_{O} P(R, G, S, O \mid \theta).
$$
Hidden variables are sometimes called nuisance variables: variables that have to be specified in order to calculate a likelihood as an actual number, but whose values are not relevant or interesting to us (at least at the moment), so we marginalize over their uncertainty to make them disappear.

factorizing and stating independence assumptions

If we literally needed to specify $P(R, G, S, O \mid \theta)$ directly, we would have to specify (or at least imagine) an enormous multidimensional table of probabilities for all possible combinations of outcomes of $R$, $G$, $S$, and $O$, even if we only had just one choice of $\theta$. Then we'd have one of those tables for each possible choice of $\theta$. Since $\theta$ is a vector of continuous real-valued probabilities, we'd have an infinite number of tables. This is not good.

This is why we stated the generative model as a succession of steps above: each step only needs to depend on the outcomes of the steps before it.

So, let's do this. Without making any assumptions (yet), we can correctly factorize our joint probability into a product of conditional probabilities in different ways, including one that corresponds to the order that we're thinking of our generative process in:

$$
P(R_n, G_n, S_n, O_n \mid \theta) = P(R_n \mid G_n, S_n, O_n, \theta)\, P(O_n \mid G_n, S_n, \theta)\, P(S_n \mid G_n, \theta)\, P(G_n \mid \theta).
$$
This gives us our joint probability in terms of a product of conditional probabilities of our individual random variables.

Now we can walk through each conditional probability term and formally state our independence assumptions, which make our model tractable. From right to left:

- $P(G_n \mid \theta)$: which transcript the read comes from depends only on the abundances; we assume $P(G_n = i \mid \theta) = \theta_i$.
- $P(S_n \mid G_n, \theta)$: we assume the start position depends only on which transcript was chosen (for example, uniform over that transcript's possible start positions), not on $\theta$: $P(S_n \mid G_n, \theta) = P(S_n \mid G_n)$.
- $P(O_n \mid G_n, S_n, \theta)$: we assume the orientation depends only on the chosen transcript (and the library protocol), not on the start position or on $\theta$: $P(O_n \mid G_n, S_n, \theta) = P(O_n \mid G_n)$.
- $P(R_n \mid G_n, S_n, O_n, \theta)$: once we know the transcript, start position, and orientation, we assume the observed read sequence depends only on those (through the sequencing error model), not on $\theta$: $P(R_n \mid G_n, S_n, O_n, \theta) = P(R_n \mid G_n, S_n, O_n)$.

Now look at the Li et al. Figure 1 again. Its nodes and arrows show exactly these conditionings (after dropping the conditioning on fixed model parameters like the transcript lengths $\ell$, because they're fixed). The label on a node is the stuff to the left of a $\mid$, the probability distribution of this random variable; the arrows that come into the node are the stuff to the right of the $\mid$, the conditional dependencies. A directed graphical model is just a convenient pictorial representation of the conditional dependencies we want to assume in our model.


Given a probability model, there are lots of things I might want to do with it. For example, given observed data $R$, I might want to know the posterior distribution of the unknown parameters $\theta$. I'd write down Bayes' rule for getting the posterior distribution $P(\theta \mid R)$, in terms of the likelihood, the prior, and the marginal likelihood. I'd be forced to specify a prior probability distribution $P(\theta)$ to do this.

If we could specify the prior and brute force enumerate or sample the space of all possible parameters (like in the Student's game night pset, where we used a coarse-grained discretization of two parameters), we could get this posterior probability distribution by calculating the likelihood times the prior for all possible choices of $\theta$ and normalizing. But here $\theta$ is an $M$-dimensional probability vector, for $M$ on the order of tens of thousands, the number of different transcript isoforms in the transcriptome. We're not going to enumerate all the choices of $\theta$, even if we did assume a coarse-grained discretization of some sort. Even supposing we did some sort of binary discretization like high/low, there'd still be $2^M$ possible settings of $\theta$ we'd have to evaluate.

So this posterior inference problem is still a high-class problem for us – buzzwords that we'll see for solving problems like that are things like Gibbs sampling and Markov chain Monte Carlo – and we're going to sneak up on it instead. Suppose, instead of getting the posterior distribution for $\theta$, I was happy to just make a point estimate of an optimal $\theta$.

The first thing we have to worry about is, what do you mean optimal?

maximum likelihood parameter estimation

We've seen that a standard definition of optimal is maximum likelihood: find the values of the parameters $\theta$ that maximize the likelihood $P(R \mid \theta)$ for the observed data $R$.

Suppose I told you the count $c_i$ of reads that truly map to each transcript $i$. Then you'd simply estimate $\hat{\theta}_i = \frac{c_i}{N}$, for $N$ total reads. That's the maximum likelihood estimate for $\theta$ – but I gave you the counts $c_i$, which means I had to give you the hidden values of $G_n$ for all the reads $n$. If we want to be super formal we can write these unknown "true counts per transcript" as:

$$
c_i = \sum_{n=1}^{N} \delta(G_n = i),
$$

where $\delta(G_n = i)$ is a so-called Kronecker delta function, often seen as a notational trick in counting subsets of things: it takes the value 1 when its condition is true, and 0 when its condition is false. Here, "count all the reads $n$ that are assigned to transcript $i$; that sum is $c_i$".
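For instance, if someone did hand you the hidden assignments, tallying the counts is a one-liner. A minimal sketch (the tiny arrays here are hypothetical, just for illustration):

   import numpy as np

   G = np.array([0, 2, 2, 1, 0, 2])     # hypothetical known assignments G_n for N=6 reads
   M = 3                                # number of transcripts
   c = np.bincount(G, minlength=M)      # c_i = sum_n delta(G_n = i)  ->  array([2, 1, 3])
   theta_hat = c / len(G)               # maximum likelihood estimate of theta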

Your problem is that you don't know the counts $c_i$, because the values of the $G_n$ are hidden and unknown.

inference of hidden variables

But, given observed data $R_n$, and if I told you $\theta$, you could infer the hidden $G_n$ probabilistically. By manipulating the joint probability $P(R_n, G_n \mid \theta)$, you can solve for the posterior $P(G_n \mid R_n, \theta)$ in terms of quantities we can calculate:

$$
P(G_n = i \mid R_n, \theta) = \frac{P(R_n \mid G_n = i)\, \theta_i}{P(R_n \mid \theta)},
$$

where the denominator there is just a renormalization, a marginal sum over the numerator:

$$
P(R_n \mid \theta) = \sum_{j} P(R_n \mid G_n = j)\, \theta_j.
$$
Alas, we don't know how to put numbers directly to $P(R_n \mid G_n)$; our model is such that we can put numbers to $P(R_n \mid G_n, S_n, O_n)$, in terms of the additional nuisance variables $S_n$ (the read start position) and $O_n$ (the orientation). We get our likelihood term by marginalization over the nuisance variables:

$$
P(R_n \mid G_n) = \sum_{S_n} \sum_{O_n} P(R_n \mid G_n, S_n, O_n)\, P(S_n \mid G_n)\, P(O_n \mid G_n).
$$
So the computation would be to just calculate the probability $P(R_n \mid G_n = i)\, \theta_i$ for each possible assignment of observed read $n$ to transcript $i$ – now you have a vector of $M$ likelihood terms – and you renormalize them to get the posterior distribution $P(G_n = i \mid R_n, \theta)$.
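In code, that renormalization for a single read might look like the following sketch, assuming we have already computed the likelihood $P(R_n \mid G_n = i)$ for each transcript $i$ (the numbers are made up):

   import numpy as np

   lik_n = np.array([1e-6, 2e-6, 0.0])   # hypothetical P(R_n | G_n = i) for M=3 transcripts
   theta = np.array([0.5, 0.3, 0.2])     # current guess at the abundances

   post = lik_n * theta                  # numerator: P(R_n | G_n = i) * theta_i
   post = post / post.sum()              # posterior P(G_n = i | R_n, theta)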

This is your uncertainty about the assignment of observed read $n$ to underlying transcript $i$. Now what you could do is assign each transcript a fraction of a counted read, weighted by that uncertainty. You'd be calculating the expected value of the counts $c_i$, given your uncertainty about the mapping:

$$
\hat{c}_i = \sum_{n=1}^{N} P(G_n = i \mid R_n, \theta).
$$

And then you could renormalize the expected counts $\hat{c}_i$ (divide by $N$) to get your maximum likelihood estimate of the $\theta_i$.

Which is all just great, except that you had to know $\theta$ in the first place to get your expected counts $\hat{c}_i$ – but $\theta$ is exactly what we're trying to estimate!

expectation maximization

We have a chicken and egg problem. If I told you the unknown counts $c_i$ (that is, the hidden assignments $G_n$), you could estimate the unknown $\theta$ by maximum likelihood. If I told you the unknown $\theta$, you could estimate the expected counts $\hat{c}_i$ by probabilistic inference from the observed read data $R$. But both the $\theta$ and the $c_i$ are unknown.

Amazingly enough, the following apparently dumb solution works:

- start from an arbitrary initial guess at $\theta$;
- expectation (E) step: given the current $\theta$, calculate the posterior assignment probabilities $P(G_n = i \mid R_n, \theta)$ and collect expected counts $\hat{c}_i$;
- maximization (M) step: given those expected counts, re-estimate $\theta_i = \hat{c}_i / N$ by maximum likelihood;
- iterate the E and M steps until $\theta$ stops changing.

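Here is a minimal sketch of that iteration in Python, under the simplifying assumption that we have precomputed a matrix lik with lik[n,i] = P(R_n | G_n = i) (that is, the nuisance variables S_n and O_n already marginalized out, with zeros wherever read n cannot come from transcript i):

   import numpy as np

   def em_abundances(lik, n_iter=100):
       """EM estimate of transcript abundances theta.
       lik[n,i] = P(R_n | G_n = i); rows are reads, columns are transcripts.
       Assumes every read is compatible with at least one transcript."""
       N, M  = lik.shape
       theta = np.full(M, 1.0 / M)                  # arbitrary starting guess
       for _ in range(n_iter):
           W = lik * theta                          # E step: unnormalized posteriors
           W = W / W.sum(axis=1, keepdims=True)     #   P(G_n = i | R_n, theta)
           c = W.sum(axis=0)                        #   expected counts c_i
           theta = c / N                            # M step: maximum likelihood update
       return theta

A multimapped read contributes a fractional count to each of its compatible transcripts, apportioned by the current theta; a uniquely mapped read contributes a full count to its one transcript.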
As we've seen before in k-means week, this is Expectation Maximization, or EM for short. It was formally and generally introduced in a famous paper by Dempster, Laird, and Rubin (1977), though as you might imagine from its apparent simplicity, it had shown up in various guises before.

Generally, EM is a local optimizer. For many problems that it is applied to, different starting guesses may yield different optima, if the posterior probability surface for $\theta$ is rough. It happens that in this case, Li et al. (2010) prove that there is only a single optimum for $\theta$ ("we prove that the likelihood function for our model is concave with respect to $\theta$", in the jargon), so EM reaches a global optimum.

motif identification by expectation maximization

In the Li et al. (2010) problem, we have three latent variables ($G_n$, $S_n$, $O_n$). We only need to infer counts for one of them ($G_n$); we have to push the other two around and get them marginalized out. In the homework problem this week, we work through a more abstracted version of the problem that gets rid of the orientation and the sequencing error model, reducing the problem to a single hidden variable (the unknown transcript isoform that a read came from). We saw that k-means clustering was also stated in terms of a single hidden variable (the unknown cluster assignment for a data point). To beat this horse thoroughly, it might be useful to see another example of an important biological data analysis problem where there's only one hidden variable, in addition to our observed data and our unknown model parameters.

Here's the most basic case of the motif identification problem: we're given $N$ DNA sequences $X_1 \ldots X_N$. Each individual DNA sequence $X_n$ is a string of $L$ residues, $x_1 \ldots x_L$.

Embedded somewhere in each DNA sequence is exactly one motif, of length $W$. A motif might be a protein binding site, for example. A DNA-binding protein prefers certain sequences, and many DNA-binding proteins, including transcription factors, have distinctive site-specific preferences for individual residues: for instance we might have a protein that likes GaattCC, where it cares a lot more about the G and CC than it cares about the aatt.

$W$ is short: typically 6-10 residues for DNA-binding proteins. $L$ is long-ish: experiments like ChIP-seq experiments (where we recover DNA fragments bound to a protein of interest, and sequence the bound fragments) might give us sequences of length $L$ on the order of a few hundred residues.

The game is to identify the W-length motif that all the sequences share, even though the motif is a probabilistic pattern of preferences, not merely the same identical subsequence everywhere.

A good approximate model of the preferences of a DNA-binding protein is a position-specific weight matrix (PWM). The PWM is a generative probability model of a sequence of length $W$. In a PWM, we assume that each residue is generated independently, with probabilities $p_k(a)$ for the four residues $a \in \{A, C, G, T\}$ at each position $k$, i.e.:

$$
P(x_1 \ldots x_W \mid p) = \prod_{k=1}^{W} p_k(x_k),
$$

so therefore, the PWM consists of $W \times 4$ parameters $p_k(a)$: 4 parameters, for the probability of getting A, C, G, or T, at each position $k = 1..W$.
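As a quick concrete sketch, here is how a PWM assigns a probability to one $W$-length subsequence (the PWM numbers here are made up for illustration):

   import numpy as np

   # hypothetical PWM for W=3: rows are motif positions k, columns are A,C,G,T
   pwm = np.array([[0.80, 0.10, 0.05, 0.05],
                   [0.10, 0.10, 0.70, 0.10],
                   [0.25, 0.25, 0.25, 0.25]])
   idx = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

   motif = "AGT"
   prob  = np.prod([pwm[k, idx[a]] for k, a in enumerate(motif)])   # product over positions k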

We can specify the unknown location of the motif in each larger sequence by its start position $s_n$. The possible start positions range from $1$ to $L - W + 1$ (there's an edge effect; you can't start too close to the end and still have a complete motif). So each sequence is composed of three segments: the first $s_n - 1$ residues are random, the next $W$ residues are the motif, a sample from the PWM, and the remaining $L - W - s_n + 1$ residues are random.

If we wanted to be super compulsive, we could now state the probability of generating an observed sequence $X_n$, by stating that the random (non-motif) residues in it are generated with probability 0.25 for each base A,C,G,T. Then:

$$
P(X_n \mid s_n, p) = 0.25^{\,s_n - 1} \left[ \prod_{k=1}^{W} p_k(x_{s_n + k - 1}) \right] 0.25^{\,L - W - s_n + 1},
$$

but you'll notice that there are always $L - W$ random non-motif residues, so really this is just a constant $0.25^{L-W}$ times the probability that the PWM assigns to the possible motif starting at position $s_n$:

$$
P(X_n \mid s_n, p) = 0.25^{\,L - W} \prod_{k=1}^{W} p_k(x_{s_n + k - 1}),
$$

and we can even drop the constant and just deal with a proportionality, since it'll turn out that we'll be renormalizing the likelihood for other reasons anyway:

$$
P(X_n \mid s_n, p) \propto \prod_{k=1}^{W} p_k(x_{s_n + k - 1}).
$$
So there's our data likelihood for one sequence $X_n$: the probability of $X_n$, given that it contains one motif at position $s_n$ and given the PWM $p$. The likelihood for $N$ independent sequences is the product over all of them.

If I told you the unknown motif positions $s_n$ for a data set $X_1 \ldots X_N$, you could extract the individual motifs, align them (put them in register, in columns $1..W$), and collect the observed counts $c_k(a)$ of residues for each motif position $k$ and each residue $a$. Given those counts, your maximum likelihood estimate for the PWM is just to use the residue frequencies at each position:

$$
\hat{p}_k(a) = \frac{c_k(a)}{\sum_{a'} c_k(a')} = \frac{c_k(a)}{N}.
$$

That is, if you find the motifs in 100 sequences, and at motif position 1 you see 80 A, 10 C, 10 G, and 0 T, then you estimate probabilities $p_1 = (0.8, 0.1, 0.1, 0.0)$ for (A, C, G, T).
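In code, the counting-and-normalizing is short. A sketch, with hypothetical aligned motifs:

   import numpy as np

   motifs = ["GAATTCC", "GTATTCC", "GAAATCC"]        # hypothetical aligned W-mers, one per sequence
   idx    = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
   W      = len(motifs[0])

   counts = np.zeros((W, 4))
   for m in motifs:
       for k, a in enumerate(m):
           counts[k, idx[a]] += 1                    # observed counts c_k(a)
   pwm = counts / counts.sum(axis=1, keepdims=True)  # ML estimate p_k(a) = c_k(a) / N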

If instead I told you the unknown PWM $p$, then you could calculate your expected counts via inferring the unknown position $s_n$ of the motif in each sequence:

$$
P(s_n \mid X_n, p) = \frac{P(X_n \mid s_n, p)\, P(s_n)}{\sum_{s'} P(X_n \mid s', p)\, P(s')}.
$$

$P(X_n \mid s_n, p)$ is the data likelihood, which we already stated above. $P(s_n)$ is the prior for $s_n$, and we can state that a priori (before we've seen any sequence $X_n$), all positions are equiprobable ($P(s_n) = \frac{1}{L - W + 1}$), thus constant, and this constant is present in both the numerator and all terms of the denominator, so it disappears. (Similarly, the proportionality constant $0.25^{L-W}$ we talked about for our likelihood is going to disappear.) The denominator is just the sum of the numerator over all choices of position $s'$. So:

$$
P(s_n \mid X_n, p) = \frac{\prod_{k=1}^{W} p_k(x_{s_n + k - 1})}{\sum_{s'=1}^{L - W + 1} \prod_{k=1}^{W} p_k(x_{s' + k - 1})},
$$
which means, in pseudocode-ish words: for each possible start position $s$ in sequence $X_n$, score the $W$-mer starting at $s$ by the product of its PWM probabilities; then divide each score by the sum of the scores over all start positions, so they sum to one.

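That computation, as a Python sketch (reusing a pwm array and idx mapping like the ones in the PWM sketch above):

   import numpy as np

   def start_posterior(seq, pwm, idx):
       """Posterior P(s | X_n, p) over motif start positions in one sequence (a string over ACGT)."""
       W = pwm.shape[0]
       scores = np.array([np.prod([pwm[k, idx[seq[s + k]]] for k in range(W)])
                          for s in range(len(seq) - W + 1)])
       return scores / scores.sum()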
Given a posterior distribution $P(s \mid X_n, p)$ over the uncertain motif location in each observed sequence $X_n$, you collect expected counts $\hat{c}_k(a)$ by apportioning one total motif count per sequence across the possible positions $s$ according to that uncertainty:

$$
\hat{c}_k(a) = \sum_{n=1}^{N} \sum_{s=1}^{L - W + 1} P(s \mid X_n, p)\, \delta\!\left(x^{(n)}_{s + k - 1} = a\right),
$$

(writing $x^{(n)}_j$ for residue $j$ of sequence $X_n$), which may look harder than it really is, with that Kronecker delta in the notation. In words: for each sequence $n$ and each possible start position $s$, add the weight $P(s \mid X_n, p)$ to the count $\hat{c}_k(a)$ for each motif column $k$, where $a$ is the residue that sequence $n$ has at position $s + k - 1$. Each sequence contributes a total of one count to each column, split across the possible positions.

The EM algorithm is now essentially the same as we saw above for the RNA-seq inference problem:

- initialize the PWM $p$ to something (a random guess, for example);
- E step: for each sequence $X_n$, compute the posterior $P(s \mid X_n, p)$ over motif start positions, and collect expected counts $\hat{c}_k(a)$;
- M step: re-estimate the PWM from those expected counts, $p_k(a) = \hat{c}_k(a) / N$;
- iterate until the estimates stop changing.

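Here is a compact sketch of the whole motif EM in Python, pulling the pieces above together (random initialization, made-up function and variable names; a bare-bones version of the idea, not a polished motif finder):

   import numpy as np

   def motif_em(seqs, W, n_iter=200, seed=0):
       """EM for one ungapped motif of width W, assuming exactly one occurrence per sequence.
       seqs: list of strings over A,C,G,T. Returns the estimated W x 4 PWM."""
       rng = np.random.default_rng(seed)
       idx = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
       pwm = rng.dirichlet(np.ones(4), size=W)               # random starting PWM, shape (W, 4)
       for _ in range(n_iter):
           counts = np.zeros((W, 4))
           for seq in seqs:
               # E step: posterior over start positions in this sequence
               scores = np.array([np.prod([pwm[k, idx[seq[s + k]]] for k in range(W)])
                                  for s in range(len(seq) - W + 1)])
               post = scores / scores.sum()
               # apportion one motif count per sequence across positions, weighted by the posterior
               for s, w in enumerate(post):
                   for k in range(W):
                       counts[k, idx[seq[s + k]]] += w
           pwm = counts / counts.sum(axis=1, keepdims=True)  # M step: frequencies from expected counts
       return pwm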
bonus: sampling random probability vectors from the Dirichlet

An interesting probability sampling problem arose when we said things like "initialize the starting $\theta$ to something random". We don't need to sample our starting guess uniformly for the Li et al. (2010) EM algorithm – any guess will do – but suppose we did want to make a completely random guess at $\theta$? How do we choose a uniformly random multinomial probability vector $\theta$ of size $M$? We're talking about sampling from a specified probability distribution over probability vectors, $P(\theta)$. We're going to need to think about such things, because in some Bayesian inference problems, we're going to want to specify nonuniform priors over model probability parameters. Thinking about how to sample from a uniform $P(\theta)$ is one place to get started.

You might think you could just sample each individual $\theta_i$ uniformly on 0..1 (using Python's np.random.uniform, for example) – i.e. $\theta_i = u_i$ for a uniform random variate $u_i$ – but of course that doesn't work, because the $\theta_i$ are constrained to be probabilities that sum to one: $\sum_i \theta_i = 1$.

One bad way to sample probability vectors uniformly, and two good ways. The examples show 1000 samples, plotted in so-called "barycentric" coordinates (also known as a "ternary plot"), to show the 3D simplex in 2D. The top panel (the bad way) shows sampling by drawing uniform variates and renormalizing. The middle panel shows sampling from a Dirichlet distribution with all alpha=1. The bottom panel shows drawing exponential variates (equivalently, taking the negative log of uniform variates) and renormalizing.

So then you might think, fine, I'll sample elements uniformly then renormalize the vector: $\theta_i = u_i / \sum_j u_j$. But this doesn't sample probability vectors uniformly, as shown by the clustering of samples in the middle of the simplex in the top panel of the figure to the right.

The right way to do it is to use the Dirichlet distribution. The Dirichlet is a distribution over multinomial probability vectors:

$$
P(\theta \mid \alpha) = \frac{1}{Z(\alpha)} \prod_{i=1}^{K} \theta_i^{\,\alpha_i - 1},
$$

where the normalization factor is:

$$
Z(\alpha) = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma\!\left(\sum_i \alpha_i\right)}.
$$
With all $\alpha_i = 1$, all probability vectors are sampled uniformly, because the $\prod_i \theta_i^{\alpha_i - 1}$ term becomes a constant (1) for all values of $\theta$.

So if you go over to Python’s np.random.dirichlet, you’ll find that you can do:

   import numpy as np
   K     = 3                                   # dimension of the probability vector, for example
   theta = np.random.dirichlet(np.ones(K))     # all alpha_i = 1: uniform over the simplex

and you'll get a uniformly sampled probability vector of size $K$. That's illustrated in the middle panel to the right.

Then it turns out that in the special case of all $\alpha_i = 1$, you can prove that sampling uniformly from the Dirichlet is equivalent to taking the negative logarithm of uniform random variates (which gives exponentially distributed variates) and renormalizing,

$$
\theta_i = \frac{-\log u_i}{\sum_j -\log u_j}, \qquad u_i \sim \mathrm{Uniform}(0,1),
$$
or in Python

   import numpy as np
   theta = np.random.exponential(size=K)   # exponential variates are -log(uniform variates)
   Z     = np.sum(theta)
   theta = np.divide(theta, Z)             # renormalize so the K values sum to one

This special case might be useful to know, like if you're programming in a language other than Python that doesn't put a np.random.dirichlet so conveniently at your fingertips. The bottom panel of the figure to the right shows samples generated with this method.

If you want to play around with this, you can download the script I used to generate the figure.

The Dirichlet distribution will turn out to be useful to us for other reasons than just sampling uniformly-distributed probability vectors, as the next section delves further into.

a bit more on the Dirichlet

David MacKay often talked about the Dirichlet using an analogy to a “dice factory”: a Dirichlet generates probability vectors that you could use as parameters of a die roll.

The Dirichlet appears a lot in probabilistic inference when we're working with likelihoods that are multinomial distributions (rolling dice, flipping coins, Laplace counting girl versus boy births in Paris, choosing an RNA isoform by its abundance in an RNA population) and we want to specify a prior. If we've got multinomial probability parameters $\theta$ that generated observed counts $c$, and we want to specify the joint probability of everything, $P(c, \theta)$, then we'll break that down into a likelihood $P(c \mid \theta)$ and a prior $P(\theta)$. The Dirichlet, with its parameters $\alpha$, gives us a mathematically convenient way to parameterize the prior as $P(\theta \mid \alpha)$.

There are other ways one might parameterize a prior $P(\theta)$. The advantage of the Dirichlet is that it plays well with the multinomial; it is the multinomial's conjugate prior. The multinomial is:

$$
P(c \mid \theta) = \frac{1}{Z(c)} \prod_{i=1}^{K} \theta_i^{\,c_i},
$$

where the normalization constant is the reciprocal of the multinomial coefficient:

$$
Z(c) = \frac{\prod_i c_i!}{\left(\sum_i c_i\right)!}.
$$

When you put that together with a Dirichlet prior to get a joint probability $P(c, \theta \mid \alpha)$ you get:

$$
P(c, \theta \mid \alpha) = P(c \mid \theta)\, P(\theta \mid \alpha) = \frac{1}{Z(c)\, Z(\alpha)} \prod_{i=1}^{K} \theta_i^{\,c_i + \alpha_i - 1},
$$
which is just as easy to handle as the multinomial alone when we need to do any integration or differentiation.
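For example, marginalizing $\theta$ out of that joint probability is a one-line integral, because the integrand is itself an unnormalized Dirichlet (this is the standard Dirichlet-multinomial result, written in the notation above):

$$
P(c \mid \alpha) = \int P(c \mid \theta)\, P(\theta \mid \alpha)\, d\theta
                 = \frac{1}{Z(c)\, Z(\alpha)} \int \prod_{i=1}^{K} \theta_i^{\,c_i + \alpha_i - 1}\, d\theta
                 = \frac{Z(\alpha + c)}{Z(c)\, Z(\alpha)},
$$

where $Z(\alpha + c)$ means the Dirichlet normalization factor evaluated at parameters $\alpha_i + c_i$.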

The binomial distribution is the special case of the multinomial with $K = 2$, so you can also apply a Dirichlet prior to a binomial (the $K = 2$ Dirichlet is the Beta distribution), and integrate the result using Beta functions just like Laplace integrated the binomial by itself.