week 08: hidden Markov models
What makes HMMs so popular is that the name is so tantalizing. Something is hidden and we're finding it and we have a Russian name to do it. - David Lipman (Science, 1996)
what's hidden about a hidden Markov model
Many sequence analysis problems fundamentally boil down to putting some sort of annotation label on each observed residue. For example, we might imagine that DNA residues inside a coding region have a statistical composition that is distinguishable from noncoding regions. We could make a probability model that exploits that statistical difference to infer coding vs. noncoding regions. For another example, in motif analyses (following from last week), we imagine that each DNA residue is generated either from the background frequencies or from one of the W positions of a position-specific weight matrix, which amounts to labeling each residue with a PWM position index or "none".
Even sequence alignment can be thought of as a labeling problem. Given a query sequence $y$ and a target sequence $x$, an alignment is basically labeling each residue in $x$ with the index of the position in $y$ it is aligned to, or "none" if it is an insertion relative to $y$.
In such problems, we are typically unable to directly calculate the probability $P(x \mid \theta)$ of an observed sequence $x$ given our model parameters $\theta$. This is because the probabilities we assign to the observed sequence depends on the state label of each residue. Given state labels $\pi$ for sequence $x$, we can calculate $P(x, \pi \mid \theta)$ for the joint probability of the sequence and a particular labeling. If we can calculate these terms for all possible labelings $\pi$, we can marginalize to calculate $P(x \mid \theta) = \sum_{\pi} P(x, \pi \mid \theta)$. The problem is that the state labels are typically unknown, and indeed are often precisely what we want to infer from the observed sequence data.
One of the simplest probability models for this sort of sequence labeling problem is a hidden Markov model. The observed sequence $x_1 \ldots x_L$ is assumed to be emitted one residue at a time from an underlying state path $\pi_1 \ldots \pi_L$. Each step along the state path to a new $\pi_i$ from a previous $\pi_{i-1}$ is assessed a state transition probability. Only the sequence $x$ is observed; the state path $\pi$ is hidden (unobserved). The state transition probability to a new state $\pi_i$ depends on exactly one previous state $\pi_{i-1}$ and no other previous context, so the state transition probabilities define a first order Markov chain. Thus, we have a "hidden Markov model": the underlying state path is a Markov model with first-order Markovian state transitions, but the observable data is a sequence generated one residue at a time from each state of that unobserved state path.
HMMs are typically drawn graphically. States are represented as nodes in the graph (circles or whatever). State transitions are represented as arrows between states. An emission probability distribution is attached to each state.
definition of an HMM
There are $M$ main states. Each state $k$ has transition probabilities $t_{kj}$ to one or more other states $j$ (possibly including itself). Each state $k$ also has emission probabilities $e_k(a)$ for each possible observable symbol $a$ in a symbol alphabet of size $A$. (A=4 for DNA, and so on.)
There are different ways to handle starting and ending a state path. One way to start is to specify an initial distribution over the $M$ states. A commonly used variant of this is to add a special "begin" state, which always starts with initial probability 1. The begin state is typically "mute" and does not use an emission distribution (although it may have one just to simplify notation or code).
One way to end is to specify that the HMM is going to generate a sequence of length $L$, and just stop when that length is reached. Another way to end is to have a special "end" state (which again would typically be mute) and to stop when the end state is reached.
A convention that's particularly convenient for compact notation and compact code is to include a special state 0 that functions as both begin and end. This state 0 does not use its emission distribution, and it does not transition to itself ($t_{00} = 0$). Transitions $t_{0k}$ to states $1..M$ are the initial starting distribution. Transitions $t_{k0}$ from states $1..M$ terminate the state path. State paths are indexed $\pi_{0}, \pi_{1 \ldots L}, \pi_{L+1}$, with the $\pi_0 = \pi_{L+1} = 0$, and each $\pi_i$ being the hidden state assigned to observed symbol $x_i$.
Using this convention for the start/end state 0, the HMM is thus fully described by an $(M+1) \times (M+1)$ transition probability matrix, and a $(M+1) \times A$ emission probability matrix for alphabet size $A$.
sampling from an HMM
An HMM is a generative probability model. The transition and emission probabilities (collectively denoted $\theta$) are sufficient to specify a distribution $P(x \mid \theta)$ over all observable sequences.
As with any generative probability model, as the name suggests, a good place to start is to imagine generating data from the model (as in a synthetic positive control). This exercise assures us that the model parameters $\theta$ do indeed suffice to specify a complete generative probability model of the data. HMMs provably assure that $\sum_x P(x \mid \theta) = 1$ as probabilities should. For small models with a finite and reasonably limited space of possible emitted $x$, we may also verify this empirically by brute force enumeration.
In pseudocode, sampling a sequence and a state path from an HMM goes like this:
pi[0] = 0
pi[1] = sample initial state k from t_0k
i = 1
while pi[i] != 0:
k = pi[i]
x[i] = sample residue a from e_k(a)
pi[i+1] = sample next state j from t_kj
i++
This gives us both a sequence $x_1 \ldots x_L$ and a (now not so hidden) state path $0, \pi_1 \ldots \pi_L, 0$.
The joint probability $P(x, \pi \mid \theta)$ is the product of all the transitions and emissions:
$$ P(x, \pi \mid \theta) = t_{0\pi_1} \prod_{i=1}^L e_{\pi_i}(x_i) t_{\pi_i,\pi_{i+1}} $$
Of course, our problem is usually the inverse of generating data from the model parameters. Our problem is typically that we are given one or more observed sequences $x$, and we need to infer something about them - maybe the hidden state paths $\pi$ that generated them (as in an optimal alignment problem or a sequence annotation problem), or maybe the probability of the sequence given the model, $P(x \mid \theta) = \sum_{\pi} P(x, \pi \mid \theta)$, or maybe the parameters of an entirely new HMM model of the data.
We rarely see a problem where we can brute force enumerate the $P(x, \pi \mid \theta)$ terms that we will be manipulating in our probability algebra. There are typically a nigh-infinite number of possible sequences $x$, and a combinatorial explosion of possible state paths $\pi$ for every one of them. Fortunately, there are efficient dynamic programming algorithms for working with HMMs.
the Viterbi algorithm
The Viterbi algorithm finds the optimal (maximum likelihood) state path $\hat{\pi}$. It is essentially comparable to optimal sequence alignment.
The Viterbi algorithm recursively calculates $\log P(x, \hat{\pi} \mid \theta)$: the log joint probability of a sequence $x$ and the optimal (maximum likelihood) state path $\pi$ for generating that sequence, given the HMM parameters $\theta$.
It does this using dynamic programming, recursively calculating the log probability of longer and longer sequence prefixes $x_1..x_i$ to some optimal state path that ends at state $k$: in notation, this is $\log P(x_{1..i}, \hat{\pi}{1..i-1}, \pi_i = k \mid \theta).$ We denote these partial solutions as $V$ and store them in a $(L+1) \times (M+1)$ dynamic programming matrix.
After the DP matrix is filled, the optimal state path is then reconstructed by backtracing through the matrix, by following the optimal choice at each DP cell. This backtrace is typically facilitated by keeping a second matrix of "traceback pointers", storing the $\mathrm{argmax}j$ that maximized each $V$ calculation.
In pseudocode, the Viterbi algorithm is:
# Initialization:
V[0,0] = 0
V[0,k] = -infinity for all k > 0
# Recursion:
for observed symbols i = 1 to L:
V[i,0] = -infinity
for states k = 1 to M:
V[i,k] = log e_k(x_i) + max_j (V[i-1,j] + log t_jk)
ptr[i,k] = argmax_j (V[i-1,j] + log t_jk)
# Termination:
log P(x,pi) = max_k V[L,k] + log t_k0
pi[L] = argmax_k V[L,k] + log t_k0
# Traceback:
for i = L down to 1:
pi[i-1] = ptr[i,pi[i]]
The algorithm takes $O(LM^2)$ time, $O(LM)$ space. Many useful HMMs have sparse transition matrices with a constant number of transitions per state, in which case the time complexity reduces to $O(LM)$. There are a variety of dynamic programming tricks that enable real-world (i.e. non-textbook) implementations in $O(M)$ memory.
the Forward algorithm
The Viterbi algorithm calculates the probability of sequence and just one state path, albeit the most likely one. Remarkably, we can obtain the sum over the ensemble of all state paths merely by replacing the max with a logsumexp, summing instead of maximizing at each step of the recursion.
The Forward value $F_{ik}$ is the log of the marginal probability of the sequence prefix up to position $i$, $x_1 \ldots x_i$, summed over all possible state path prefixes $\pi_1 \ldots \pi_{i-1}$, with $\pi_i = k$. In notation, this is $\log P(x_{1..i}, \pi_i = k \mid \theta)$.
In pseudocode, the Forward algorithm is:
# Initialization:
F[0,0] = 0
F[0,k] = -infinity for all k > 0
# Recursion:
for observed symbols i = 1 to L:
F[i,0] = -infinity
for states k = 1 to M:
F[i,k] = logsumexp_j (F[i-1,j] + log t_jk + log e_k(x_i) )
# Termination
log P(x) = logsumexp_k (F[L,k] + log t_k0)
The time and space complexity of Forward is the same as Viterbi.
the Backward algorithm
The dynamic programming recursions can also be arranged to run backwards, over longer and longer suffixes instead of prefixes. This is of little use in calculating $\log P(x \mid \theta)$, since we can already do that with Forward. The usefulness of the Backward calculation comes from what happens when we put it together with a Forward calculation to do what's called posterior decoding, in the next section.
The Backward value $B_{ik}$ is the log of the marginal probability of the sequence suffix from position $i+1$, $x_{i+1} \ldots L$, summed over all possible state path suffixes $\pi_{i+1} \ldots \pi_L$, with $\pi_i = k$. In notation, this is $\log P(x_{i+1..L}, \pi_i = k \mid \theta)$.
Note there's a crucial off-by-one in how we calculate $B_{ik}$. It doesn't include the emission probability of residue $x_i$. This is a bookkeeping issue that arises because the main use of Backward is in combination with Forward. We want to be able to add $F_{ik} + B_{ik}$ to get the probability of sequence $x$ summed over all state paths that generate residue $i$ with state $k$, and we don't want to double-count the log emission probability $e_k(x_i)$:
$$ \log P(x, \pi_i = k \mid \theta) = \log P(x_{1..i}, \pi_i = k \mid \theta) + \log P(x_{i+1..L}, \pi_i = k \mid \theta). $$
In pseudocode, the Backward algorithm is:
# Initialization:
B[L,k] = log t_k0 for all k
# Recursion:
for observed symbols i = L-1 down to 1:
B[i,0] = -infinity
for states k = 1 to M:
B[i,k] = logsumexp_j ( B[i+1,j] + log t_kj + log e_j(x_{i+1}) )
# Termination:
B[0,k] = -infinity for all k > 0
B[0,0] = log P(x) = logsumexp_j (B[1,j] + log t_0j + log e_j(x_1))
posterior decoding
Sometimes, we are most interested in the optimal labeling of an individual residue $x_i$. The Viterbi algorithm gives us the optimal state path: the state path with the highest overall probability. We can simply use the labeling $\hat{\pi}i$ from the Viterbi path as the label to infer for $x_i$. But in an important sense, this is actually not the optimal label. There may be (and usually are) many different state paths that all generate residue $x_i$ with state $k$. It's possible for a set of paths to collectively have a higher probability of generating $x_i$ from a different state $k$ than the one that happens to be in the single optimal path. We want to marginalize over that enesemble of paths to get the probability that $x_i$ is generated by state $k$, then choose the state with the highest ensemble probability. This is called _posterior decoding.
We obtain this probability by combining values we obtained in the Forward and Backward dynamic programming calculations over prefixes and suffixes, starting from where we left off above:
$$ \log P(x, \pi_i = k \mid \theta) = \log P(x_{1..i}, \pi_i = k \mid \theta) + \log P(x_{i+1..L}, \pi_i = k \mid \theta). $$
Mathematically, we use Bayes to invert this to get the posterior we want:
$$ \begin{eqnarray} P(\pi_i = k \mid x, \theta) & = & \frac{P(x, \pi_i = k \mid \theta)}{\sum_j P(x, \pi_i = j \mid \theta)} \ & = & \frac{P(x, \pi_i = k \mid \theta)}{P(x \mid \theta)} \ \end{eqnarray} $$
So in pseudocode, the Decoding algorithm is simply:
D_ik = exp ( F_ik + B_ik - log P(x) )
where we've obtained $\log P(x)$ from either the Forward or Backward algorithms.
Using these decoding probabilities, we can define a different kind of optimal state path, one in which we choose the highest probability decoding at each individual position $x_i$:
for i = 1 to L:
pi_i = argmax_k D_ik
The decoding matrix $D_{ik}$ also gives us a useful measure of the uncertainty of the labeling at each position. We might prefer to only infer labels at positions where $D_{ik}$ is high and over some chosen threshold, leaving other uncertain positions uninferred. We might also choose to visualize the uncertainty of our inferred labellings by some sort of heat map visualization of the decoding matrix $D_{ik}$.
Parameterizing HMMs
Where do the HMM parameters come from in the first place?
Sometimes we know enough about a problem to directly specify our probability parameter values. This is the case in the homework problem this week, for example, where each state of our HMM generates DNA residues according to the residue composition of a different phage genome, and these compositions are known from the phage genome sequences. The state transition probabilities can also be set sensibly from specifying the expected length of a segment of sequence that is generated by the same state. When HMMs model a stretch of consecutive identical labels by using a self-transition $t_{kk}$, this implicitly assumes that the length of that stretch is geometrically distributed, with mean length $\frac{1}{1-t_{kk}}$.
Other times, we might have labeled sequence data as ground truth, where the state paths are considered to be known. For example, we might have a multiple sequence alignment, where the columns of the alignment have been aligned by whatever the sequence annotation is: by conserved positions of a sequence profile, or by the positions of a PWM-like model. Then parameterization consists of taking the set of observed sequences and given state paths, collecting observed counts for state transitions and residue emissions, and converting counts to probabilities by the usual means.
What if we don't have a good intuition for setting probability parameters directly (as would often be the case for all but the simplest models), and we don't have labeled ground truth data to collect counts from?
HMMs can be trained on a set of unlabeled sequences $X$. Just as we saw last week in the case of motif finding with an ungapped PWM, we have a chicken-and-egg problem. If I knew the parameters $\theta$, I could use Forward, Backward, and Decoding to infer the probability of the ensemble of state paths for each sequence. If I know the probabilities of the ensemble of state paths, I can estimate expected counts and use those to estimate probability parameters. I have another expectation-maximization (EM) problem. In the particular case of HMMs, the algorithm is called the Baum-Welch algorithm, or Baum-Welch EM.
parameter estimation by Baum-Welch EM
We've already seen posterior decoding for obtaining the probability that state $k$ generates residue $x_i$. To get expected counts $E_k(a)$, we sum decoding probabilities over all positions $x_i = a$, and over all the sequences $x^s$ in our training set of sequences:
$$ E_k(a) = \sum_s \sum_{ i : x_i^s = a} D_{ik} $$
The crazy notation $i: x_i^s = a$ means "for all i such that $x_i^s$ is an a": we're trying to get the counts of just residue $a$, so sum over all the positions $x_i$ that are that residue. Another notational trick you'll see in this sort of case is $\sum_i \delta(x_i = a) D_{ik}$, where the function $\delta(x_i = a)$ is 1 when its condition is true and 0 if not. This is called a "Kronecker delta".
The emission counts are easy; we already had posterior decoding. Now for the state transition counts.
To get expected state transition counts $T_{kj}$, we have to be one step more creative about using the Forward and Backward prefix and suffix probabilities, compared to how we did posterior decoding. We need to calculate the marginal probability of the sequence summed over all paths that use transition $t_{kj}$ at a given position $i$. That will be the probability of the prefix $x_1..x_i$ conditional on being in state $k$ for $x_i$, times the probability $t_{kj}$ to get to state $j$ at position $i+1$, times the probability of suffix $x_{i+1}..x_L$ conditional on being in state $j$ at $x_{i+1}$. Because of the off-by-one issue in Backward, where the Backward variable $B_{i+1,k}$ hasn't assessed the emission probability for $x_{i+1}$ yet, we'll need to include that emission probability too. We get:
$$ \log P(x, \pi_i = k, \pi_{i+1} = j \mid \theta) = F_{ik} + \log t_{kj} + \log e_j(x_{i+1}) + B_{i+1,k} $$
Then we can use Bayes to calculate the posterior $P(\pi_i = k, \pi_{i+1} = j \mid x, \theta)$ analogous to how we did posterior decoding:
$$ P(\pi_i = k, \pi_{i+1} = j \mid x, \theta) = \exp\left( F_{ik} + \log t_{kj} + \log e_j(x_{i+1}) + B_{i+1,k} - \log P(x \mid \theta) \right) $$
and then sum over all positions $i$ and all training sequences $x^s$ to get expected counts:
$$ T_{kj} = \sum_s \sum_i \exp\left( F^s_{ik} + \log t_{kj} + \log e_j(x_{i+1}) + B^s_{i+1,k} - \log P(x^s \mid \theta) \right) $$
Thus collected, the expected counts $E_k(a)$ and $T_{kj}$ are used to estimate new probabilities $e_k(a)$ and $t_{kj}$ in the usual way, either by simply normalizing to frequencies, or by first adding pseudocounts (i.e. assume a Dirichlet prior for the parameters).
These steps of estimating the probabilities of the unknown state path ensemble given parameters, then parameters given counts collected from the current estimate of the state path ensemble probabilities, is one iteration of Baum-Welch EM.
Baum-Welch EM typically converges slowly. It is also quite prone to local optima, so it needs to be run from many different initializations. And for this reason, the homework doesn't ask for any Baum-Welch EM. I've been unable to find an interesting problem in phage genomes that would work in reasonable compute time.
Further reading
L Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition". Proc. IEEE (1989).
SR Eddy, "What is a hidden Markov model?". Nature Biotechnology (2004).
R Durbin, SR Eddy, A Krogh, G Mitchison, "Markov chains and hidden Markov models", chapter 3 of Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press (1998). I feel a little guilty for not having good figures to go with this week's lecture, especially since HMMs are graphical models. So I included a PDF of a book chapter that follows a similar outline as the lecture notes, with similar notation (no surprise, note the author list!), which has accompanying figures. The PDF is from our publisher page proofs - so ignore the marginal comments that are from us authors communicating with the typesetters.