### MCB112: Biological Data Analysis (Fall 2017)

- again the problem of visualizing “high dimensional” data
- the key idea behind t-SNE (and SNE in general)
- the road to t-SNE lies through SNE
- on to t-SNE
- t-SNE in a nutshell
- three practicalities for the pset

# week 13: *t-SNE*

Old friends make their farewell appearance in this final week of the course… Student’s t distribution, entropy of probability distributions, optimization of objective functions using gradient information, and principal components analysis among them. It’s almost like there was a plan all along.

### again the problem of visualizing “high dimensional” data

RNA-seq data is “high dimensional” in the following sense. RNA-seq data is a matrix of M samples x N genes, with mapped read counts or some other sort of quantitation (TPM, FPKM) for each gene in each sample. Each sample (a cell, tissue, or cell type) is associated with a list of expression levels for N genes; we can think of this as a vector in an N-dimensional space. Samples with similar gene expression patterns are close in “gene space”, as measured by Euclidean distance (or some other distance measure we might choose).

We can just as readily think of each *gene* being associated with a
list of expression levels in M samples, hence as a vector in an
M-dimensional space, and we might want to see how genes cluster in
“sample space”.

We want the distances in these “spaces” to reflect biologically meaningful similarity between two samples, or two genes. We have to be careful about what the numbers in these vectors are, and what we think they mean. If we’re using mapped read counts, we’d want to normalize them somehow to account for the fact that one sample might have more total reads than another; we don’t want to consider two samples to be different and distant just because we mapped more reads to one of them. If we’re going to make some sort of assumption about how distances are distributed – for example, if we’re going to use some clustering technique that assumes Gaussian distributions – we might want to put our matrix in units of log counts, not raw counts, as we’ve seen in using k-means clustering with RNA-seq data.

It’s difficult to visualize clustering in high-dimensional spaces. There are a variety of techniques for trying to project high-dimensional data down to two or three dimensions for visualization purposes. We’ve already seen one of the main techniques, principal components analysis (PCA). A limitation of PCA is that it corresponds to a rigid orthogonal rotation of the original space followed by a projection to 2D - you’re projecting the data onto a plane through the original space. If there are only a few well-separated clusters, dominated by variance along only a few directions, PCA can do a fine job. But in many cases, a linear method like PCA isn’t good enough, and we need to do some sort of nonlinear projection into 2D.

A new nonlinear visualization method for high-dimensional data called
*t-SNE* was introduced by
Laurens van der Maaten and Geoff Hinton in 2008.
t-SNE stands for **t-distributed stochastic neighbor embedding**.
t-SNE produces evocative, strangely beautiful visualizations. Probably
for that reason as much as any, it has become extraordinarily popular
in biological data analysis and machine learning in general. A t-SNE
plot is almost *de rigueur* in a single cell RNA-seq paper these days.
The figure to the right shows an example from a recent paper that
sequenced 42,035 single cells from whole *C. elegans* animals, showing
how cells cluster into tissues like neurons and muscle.

Figure 3A from [Cao et al. 2017], showing a t-SNE plot of 42,035 *Caenorhabditis elegans* single cells. ('open image in new tab' to embiggen.)

What does t-SNE do? How does it work? What does it assume?

### the key idea behind t-SNE (and SNE in general)

Each data point is given an ordered preference for a relatively small number of neighbors in the original space. In the so-called “embedded space” (in the two-dimensional t-SNE plot), the points are arranged to preserve that preference and order.

A bit more specifically, each point has a specific preference in terms of a probability distribution for visiting neighboring points. As it visits its neighborhood, some points are visited more than others; distant points are rarely if ever visited.

The size of the “neighborhood” is defined as an effective number of
neighbor points that each point cares about. This is a controllable
parameter to t-SNE. Perplexingly, it’s called the
**perplexity**. We’ll see why; it’s information theory jargon, related
to Shannon entropy and the precise definition of what we mean by an
“effective number” of neighbors, when they have different visitation
probabilities.

The particular form of the probability distribution that t-SNE uses in the original high-dimensional space is a spherical Gaussian, where each point $i$ has its own parametric standard deviation $\sigma_i$; whereas in the embedded low-dimensional space, it uses the same distribution for every point (a Cauchy distribution, a.k.a. a Student t distribution with one degree of freedom). This has the effect that points in sparse versus dense regions in the original space tend to get more evenly distributed in the embedded space, because t-SNE cares about preserving neighbor relations, not absolute distances.

This evenness probably contributes to the visually pleasing geometry of t-SNE plots.

Go visit this
excellent article on t-SNE,
written by Martin Wattenberg, Fernanda Viegas, and Ian Johnson at
Google. The article is written in distill.pub,
a fabulous hypermodern forum for articles that use high-quality,
interactive data visualization to explain machine learning
techniques. It will give you a good high-level intuition for t-SNE –
and it uses MCB112-style control experiments, using t-SNE to project
*known 2D data clusters* into 2D t-SNE space, and you can watch many
ways in which t-SNE fails spectacularly, though always beautifully. (I
think the distill.pub article has one problem: it doesn’t sufficiently
consider the fact that t-SNE’s objective function isn’t convex, so it
can get stuck in local optima. Some of the article’s bizarre plots may
simply be spurious local optima that would look more sensible if you
re-ran t-SNE a few times and chose a better optimum. The local optimum
problem only comes up pretty far down in the article, and even then,
it presents results of 5 runs as if they’re equivalent, without
considering the value they achieved for the objective function.)

### the road to t-SNE lies through SNE

My explanation follows the concise and clear explanation in the 2008 van der Maaten and Hinton paper. They start by explaining SNE, stochastic neighbor embedding without the t, which was introduced in 2002 by Geoff Hinton and Sam Roweis.

We are given $M$ data points $x_1 \ldots x_M$, each of which is a vector in an $N$-dimensional space.

We aren’t going to need to index the dimensions; we’re going to write
everything in compact vector notation. But just so you remember: when
we write $\| x_i - x_j \|$ for the Euclidean distance
between points $x_i$ and $x_j$ (i.e. the Euclidean length, or
*norm*, of the resultant vector $x_i - x_j$), that’s
$\sqrt{\sum_d (x_{id} - x_{jd})^2}$. A fancy name for an ordinary
Euclidean distance is the $L^2$ (or just L2) norm.

We aim to project the original points $x_i$ to new data points $y_i$ in two dimensions.

First we define a conditional probability $p_{j \mid i}$ for the probability of visiting neighbor $j$ starting from point $i$. If we assume that we visit neighbor $j$ as a function of its distance in proportion to a spherical Gaussian distribution, then the probability that we visit $j$ relative to all other points $k$ (other than $i$ itself) is:

$$
p_{j \mid i} = \frac{\exp \left( - \| x_i - x_j \|^2 / 2 \sigma_i^2 \right)}
                    {\sum_{k \neq i} \exp \left( - \| x_i - x_k \|^2 / 2 \sigma_i^2 \right)}
$$
The $p_{i \mid i}$ are set to zero. We need a $\sigma_i$ for each
point $i$. We are going to set $\sigma_i$ so that point $i$ has
a certain *effective number of neighbors*. We’ll make that notion
precise using Shannon entropy. The Shannon entropy of the $p_{j \mid i}$ distribution is:

$$
H(P_i) = - \sum_j p_{j \mid i} \log_2 p_{j \mid i}
$$

Recall that a Shannon entropy $H$ is the number of yes/no questions it
takes to specify one of $2^H$ equiprobable outcomes; for example, it
takes 2 bits to specify one of the 4 DNA nucleotides ($\log_2 4 = 2$), and 10
bits to specify one of the 1024 possible DNA sequences of length 5
($\log_2 4^5 = 10$). Turning that around, if I let you ask $H$ yes/no
questions, you could distinguish one of $2^H$ equiprobable
possibilities. For non-uniform probabilities, we can still calculate
$2^{H(P_i)}$ and consider it some sort of “effective” number of
possibilities for our probability distribution. This is called the
**perplexity** of a probability distribution: $\mathrm{Perp}(P_i) = 2^{H(P_i)}$.
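For instance, a uniform distribution over four outcomes has perplexity exactly 4, while a skewed distribution over the same four outcomes has a smaller “effective number”. A quick sketch (the helper name `perplexity` is mine):

```python
import numpy as np

def perplexity(p):
    # perplexity = 2^H, where H is the Shannon entropy in bits;
    # zero-probability outcomes contribute nothing to H
    p = p[p > 0]
    H = -np.sum(p * np.log2(p))
    return 2.0 ** H

uniform4 = np.ones(4) / 4            # e.g. the four DNA nucleotides
skewed   = np.array([0.7, 0.1, 0.1, 0.1])
print(perplexity(uniform4))          # 4.0: four equiprobable outcomes
print(perplexity(skewed))            # < 4: fewer "effective" outcomes
```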

The larger $\sigma_i$ is, the more spread out the Gaussian, and
point $i$ visits more neighbors. As $\sigma_i$ decreases toward zero,
the Gaussian contracts and only the one closest neighbor is ever
visited, relative to the others. For each point $i$, we dial
$\sigma_i$ in to get to our chosen perplexity value.
(We’ll talk about the numerical details of that below; it becomes an
example of a general class of problems called *one dimensional
root-finding* problems.)
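Concretely, the dependence of perplexity on $\sigma_i$ looks like this (a sketch; `cond_p` and `perplexity` are my own helper names, and the distances are made up for illustration):

```python
import numpy as np

def cond_p(d2_row, i, sigma):
    # p_{j|i} for one point i, given its row of squared distances ||x_i - x_j||^2
    w = np.exp(-d2_row / (2.0 * sigma**2))
    w[i] = 0.0                       # a point is never its own neighbor
    return w / w.sum()

def perplexity(p):
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

d2 = np.array([0.0, 1.0, 4.0, 9.0, 25.0])    # squared distances from point 0
for sigma in (0.3, 1.0, 3.0, 10.0):
    print(sigma, perplexity(cond_p(d2, 0, sigma)))
# as sigma grows, perplexity climbs from ~1 toward 4 (the number of neighbors)
```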

It’s worth noting that the perplexity is a *global* parameter to SNE:
all the points want to visit the same number of neighbors, even
if some are in dense clusters and some are outliers in the
original space. This is responsible for some of the counterintuitive
behaviors described in the
distill.pub t-SNE visualizations.

In the embedded 2D space, we define the neighbor relations with conditional probabilities $q_{j \mid i}$ that are also calculated as Gaussians, but with a fixed variance $\sigma^2 = \frac{1}{2}$, so that:

$$
q_{j \mid i} = \frac{\exp \left( - \| y_i - y_j \|^2 \right)}
                    {\sum_{k \neq i} \exp \left( - \| y_i - y_k \|^2 \right)}
$$

with $q_{i \mid i}$ defined to be zero.

We try to find positions $y_i$ that make all $q_{j \mid i} \simeq p_{j \mid i}$. A typical measure for the difference between two probability distributions $P$ and $Q$ is the so-called Kullback-Leibler divergence, $\mathrm{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$. The sum of the KL divergences for all points $i$ and their neighbor distributions is:

$$
C(Y) = \sum_i \mathrm{KL}(P_i \| Q_i) = \sum_i \sum_j p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}}
$$

I write that this is a function $C(Y)$, because what we’re going to do in SNE is to find points $Y$ (the $y_i$) that minimize this objective function, given the original points $X$. We can use standard numerical function minimization, starting from a random guess at the $y_i$. The objective function is differentiable, so we also have gradient information to help this minimization. (I don’t show the gradient for SNE, since we’re moving straight along to t-SNE.)

The SNE objective function isn’t convex; it has local optima. We at least need to do multiple optimization runs from multiple starting points. The local optima problem turns out to be very serious; the optimization surface is very rugged. Points get stuck, unable to move past each other in the iterations of the optimization. SNE implementations use special tricks to help, as discussed in the van der Maaten paper.

The van der Maaten paper describes some of the problems of SNE, one of
which is the *crowding problem*. To alleviate these problems (which I
won’t go into), the 2008 paper introduced t-SNE.

### on to t-SNE

t-SNE makes a small change to the calculation of $p_{ij}$ (the probabilities in the original high-dimensional space), and a big change in the calculation of $q_{ij}$ in the embedded space.

First t-SNE converts the asymmetric SNE $p_{j \mid i}$’s to symmetric “joint probabilities” $p_{ij}$ that sum to one over all $i,j$ by:

$$
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2M}
$$

It seems that this is essentially arbitrary – the symmetric version has a simpler gradient. There are other ways to symmetricize the $p_{j \mid i}$, but this one has an advantage that each point $i$ is guaranteed to contribute at least $\frac{1}{2M}$ of the total probability. We’re considering two ways of obtaining the joint $p_{ij}$, either by $p_{j \mid i}$ or by $p_{i \mid j}$, and since these don’t give the same number (because what we’re doing is arbitrary), we’re just averaging them. This is not the heart of what makes t-SNE tick, but it’s part of the method.
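In numpy, with the conditionals stored as rows of an $M \times M$ matrix (here `Pc` is my stand-in name, filled with random rows just for illustration), the symmetrization is one line:

```python
import numpy as np

M = 4
rng = np.random.default_rng(0)
Pc = rng.random((M, M))                 # stand-in for the conditional p_{j|i}
np.fill_diagonal(Pc, 0.0)
Pc /= Pc.sum(axis=1, keepdims=True)     # each row (conditional) sums to 1

P = (Pc + Pc.T) / (2 * M)               # p_ij = (p_{j|i} + p_{i|j}) / 2M
print(P.sum())                          # 1.0: P sums to one over all i,j
```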

t-SNE also assumes a symmetric $q_{ij}$ distribution, but here is
where the big change is. t-SNE assumes a *heavy-tailed* distribution.
It will assign higher probabilities to points that are further away in
the 2D embedded space than they were in the original N-dimensional
space. This alleviates the crowding problem: points that were
clustered in the original roomy hypervolume need to be laid out in a
constrained area available in a mere two dimensions, and something has
to give, so we want to let them spread out. Specifically, t-SNE
assumes a Student t distribution with one degree of freedom, also
known as the Cauchy distribution, and calculates:

$$
q_{ij} = \frac{\left( 1 + \| y_i - y_j \|^2 \right)^{-1}}
              {\sum_{k \neq l} \left( 1 + \| y_k - y_l \|^2 \right)^{-1}}
$$

Again the $q_{ii}$ are defined as zero. Notice that the denominator
is a sum over *all* pairs $k \neq l$, not just a sum over the
possible neighbors $k$ for one point $i$: this is what
makes the $q_{ij}$ a symmetric “joint probability” that sums to one
over all $i,j$.
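As a sketch in numpy (the function name `tsne_Q` is mine), the whole $Q$ matrix can be computed at once:

```python
import numpy as np

def tsne_Q(Y):
    # q_ij under the Cauchy (Student t, one d.o.f.) kernel, with q_ii = 0,
    # normalized over all pairs so that Q sums to one
    d2 = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=2)   # M x M squared distances
    w = 1.0 / (1.0 + d2)
    np.fill_diagonal(w, 0.0)
    return w / w.sum()

Y = np.random.default_rng(1).normal(0.0, 1e-4, (5, 2))
Q = tsne_Q(Y)
print(Q.sum())        # 1.0: a joint distribution over all pairs
```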

With these definitions for t-SNE’s $p_{ij}$ and $q_{ij}$, the KL objective function becomes:

$$
C(Y) = \mathrm{KL}(P \| Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

We’ll seek points $Y$ to minimize this KL divergence, using numerical optimization starting from a randomly chosen point. The gradient is:

$$
\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) (y_i - y_j) \left( 1 + \| y_i - y_j \|^2 \right)^{-1}
$$

(When you go to implement this, notice that $p_{ij}$ and $q_{ij}$
are scalars, and $\left( 1 + \| y_i - y_j \|^2 \right)^{-1}$ is a scalar
that you calculate from the distance between the $y_i$ and $y_j$
vectors… but $(y_i - y_j)$ is a *vector*, with two dimensions.
The gradient $\frac{\partial C}{\partial Y}$ is an $M \times 2$ matrix.)
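The gradient can be computed for all points at once with broadcasting; here’s a sketch (`tsne_grad` is my name for it, and it assumes $P$ and $Q$ are the $M \times M$ joint probability matrices):

```python
import numpy as np

def tsne_grad(Y, P, Q):
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}
    diff = Y[:, None, :] - Y[None, :, :]              # M x M x 2: all y_i - y_j
    w = 1.0 / (1.0 + np.sum(diff**2, axis=2))         # M x M: the t kernel terms
    coeff = 4.0 * (P - Q) * w                         # M x M scalar coefficients
    return np.sum(coeff[:, :, None] * diff, axis=1)   # M x 2 gradient matrix
```

A useful sanity check (and a good habit for any hand-derived gradient) is to compare it against finite differences of the objective.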

### t-SNE in a nutshell

Given:

- $M$ data points $x_i$: the rows of an $M \times N$ matrix $X$.
- the chosen perplexity.

Calculation:

- Calculate the $p_{j \mid i}$, fitting each $\sigma_i$ to achieve the chosen perplexity.
- Calculate the $p_{ij}$.
- Sample $M$ random starting points $y_i$ (2D vectors) from $\mathcal{N}(0, 10^{-4} I)$.
- Use numerical optimization to minimize the KL divergence $C(Y)$, using gradient information $\frac{\partial C}{\partial y_i}$.

Result:

- $M$ data points $y_i$ in the embedded 2D t-SNE space.

The only part of that that we didn’t discuss already is the sampling of the initial points, which shouldn’t be super important; the t-SNE paper recommends choosing them from a small 2D Gaussian centered at zero. That’s what $\mathcal{N}(0, 10^{-4} I)$ means: $\mathcal{N}$ is customary notation for a Gaussian random variable, $I$ is the 2D identity matrix, and $10^{-4}$ is a small variance.

### three practicalities for the pset

#### 1. one dimensional root-finding using `scipy.optimize.bisect()`

To calculate the t-SNE stochastic neighbor distribution, you need the SNE conditional distributions $p_{j \mid i}$ ($M$ of them, for $M$ data points). To calculate the $p_{j \mid i}$, you need a $\sigma_i$ for each data point $i$, which sets the width of the Gaussian distribution that point $i$ uses to define its stochastic neighbors.

For each data point $i$, for a given $\sigma_i$, we calculate the Shannon entropy $H(P_i)$ of the $p_{j \mid i}$ distribution; $2^{H(P_i)}$ is the perplexity of the distribution, the effective number of neighbors. We’ll dial $\sigma_i$ in to get the perplexity we want.

When $\sigma_i$ is smaller, the stochastic neighbor distribution is tighter, reaching out to fewer neighbors; the perplexity is low. For larger $\sigma_i$, more neighbors, larger perplexity. Perplexity is a monotonically increasing function of $\sigma_i$, ranging from 1 (as $\sigma_i \to 0$) up to $M-1$ (as $\sigma_i \to \infty$). It’s a simple function and there’s got to be a solution.

If we cast this problem in terms of an objective function
$f(\sigma_i)$ that’s the difference between our current perplexity and
the target perplexity, then we’re looking for the value of $\sigma_i$
that achieves $f(\sigma_i) = 0$. Finding a scalar parameter $x$ such
that $f(x) = 0$ is a general problem called a *root-finding problem*,
which arises in many sorts of situations. It tends to get a chapter of
its own in books on scientific numerical methods (such as *Numerical
Recipes in C*, my go-to source).

The first step in any rootfinding problem is for you to identify a
*bracketing interval*: points $a$ and $b$ where $f(a)$ and $f(b)$
have different signs, so there must be a root $f(x) = 0$ at some point
in between them. A root finder is then going to narrow down that
interval until it finds the root.

In our problem, with a monotonically increasing function, we need to identify points $a < b$ such that $f(a) < 0$ and $f(b) > 0$. In a function as well behaved as what we’ve got here, and since we know $\sigma_i > 0$, we can lay down a test point anywhere and keep moving it toward (or away from) zero until we get the sign we need. For example, for the $a$ side, we can do something like:

```
a = 1.0
while f(a) >= 0: a /= 2
```

By taking multiplicative steps, we guarantee that $a > 0$.

There are
a variety of root finders in SciPy,
including fancy methods that try to accumulate information about your
$f(x)$ as you try points $x$, to make sophisticated predictions
about where $f(x) = 0$. A simple and robust method is *bisection*,
implemented in
`scipy.optimize.bisect()`.

To call `scipy.optimize.bisect()`, we provide it with the name of the
function that implements $f(x)$; with the bracketing endpoints $a$ and $b$;
and, if necessary (and it will be necessary for t-SNE), with
a tuple of data that `bisect()` will pass as arguments to your
objective function. For example, the following pseudocode:

```
def my_perplexity_diff(sigma, Di):
    # calculate the perplexity for distance matrix row D[i] given sigma
    # return that minus the desired perplexity: we're looking for a zero difference

# find your a,b bracketing interval, then:
sigma[i] = scipy.optimize.bisect(my_perplexity_diff, a, b, args=(D[i].flatten(),))
```

and you’d do that for each point $i$.

I slipped `D[i]` in there without defining it. Your original data are
a matrix $X$ of $M$ row vectors. The $p_{j \mid i}$ are calculated in
terms of Euclidean distances between row vectors $x_i$ and
$x_j$. As you dial in the $\sigma_i$, you’re going to calculate
those pairwise distances over and over again. You can save compute
time by precalculating an $M \times M$ matrix $D$ of all
pairwise distances between points. Moreover, then for a given
$i$, a row $D_i$ is sufficient to calculate the $p_{j \mid i}$.
You’ll see that my code for the pset is written to take
advantage of a precalculated $D$ distance matrix, and the
above pseudocode reflects that too.
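With numpy broadcasting, the whole distance matrix can be precalculated in a couple of lines (a sketch, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))              # M=100 points in N=20 dimensions
diff = X[:, None, :] - X[None, :, :]        # M x M x N array of all x_i - x_j
D = np.sqrt(np.sum(diff**2, axis=2))        # M x M Euclidean distances
```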

And what’s up with the `D[i].flatten()`? If the distance matrix is an
$M \times M$ numpy array `D[i][j]`, and I extract row `D[i]`, I get
another *array*, now $1 \times M$. But I wrote the objective
function pseudocode as if it’s just getting a row vector `Di`, so I
flatten the array to a vector when I pass it. I could do this
different ways – but `.flatten()` is good to know about (and its
opposite, `.reshape()`), as we’ll see now in dealing with the SciPy
optimizer.
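Putting the pieces of this section together, here’s a runnable sketch of fitting one $\sigma_i$ to a target perplexity (the helper name `my_perplexity_diff`, its argument list, and the test data are my own choices, not the pset’s):

```python
import numpy as np
from scipy.optimize import bisect

def my_perplexity_diff(sigma, Di, i, target):
    # perplexity of p_{j|i}, computed from row Di of the distance matrix,
    # minus the target perplexity: we're looking for a zero difference
    w = np.exp(-Di**2 / (2.0 * sigma**2))
    w[i] = 0.0
    p = w / w.sum()
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p))) - target

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))                                     # made-up data
D = np.sqrt(np.sum((X[:, None, :] - X[None, :, :])**2, axis=2))  # distance matrix
i, target = 0, 5.0

# bracket the root by multiplicative steps, then bisect
a = b = 1.0
while my_perplexity_diff(a, D[i], i, target) >= 0.0: a /= 2
while my_perplexity_diff(b, D[i], i, target) <= 0.0: b *= 2
sigma_i = bisect(my_perplexity_diff, a, b, args=(D[i], i, target))
```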

#### 2. providing gradient information to `scipy.optimize.minimize()`

We’ve already used the SciPy numerical optimizer in the week on regression, and you might want to look back at that. But we didn’t provide the minimizer with gradient information. We didn’t have to then; we were optimizing a pretty simple, low-dimensional log likelihood objective function. With t-SNE, we’re in a more complex situation, and we want to use gradient information.

For t-SNE, you’re going to create an objective function that takes a current set of points $Y$ (an $M \times 2$ array), and returns the KL divergence between the stochastic neighbor distribution $P$ in the original space (which you’ll have calculated already; you only need to do that once) and $Q$ in the embedded space (which depends on your current points $Y$). So your objective function in pseudocode is going to look something like:

```
def KL_dist(Y, P):
    Q = calculate Q from Y
    KL = calculate KL(P||Q)
    return KL
```

The gradient is an $M \times 2$ matrix: how much $C(Y)$ changes, for a small change in one of the two-dimensional coordinates of one point $y_i$. van der Maaten and Hinton (2008) show the equation for the gradient on p. 2584 of their paper. It depends on $P$ and $Q$ too, and on calculating $Q$ given the current $Y$. You could write a separate function to calculate the gradient, but then you’d have to calculate $Q$ twice, wastefully.

It doesn’t seem to be super well documented, but you can optionally have your objective function return the gradient too. This can save some calculation, compared to having a separate gradient-calculating function, because the calculation of the objective and its gradient often share computationally intensive steps – like here, where we’d really rather just calculate $Q$ once, for a given $Y$. So in pseudocode our objective function will look something like:

```
def KL_dist(Y, P):
    Q = calculate Q from Y
    KL = calculate KL(P||Q)
    gradient = np.zeros(Y.shape)
    calculate each gradient term
    return KL, gradient
```

Now a detail. The SciPy optimizer is very general - it doesn’t want
to know about the dimensionality of your problem, it just wants to
deal in terms of one big vector of current $Y$ values, and SciPy also
wants the gradient to be one big vector. We use `.flatten()` to convert an
$M \times 2$ matrix to a single vector for SciPy, and we use `.reshape()` to
convert a vector back to an $M \times 2$ matrix. $P$ is fine as an
$M \times M$ array – SciPy passes the extra constant arguments in
exactly the form you provided them.

In pseudocode, then:

```
def KL_dist(Y, P):
    M = P.shape[0]            # or len(Y) // 2
    Y = np.reshape(Y, (M,2))
    Q = calculate Q from Y
    KL = calculate KL(P||Q)
    gradient = np.zeros(Y.shape)
    calculate each gradient term
    return KL, gradient.flatten()
```

The hard part is writing the objective function and getting its input and output in the form that SciPy needs, with the input $Y$ and the output gradient as flattened vectors of $2M$ elements. To call the minimizer then looks like:

```
Y = np.random.normal(0., 1e-4, (M,2))   # choose random initial Y points
result = scipy.optimize.minimize(KL_dist, Y.flatten(), args=(P,), jac=True)
```

Recall that `args=(P,)` is the interface for handing SciPy a bundle of constant additional data (as a tuple) that it needs to pass to the objective function (as additional arguments); here, the stochastic neighbor distribution $P$ in the original space that we’re trying to match.

`jac=True` is the poorly-documented way of telling SciPy that our
objective function is going to return both the value and the
gradient. “jac” means Jacobian matrix. The other way to do this is to
pass the name of a function that calculates the gradient,
i.e. `jac=my_gradient_func`.

When the minimizer returns, it passes back a result object. The parts we care about are:

- `result.x`: the optimized solution – our embedded $Y$ coordinates, as a flattened $2M$-long vector.
- `result.success`: True if the minimizer succeeded. We ought to check this.
- `result.fun`: the objective function value at the solution.

You’ll take `result.x` and reshape it to an $M \times 2$ matrix, and
that’s your t-SNE plot in two dimensions.
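Putting the whole scheme together, here’s a runnable sketch (with a random symmetric $P$ standing in for a real stochastic neighbor distribution, just to exercise the machinery; the real pset uses the $P$ you fit above):

```python
import numpy as np
from scipy.optimize import minimize

def KL_dist(y, P):
    # y arrives flattened from scipy; reshape, compute Q and KL(P||Q),
    # and return both the objective and the flattened gradient (jac=True)
    M = P.shape[0]
    Y = y.reshape((M, 2))
    diff = Y[:, None, :] - Y[None, :, :]
    w = 1.0 / (1.0 + np.sum(diff**2, axis=2))   # Cauchy kernel terms
    np.fill_diagonal(w, 0.0)
    Q = w / w.sum()
    mask = ~np.eye(M, dtype=bool)
    KL = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    grad = 4.0 * np.sum(((P - Q) * w)[:, :, None] * diff, axis=1)
    return KL, grad.flatten()

rng = np.random.default_rng(5)
M = 20
A = rng.random((M, M)); A = A + A.T
np.fill_diagonal(A, 0.0)
P = A / A.sum()                              # stand-in for the real P matrix
Y0 = rng.normal(0.0, 1e-4, (M, 2))
result = minimize(KL_dist, Y0.flatten(), args=(P,), jac=True)
Y = result.x.reshape((M, 2))                 # the embedded 2D coordinates
```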

#### 3. using `sklearn.manifold.TSNE()`

After all this, you’ll be pleased to know that using the canned t-SNE implementation in scikit-learn is super easy:

```
from sklearn.manifold import TSNE
X = your MxN data array
Y = TSNE(perplexity=10).fit_transform(X)
# Y is now Mx2; plot it.
```

Why didn’t I tell you that in the first place? Wielding a powerful canned data analysis routine without respectful understanding is like handing a child a gun.

Which seems like a good epitaph for MCB112, so we stop here. Good luck on the pset, and thanks for taking the course!