Week 07 section
Notes & code by Joshua Park (2024)
This week's lectures covered the expectation-maximization (EM) algorithm, a versatile method for optimizing the parameters of statistical models that involve hidden, or "latent", variables.
Some useful notation:
$X$: the observed data
$Z$: the latent variables
$\Theta$: the parameters
For example, in this week's example of identifying sequence motifs, our data $X$ are the observed sequences, and our model's parameters $\Theta$ are the position weight matrix and background distribution. The model also depends upon the position of the motif, which is the latent variable $Z = \lambda ^S$.
Fundamentally, the expectation-maximization algorithm tries to find the parameters that allow the model to fit the data best. In other words, the parameters that maximize the (log) likelihood.
$$ L(\Theta) = \log\ P(X | \Theta) $$
In practice, the likelihood often depends on unobserved variables. We can model these unobserved variables with their own probability distribution $P(Z|X,\Theta)$ that we'll update based on the data. We can then rewrite our likelihood expression to account for the latent variable as follows:
$$ \begin{align} Q(\Theta) &= \mathbb{E}_{Z}[\log\ P(X,Z| \Theta)] \\ &= \sum_Z P(Z | X, \Theta )\ \ \log\ P(X, Z | \Theta) \end{align} $$
In practice, we know neither $\Theta$ nor $Z$. The expectation-maximization algorithm works iteratively, starting from an initial guess at $\Theta$. The expectation step then calculates $Z$ (or more precisely, $P(Z | X, \Theta)$), and the maximization step finds a better $\Theta$ that maximizes $Q(\Theta)$. The two steps repeat until the parameters (or the likelihood) stop changing.
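Here's a minimal sketch of that overall loop. The names `run_em`, `e_step`, and `m_step`, and the fixed iteration count, are my own choices rather than anything from lecture; in practice you'd also monitor the log likelihood for convergence.

```python
def run_em(x, theta_init, e_step, m_step, n_iter=100):
    """Generic EM driver (a sketch): alternate E and M steps for a fixed number of iterations.

    e_step(x, theta)      -> posterior weights P(Z | X, theta)
    m_step(x, posteriors) -> the theta that maximizes Q(theta | theta^(t))
    """
    theta = theta_init
    for _ in range(n_iter):
        posteriors = e_step(x, theta)   # E step: compute P(Z | X, theta^(t))
        theta = m_step(x, posteriors)   # M step: re-estimate the parameters
    return theta
```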
Jupyter notebooks: w07-section.ipynb and w07-section-answers.ipynb
Expectation
The goal of the expectation step is to calculate $Q(\Theta)$ conditioned on our current guess, $\Theta^{(t)}$. In other words, we want to be able to calculate
$$
\begin{align}
Q(\Theta | \Theta^{(t)} ) &= \mathbb{E}_{P(Z|X,\Theta^{(t)})}[\log\ P(X,Z| \Theta)] \\
&= \sum_Z P(Z | X, \Theta^{(t)} )\ \ \log\ P(X, Z | \Theta)
\end{align}
$$
Practically, we just need to calculate $P(Z|X,\Theta^{(t)})$. For the sequence motif problem, we calculate $P(\lambda^S | X, \Theta^{(t)})$ for each possible $\lambda^S$ by scoring each start position under the current $\Theta$ (PWM and background model) and normalizing the scores by their sum.
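Here's a sketch of that calculation for a single sequence, assuming the PWM and background are stored as numpy probability arrays; the function name `motif_posterior` and the residue encoding are my own choices, not the notebook's.

```python
import numpy as np

# Map residues to array indices (this encoding is an arbitrary choice)
ALPHABET = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def motif_posterior(seq, pwm, bg):
    """E step for one sequence: P(lambda^S | x, theta^(t)) over all start positions.

    pwm : (W, 4) array, pwm[k, a] = probability of residue a at motif position k
    bg  : (4,)   array of background residue probabilities
    Assumes a uniform prior over start positions.
    """
    x = np.array([ALPHABET[c] for c in seq])
    W = pwm.shape[0]
    n_starts = len(x) - W + 1
    scores = np.empty(n_starts)
    for lam in range(n_starts):
        window = x[lam:lam + W]
        # Odds of "motif starts at lam" vs. all-background; the background probability of
        # the rest of the sequence is the same for every lam, so it cancels when we normalize.
        scores[lam] = np.prod(pwm[np.arange(W), window] / bg[window])
    return scores / scores.sum()   # P(lambda^S | x, theta^(t))
```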
Maximization
The maximization step chooses the next $\Theta^{(t+1)}$ by finding the value of $\Theta$ that maximizes $Q(\Theta | \Theta^{(t)})$.
$$\Theta^{(t+1)} = \underset{\Theta}{\mathrm{argmax}}\ Q(\Theta | \Theta^{(t)})$$
We can determine $\Theta^{(t+1)}$ analytically in terms of the $P(Z | X, \Theta^{(t)})$ we calculated in the expectation step using calculus (Lagrange multipliers are a common tool for handling the normalization constraints). For the sequence motif problem, this becomes the counts at each position weighted by $P(\lambda^S | X, \Theta^{(t)})$ and then normalized (see lecture notes).
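A rough sketch of that weighted-count update, reusing numpy, `ALPHABET`, and the `motif_posterior()` output from above (the function name is mine, I've skipped pseudocounts, and the background update would follow the same weighted-count pattern):

```python
def update_pwm(seqs, posteriors, W):
    """M step: re-estimate the PWM from posterior-weighted residue counts.

    seqs       : list of sequences (strings over ACGT)
    posteriors : one array per sequence, from motif_posterior() above
    Returns a (W, 4) matrix whose rows sum to 1.
    """
    counts = np.zeros((W, 4))
    for seq, post in zip(seqs, posteriors):
        x = np.array([ALPHABET[c] for c in seq])
        for lam, w in enumerate(post):
            window = x[lam:lam + W]
            counts[np.arange(W), window] += w   # count this window, weighted by P(lambda^S | x, theta^(t))
    return counts / counts.sum(axis=1, keepdims=True)
```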
K-means clustering
Now let's try applying the EM algorithm in another context: K-means clustering.
Let's say we are given a system where each datapoint is drawn from one of $k$ classes, each with an unknown mean $\mu_i$. Every datapoint is normally distributed around its class mean with the same standard deviation $\sigma$. Given a set of observations drawn from this distribution, can we find the $k$ means and assign each datapoint to one of them? This problem can be extended to $N$ dimensions, such as with RNA-seq datasets.
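To make this concrete, here's a small synthetic dataset drawn from that generative model; the means, $\sigma$, seed, and number of points below are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up values for illustration: k = 3 classes with a shared sigma
true_means = np.array([-4.0, 0.0, 5.0])
sigma      = 1.0
n_points   = 150

z_true = rng.integers(0, len(true_means), size=n_points)   # latent class of each point
x      = rng.normal(loc=true_means[z_true], scale=sigma)   # observed data X
```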
We can start by defining the variables and parameters we're working with:
$X$: the set of observations $\{x_i\}$
$Z$: the class identity of each observation, $z_i \in \{1, 2, ..., k\}$
$\Theta$: the means of the $k$ classes $\mu_1, ..., \mu_k$
For our expectation step, we need to calculate $P(Z | X, \Theta)$. We can use the fact that each $x_i \in X$ is normally distributed around its class mean $\mu_{z_i}$ with standard deviation $\sigma$ to calculate it:
$$P(x_i | z_i = j, \Theta^{(t)}) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x_i - \mu^{(t)}_j)^2}{2\sigma^2}}$$
Assuming each class is equally likely a priori, Bayes' rule then gives the posterior (the Gaussian normalization constants cancel):
$$P(z_i = j | \Theta^{(t)}, x_i) = \frac{e^{-\frac{(x_i - \mu^{(t)}_j)^2}{2\sigma^2}}}{\sum_z e^{-\frac{(x_i - \mu^{(t)}_z)^2}{2\sigma^2}}}$$
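A minimal numpy sketch of this E step (the function name and array layout are my own choices; it assumes numpy is imported as above):

```python
def e_step_kmeans(x, mu, sigma):
    """E step: responsibilities w[i, j] = P(z_i = j | x_i, theta^(t)).

    x : (N,) observations, mu : (k,) current means. Rows of the result sum to 1.
    """
    sq_dist = (x[:, None] - mu[None, :]) ** 2      # (N, k) squared distances to each mean
    log_w = -sq_dist / (2.0 * sigma ** 2)
    log_w -= log_w.max(axis=1, keepdims=True)      # subtract row max for numerical stability
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)        # normalize over the k classes
```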
The maximization step update can be derived by optimizing the $Q$ function. For convenience, I'll write the above $P(z_i = j | \Theta^{(t)}, x_i)$ as $w^{(i)}_j$, the weight of data point $i$ being assigned to cluster $j$. The final maximization calculation eventually becomes:
$$
\mu^{(t+1)}_j = \frac{1}{\sum_i w^{(i)}_j} \sum_i w^{(i)}_j\, x_i
$$
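Putting it together with the E step above (a sketch: the initial means and iteration count are arbitrary, and it reuses the toy `x` and `sigma` generated earlier):

```python
def m_step_kmeans(x, w):
    """M step: each new mean is the responsibility-weighted average of the data."""
    return (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

# Run EM from a rough initial guess at the means.
mu = np.array([-1.0, 0.0, 1.0])
for _ in range(50):
    w  = e_step_kmeans(x, mu, sigma)   # E step: responsibilities
    mu = m_step_kmeans(x, w)           # M step: weighted means
print(mu)   # should land near the true means, up to a relabeling of the clusters
```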