By Daniel Eaton (11/8/2018)

You can also download this page in Jupyter notebook format.

Probabilistic graphical models are graphs whose structure expresses the particular statistical dependency relationships of some set of random variables. We will be focusing on a subclass of graphical models called **Directed graphical models** (DGMs) or **Bayes nets**. In these graphs we have two types of objects:

- **nodes**, which represent random variables
- **directed edges**, which encode conditional independence structure

The key concept in understanding the properties of DGMs is **conditional independence** (CI). We say that two random variables X and Y are conditionally independent given Z if:

$$ p (X,\:Y\:|\:Z) = p(X\:|\:Z)p(Y\:|\:Z) $$
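As a quick sanity check (a toy example of my own, not from the lecture), we can build a small joint distribution that factorizes this way by construction and verify the identity numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a joint p(x, y, z) that satisfies X ⊥ Y | Z by construction:
# p(x, y, z) = p(z) p(x|z) p(y|z), with 2 states per variable.
p_z = np.array([0.3, 0.7])
p_x_given_z = rng.dirichlet(np.ones(2), size=2)   # rows: z, cols: x
p_y_given_z = rng.dirichlet(np.ones(2), size=2)   # rows: z, cols: y

# joint[x, y, z] = p(z) p(x|z) p(y|z)
joint = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)

# Left-hand side: p(x, y | z), normalizing the joint over (x, y) for each z
p_xy_given_z = joint / joint.sum(axis=(0, 1), keepdims=True)

# Right-hand side: p(x|z) p(y|z), recovered from the joint's marginals
p_x_z = joint.sum(axis=1) / joint.sum(axis=(0, 1))   # p(x|z), shape (x, z)
p_y_z = joint.sum(axis=0) / joint.sum(axis=(0, 1))   # p(y|z), shape (y, z)
product = np.einsum('xz,yz->xyz', p_x_z, p_y_z)

print(np.allclose(p_xy_given_z, product))  # True
```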

A DGM simply specifies, for a given set of random variables, all of the conditional independencies that hold among them. We have already seen a model commonly expressed as a DGM, the Markov model:

Remember that in (first-order) Markov models, we could expand the joint probability as follows:

$$p(x_{1},x_{2},x_{3}) = p(x_{1})p(x_{2}|x_{1})p(x_{3}|x_{1},x_{2})$$

$$ = p(x_{1})p(x_{2}|x_{1})p(x_{3}|x_{2})$$

via the Markov property: $$p(x_{3}|x_{1},x_{2}) = p(x_{3}|x_{2})$$

We can arrive at this result using just the Markov model DGM and the CI structure it implies. One CI rule in DGMs is that if three nodes form a chain (their arrows all point in one direction), then the first is conditionally independent of the third given the second. In this case we have:

$$p(x_{1},x_{3}|x_{2}) = p(x_{1}|x_{2})p(x_{3}|x_{2})$$

Now consider $p(x_{3}|x_{1},x_{2})$ once more:

$$p(x_{3}|x_{1},x_{2}) = \frac{p(x_{1},x_{2},x_{3})}{p(x_{1},x_{2})}$$

$$ = \frac{p(x_{1},x_{3}|x_{2})p(x_{2})}{p(x_{1}|x_{2})p(x_{2})}$$

$$ = \frac{p(x_{1},x_{3}|x_{2})}{p(x_{1}|x_{2})}$$

but now from our CI assumption:

$$ = \frac{p(x_{1}|x_{2})p(x_{3}|x_{2})}{p(x_{1}|x_{2})}$$

$$ = p(x_{3}|x_{2})$$

Thus, we have shown how the Markov property arises from the Markov model DGM.
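The derivation above can also be checked numerically (a toy sketch of my own): build a random first-order Markov chain over three binary variables and confirm that $p(x_{3}|x_{1},x_{2})$ does not depend on $x_{1}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random first-order Markov chain over 3 binary variables:
# p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2)
p1 = rng.dirichlet(np.ones(2))
T12 = rng.dirichlet(np.ones(2), size=2)  # rows: x1, cols: x2
T23 = rng.dirichlet(np.ones(2), size=2)  # rows: x2, cols: x3

joint = np.einsum('a,ab,bc->abc', p1, T12, T23)  # joint[x1, x2, x3]

# p(x3 | x1, x2), computed directly from the joint...
p_x3_given_x1x2 = joint / joint.sum(axis=2, keepdims=True)

# ...should equal p(x3 | x2) for BOTH values of x1 (the Markov property).
print(np.allclose(p_x3_given_x1x2[0], T23))  # True
print(np.allclose(p_x3_given_x1x2[1], T23))  # True
```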

Here are the Bayes' balls rules:
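As a quick reference (my own summary of the standard rules, not the lecture figure), the three canonical three-node structures behave as follows:

- **Chain** $X \rightarrow Y \rightarrow Z$: $X \perp Z \mid Y$ (observing $Y$ blocks the path)
- **Fork** $X \leftarrow Y \rightarrow Z$: $X \perp Z \mid Y$ (a common cause, blocked when observed)
- **Collider** $X \rightarrow Y \leftarrow Z$: $X \perp Z$ marginally, but observing $Y$ (or any descendant of $Y$) unblocks the path, making $X$ and $Z$ dependent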

Here are some statements from lecture that we can try to derive using the DGM rules on this graphical model:

$$ P(X|Y,Z)=P(X|Y) $$

which holds if X and Z are conditionally independent given Y, because:

$$ P(X|Y,Z) = \frac{P(X,Y,Z)}{P(Y,Z)} = \frac{P(X,Z|Y)P(Y)}{P(Z|Y)P(Y)}$$

$$ = \frac{P(X,Z|Y)}{P(Z|Y)} = \frac{P(X|Y)P(Z|Y)}{P(Z|Y)} = P(X|Y)$$

Can we get the following statements?

$$P(S|G,\theta)=P(S|G)$$

i.e. S and $\theta$ are conditionally independent given G

$$P(O|S,G,\theta)=P(O|G)$$

i.e. O is conditionally independent of S and $\theta$ given G

$$P(R|O,S,G,\theta)=P(R|O,S,G)$$

i.e. R is conditionally independent of $\theta$ given O, S and G

In order to see what an implementation of isoform abundance estimation by EM should look like, I will go over another generative model with some suggestive variable names.

You and your friends have recently come up with a (fairly convoluted) game. The game proceeds as follows:

1) Each player places a bet on an integer from 1 to 12, inclusive.

2) You roll a 6-sided die, producing a result, g. Note that this die may or may not be fair; more on this later.

3) Then, you grab a biased coin. The bias of the coin is specified by the previous roll and is $p(\textrm{heads}) = \frac{g}{6}$. Your collection of biased coins is comprehensive and will always have an appropriate coin.

4) Flip the selected coin 6 times and record the total number of heads, s.

5) Compute r = s + g. If someone bet on r, they win!

Before going any further, let's try to simulate this process.

In [32]:

```
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from scipy import stats
from scipy.special import logsumexp
import itertools
```

In [33]:

```
def awesome_game_great_job(n, theta=np.array([1/6 for i in range(6)])):
    """Generates n results from the game defined above.

    Args:
        n (int): Number of plays of the game.
        theta (array, optional): A length 6 array specifying the probabilities
            of rolling each face of the die.

    Returns:
        array: A length n array of game results
    """
    output = []
    for i in range(n):
        Gi = np.random.choice(list(range(1, 7)), p=theta)  # die roll
        Si = np.random.binomial(6, Gi/6)                   # number of heads
        Ri = Gi + Si                                       # game result
        output.append(Ri)
    return np.array(output)
```

In [34]:

```
print(awesome_game_great_job(10))
```

Great, that was simple enough. Suppose now that you suspect something besides your precious coin collection has a bias in this game. The die! If this were true and you knew the bias, you could exploit it to win big! Let's begin by formalizing a statistical model to describe this process.

Defining our random variables:

$\boldsymbol{\theta}$ is a 6-dimensional vector describing the probability of rolling each of the six sides of the die

$G_{i}$ is the value of the die roll in the ith game

$S_{i}$ is the total number of heads flipped in the ith game

$R_{i}$ is the sum of $G_{i}$ and $S_{i}$ in the ith game

The story of how the game values were generated implies a particular conditional independence structure, illustrated in this Bayes net:

This should look **very** familiar.

Finally, we know the conditional distributions of our random variables:

$$S_{i} \: | \: G_{i} \sim \textrm{Binomial}\left(n=6,\: p=\frac{G_{i}}{6}\right)$$

$$G_{i} \: | \: \boldsymbol{\theta} \sim \textrm{Categorical}(\boldsymbol{\theta})$$

$$R_{i} \: | \: S_{i},\: G_{i} = S_{i} + G_{i} \quad \text{(deterministic)}$$

Now we will go through nearly the same steps as in lecture to develop an EM-based method to infer $\boldsymbol{\theta}$ in this problem. I will go through this quickly; if you want more details, check out the lecture:

In order to determine the MLE of the $\boldsymbol{\theta}$, we need the probability of our data given $\boldsymbol{\theta}$: $$p(\textbf{R}\:|\:\boldsymbol{\theta})$$

but we only know: $$p(\textbf{R}\:,\textbf{S}\:,\textbf{G}\:|\:\boldsymbol{\theta})$$

so, as before, we must marginalize out the unobserved (latent) variables in our problem:

$$p(\textbf{R}\:|\:\boldsymbol{\theta}) = \sum_{S} \sum_{G} p(\textbf{R}\:,\textbf{S}\:,\textbf{G}\:|\:\boldsymbol{\theta})$$

We saw in lecture that you could use conditional independencies to factorize the likelihood: $$p(\textbf{R}\:,\textbf{S}\:,\textbf{G}\:|\:\boldsymbol{\theta}) = p(\textbf{R}\:|\:\textbf{S}\:,\textbf{G})p(\textbf{S}\:|\:\textbf{G})p(\textbf{G}\:|\:\boldsymbol{\theta})$$

We have all of the equipment we need to define this likelihood, so let's do that here:

In [35]:

```
def g_given_theta(g, theta):
    """Compute p(g|theta)

    Args:
        g (int): Result from the die roll
        theta (array): A length 6 array specifying the probabilities of
            rolling each face of the die

    Returns:
        float: The probability of a roll g given theta
    """
    return theta[g-1]

def s_given_g(s, g):
    """Compute p(s|g)

    Args:
        s (int): Result from the coin flips
        g (int): Result from the die roll

    Returns:
        float: The probability of a total number of heads s given g
    """
    return stats.binom.pmf(s, n=6, p=g/6)

def r_given_s_g(r, s, g):
    """Compute p(r|s,g)

    Args:
        r (int): Result of the game
        s (int): Result from the coin flips
        g (int): Result from the die roll

    Returns:
        int: The probability of getting a result r given values s and g
            (either 0 or 1)
    """
    return int(int(r) == int(s) + int(g))

def super_cool_probability(r, s, g, theta):
    """Compute p(r,s,g|theta)

    Args:
        r (int): Result of the game
        s (int): Result from the coin flips
        g (int): Result from the die roll
        theta (array): A length 6 array specifying the probabilities of
            rolling each face of the die

    Returns:
        float: p(r,s,g|theta)
    """
    probability = r_given_s_g(r, s, g)*s_given_g(s, g)*g_given_theta(g, theta)
    return probability
```

Again, we will employ the EM trick of fixing $\boldsymbol{\theta}$ and finding $p(\textbf{G}\:|\:\textbf{R},\:\boldsymbol{\theta}^{t})$ through Bayes' rule: $$p(\textbf{G}\:|\:\textbf{R},\:\boldsymbol{\theta}^{t}) = \frac{p(\textbf{R},\:\textbf{G}\:|\:\boldsymbol{\theta}^{t})}{\sum_{G}p(\textbf{R},\:\textbf{G}\:|\:\boldsymbol{\theta}^{t})}$$

Finally substituting the marginals: $$p(\textbf{G}\:|\:\textbf{R},\:\boldsymbol{\theta}^{t}) = \frac{\sum_{S}p(\textbf{R},\:\textbf{S},\:\textbf{G}\:|\:\boldsymbol{\theta}^{t})}{\sum_{G}\sum_{S}p(\textbf{R},\:\textbf{S},\:\textbf{G}\:|\:\boldsymbol{\theta}^{t})}$$

and the conditional independencies: $$p(\textbf{G}\:|\:\textbf{R},\:\boldsymbol{\theta}^{t}) = \frac{\sum_{S}p(\textbf{R}\:|\:\textbf{S}\:,\textbf{G})p(\textbf{S}\:|\:\textbf{G})p(\textbf{G}\:|\:\boldsymbol{\theta}^{t})}{\sum_{G}\sum_{S}p(\textbf{R}\:|\:\textbf{S}\:,\textbf{G})p(\textbf{S}\:|\:\textbf{G})p(\textbf{G}\:|\:\boldsymbol{\theta}^{t})}$$

I will opt to compute this quantity in log space as an exercise, but it is not necessary in this case, given the small number of possible outcomes: $$\log\:p(\textbf{G}=i\:|\:\textbf{R}=r_{n},\:\boldsymbol{\theta}^{t}) = \log\sum_{S}p(r_{n}\:|\:\textbf{S}\:,\textbf{G}=i)p(\textbf{S}\:|\:\textbf{G}=i)\theta_{i}^{t} - \log\sum_{j}\sum_{S}p(r_{n}\:|\:\textbf{S}\:,\textbf{G}=j)p(\textbf{S}\:|\:\textbf{G}=j)\theta_{j}^{t}$$

In [36]:

```
def p_R_G_given_theta(R, theta):
    """Compute p(R=r,G=g|theta) for all r,g

    Args:
        R (array): Array of game results
        theta (array): A length 6 array specifying the probabilities of
            rolling each face of the die

    Returns:
        array: A len(R) by len(G) array storing values of p(R=r,G=g|theta)
    """
    outarr = []
    for r in R:
        row = []
        for g in range(1, 7):
            sum_over_s = []
            for s in range(0, 7):
                sum_over_s.append(super_cool_probability(r, s, g, theta))
            row.append(np.sum(np.array(sum_over_s)))
        outarr.append(row)
    return np.array(outarr)

def p_G_given_R_theta(R, theta, alpha=(10**-20)):
    """Compute p(G=g|R=r,theta) for all r,g

    Args:
        R (array): Array of game results
        theta (array): A length 6 array specifying the probabilities of
            rolling each face of the die
        alpha (float, optional): Pseudocount for numerical stability

    Returns:
        array: A len(R) by len(G) array storing values of p(G=g|R=r,theta)
    """
    numerator = np.log(p_R_G_given_theta(R, theta) + alpha)
    denominator = np.expand_dims(logsumexp(numerator, axis=1), 1)
    log_probability = numerator - denominator
    log_likelihood = np.sum(denominator)
    return np.exp(log_probability), log_likelihood
```

In [37]:

```
R = np.array([12, 12, 12, 12, 12, 12, 12, 12, 12, 12])
# R = np.array([1,1,1,1,1,1,1,1,1,1,1,1])
pG_R_theta,_ = p_G_given_R_theta(R,np.array([1/6 for i in range(6)]))
print(pG_R_theta)
print(pG_R_theta.shape)
```

To use this conditional to compute the updated value of $\theta$, we will need to compute the expected number of occurrences of each die roll in the dataset:

$$ c_{i} = \sum_{n} \:p(\textbf{G}=i\:|\:\textbf{R}=r_{n},\:\boldsymbol{\theta}^{t}) $$

Now, we may update $\theta$ by simply normalizing this value by the total number of occurrences:

$$ \theta_{i}^{t+1} = \frac{c_{i}}{\sum_{j} c_{j}}$$

In [38]:

```
def compute_counts(G_giv_R_theta):
    """Compute ci

    Args:
        G_giv_R_theta (array): A len(R) by len(G) array storing values of
            p(G=g|R=r,theta)

    Returns:
        array: A len(G) array storing values of ci
    """
    return np.sum(G_giv_R_theta, axis=0)

def new_theta(G_giv_R_theta):
    """Compute theta_i for the current iteration

    Args:
        G_giv_R_theta (array): A len(R) by len(G) array storing values of
            p(G=g|R=r,theta)

    Returns:
        array: A len(G) array storing values of theta_i
    """
    counts = compute_counts(G_giv_R_theta)
    return counts/np.sum(counts)
```

In [39]:

```
print(new_theta(pG_R_theta))
```

One last thing: we need a metric for convergence of the EM procedure. For this, we will just use the log likelihood, as before:

$$\textrm{log}\:p(\textbf{R}\:|\:\boldsymbol{\theta}^{t})$$

This happens to be identical to the denominator used for normalization in the function `p_G_given_R_theta`, with an added sum over the results $r_{n}$, so I just added an extra output back there.

Now we just run EM until convergence.

In [40]:

```
R = awesome_game_great_job(200,theta=np.array([1/12,1/12,1/12,1/12,1/12,7/12]))
```

In [41]:

```
def EM(R, theta_0=np.array([1/6 for i in range(6)]), eps=0.01):
    """Runs the EM procedure until the difference between log likelihoods on
    consecutive iterations is less than eps.

    Args:
        R (array): Array of game results
        theta_0 (array, optional): Initial value of theta
        eps (float, optional): Convergence threshold
    """
    theta = theta_0
    eps_t = 1
    old_log_likelihood = np.inf
    counter = 0
    while eps_t > eps:
        counter += 1
        G_giv_R_theta, log_likelihood = p_G_given_R_theta(R, theta)
        theta = new_theta(G_giv_R_theta)
        eps_t = abs(log_likelihood - old_log_likelihood)
        old_log_likelihood = log_likelihood
        print("Iteration " + str(counter) + ": " + str(eps_t))
    print(theta)
```

In [42]:

```
EM(R)
```

Great, so EM works. But it is sort of slow, and that's no fun. Here is my attempt at making a much faster version of the above EM procedure, using some tricks from the old numpy hat:

In [43]:

```
def G_given_theta(G, theta):
    return theta[G-1]

def S_given_G(S, G):
    output = []
    for g in G:
        output.append(stats.binom.pmf(S, n=6, p=g/6))
    return np.array(output)

def construct_dirac_delta_masks(G, S):
    # For each possible result r, mark which (g, s) pairs satisfy r = g + s
    cartesian_product = np.array(list(itertools.product(G, S))).reshape(len(G), len(S), 2)
    outer_sum = np.sum(cartesian_product, axis=2)
    masks = np.array([i == outer_sum for i in range(1, 13)])
    return masks

def R_given_S_G(R, S, G):
    masks = construct_dirac_delta_masks(G, S).astype(int)
    return masks[R-1]

def p_R_G_given_theta_fast(R, S, G, theta):
    probabilities = R_given_S_G(R, S, G)*np.expand_dims(S_given_G(S, G)*np.expand_dims(G_given_theta(G, theta), 1), 0)
    return np.sum(probabilities, axis=2)

def p_G_given_R_theta_fast(R, theta, alpha=(10**-20)):
    G = np.array([1, 2, 3, 4, 5, 6])
    S = np.array([0, 1, 2, 3, 4, 5, 6])
    numerator = np.log(p_R_G_given_theta_fast(R, S, G, theta) + alpha)
    denominator = np.expand_dims(logsumexp(numerator, axis=1), 1)
    log_probability = numerator - denominator
    log_likelihood = np.sum(denominator)
    return np.exp(log_probability), log_likelihood

def EM_fast(R, theta_0=np.array([1/6 for i in range(6)]), eps=0.01):
    theta = theta_0
    eps_t = 1
    old_log_likelihood = np.inf
    counter = 0
    while eps_t > eps:
        counter += 1
        G_giv_R_theta, log_likelihood = p_G_given_R_theta_fast(R, theta)
        theta = new_theta(G_giv_R_theta)
        eps_t = abs(log_likelihood - old_log_likelihood)
        old_log_likelihood = log_likelihood
        print("Iteration " + str(counter) + ": " + str(eps_t))
    print(theta)
```

In [46]:

```
R = awesome_game_great_job(1000000,theta=np.array([1/12,1/12,1/12,1/12,1/12,7/12]))
```

In [47]:

```
EM_fast(R)
```

If those functions were confusing, don't worry, we won't expect your implementations to run that quickly :) BUT your code will need to run faster than my unoptimized example, which will scale very poorly to the much larger dataset in the homework. Hint: you will have to do something about that pesky loop over all of the results/reads r.
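One way to act on that hint, in this game specifically (a sketch of my own, not the required homework solution; the helper names `build_joint_table` and `em_iteration_tabulated` are made up): since $r$ only takes values 1 through 12, the posterior $p(G=g\:|\:R=r,\:\boldsymbol{\theta})$ only depends on which of the 12 results occurred. So you can compute a 12-by-6 table once per iteration and weight its rows by how often each result appears, instead of looping over every read:

```python
import numpy as np
from scipy import stats

def build_joint_table():
    """Precompute T[r-1, g-1] = sum_s p(r|s,g) p(s|g) for r in 1..12, g in 1..6."""
    T = np.zeros((12, 6))
    for g in range(1, 7):
        for s in range(0, 7):
            # r = s + g is deterministic, so each (s, g) contributes to one r
            T[s + g - 1, g - 1] += stats.binom.pmf(s, n=6, p=g/6)
    return T

def em_iteration_tabulated(R, theta, joint_table):
    """One EM update exploiting the fact that r only takes 12 distinct values.

    joint_table[r-1, g-1] holds sum_s p(r|s,g) p(s|g); multiplying by theta
    gives p(r, g | theta), so the posterior table is shared by all reads with
    the same result r.
    """
    # p(r, g | theta) for the 12 distinct results, shape (12, 6)
    p_rg = joint_table * theta
    # posterior p(g | r, theta): normalize each row over g
    posterior = p_rg / p_rg.sum(axis=1, keepdims=True)
    # how many times each result r = 1..12 occurred in the data
    counts_r = np.bincount(R, minlength=13)[1:13]        # shape (12,)
    # expected die-roll counts: weight each posterior row by its frequency
    c = counts_r @ posterior                             # shape (6,)
    return c / c.sum()
```

With this, each iteration costs a fixed amount of work on the 12-by-6 table plus a single pass over the reads to count results, rather than a per-read posterior computation.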