From Nathan Nakatsuka on 11/3/2017. Available as a Jupyter notebook page download or in your web browser

1) Review of Lecture

a) Probabilistic Graphical Models (Bayesian Networks)

i) Hidden Variables

ii) Generative vs. Discriminative Models

b) Expectation-Maximization Algorithm

c) Application of EM to Motif Discovery

2) Problem set walkthrough

a) Useful python functions

One of the goals of statistical inference is to determine, given you have some data, what are the properties or distributions of the underlying variables? We have been doing this throughout the course and will provide new methods for doing that this week. In week 7's section notes we said that 3 aspects of modeling are building models, fitting models, and testing models. This week's material touches on all 3 of these components. Probabilistic graphical models we will see are a good way to help us build models by expressing the dependencies of the variables in a visual format. We will then talk about the EM algorithm which will help us to fit models by allowing us to find maximum likelihood estimates of parameters given a particular model that has hidden variables. Lastly, we will see how we can test models by comparing the log likelihood scores of two models (Watson's vs. ours) and seeing which one better explains the data (which forms the basis for some model selection methods).

There are two major types of PGMs: Bayesian networks and Markov random fields. Bayesian networks are directed acyclic graphs (DAGs), whereas Markov random fields are undirected. In this problem set we know the underlying causal relationships between the variables, so we will be using directed graphs (Bayesian networks).

The purpose of PGMs are that they allow you to visualize the relationships amongst different variables and see how the joint distribution is composed in terms of which variables, when conditioned on the values of some variables, are independent of the other variables. This will allow you to see your assumptions of the model up front and will simplify your decomposition of the joint probability distribution. All of this will help you to buiild more accurate models of the process underlying the generation of your data.

In class Sean often talks about "generative" models. A generative model is one that models the full distribution underlying the different classes. The K-means algorithm was an example of this because we attempted to model the cluster centers and their distributions. In contrast, discriminative models only learn the boundaries between classes. Examples of discriminative models are support vector machines (SVMs) and decision trees like the random forest algorithm. Generative models can be better because they can generate new data once you know the model (simuated data), which we will do in this problem set.

See Figure 1 of the lecture notes for an example of a PGM. In this problem set we will not account for the orientation of the reads so you can ignore the O whenever it shows up in the problem set (pretend you are marginalizing it out). In PGMs the shaded variable is your observed information, in this case the read counts for each "nucleotide" (a,b,c, etc.). The variable of interest is $\theta$, which are the transcript abundances (transcript 1 composes x percent of the total transcripts, transcript 2 composes y percent, etc.), which can also be expressed as "nucleotide abundances". Try not to be confused by this naming, "nucleotide abundances" mean what is the likelihood that a random nucleotide came from transcript 1, transcript 2, 3, etc. (more on this later).

The goal here is to determine $\theta$ given R. The first thought would be to use Bayes rule, so P($\theta$|R) = P(R|$\theta$)*P($\theta$)/P(R). However, since we don't have known priors for $\theta$ we can simply maximize the likelihood P(R|$\theta$). Unfortunately, there are hidden variables that prevent us from being able to maximize this directly. The hidden varible in this case is what transcripts did each read come from? These are G (the choice of transcript) and S (the choice of segment). This is a problem due to the fact that there are overlapping reads of different "genes". Other possible problems could be sequencing error or overlapping reads of different isoforms due to alternative splicing, both of which we can ignore in this problem set. Nevertheless, to solve this problem we need to use the expectation-maximization algorithm.

We'll review here the example of motif finding. In motif finding the goal is to find the motif as a position-specific probability weight matrix. You are given sequence data from a Chip-Seq experiment where the protein (e.g. transcription factor) has bound to somewhere in the sequence. The probability weight matrix (PWM) is a 4xW matrix where the 4 rows correspond to A,G,T,C and the columns are the positions of the motif. The columns sum to 1 since it is a probability of each nucleotide at each site. The goal is to maximize the probability of the sequence data given parameters (PWM). However, the hidden variable here are the positions of the motif in the sequences. If you knew the positions, then you could get the PWM by just lining up the motifs for all sequences and counting up the different nucleotides at each position, then normalizing to get probabilities. On the other hand, if you knew the PWM, you could get the positions by looping through each possible position and calculating the likelihood that it is the motif's true position, then using these probabilities to get counts by multiplying the likelihoods/probabilities of that position by the nucleotides observed there and then normalizing the counts. Another option is to do Gibbs sampling, which is to randomly pick a position based on the probabilities (rather than averaging over all possible positions).

The EM algorithm for motif finding would be to initialize a PWM randomly, then keep switching between calculating probabilities of start positions and using the probabilities of the start positions to get an updated PWM and repeating until convergence.

See notes for more detail. The multinomial distribution is the binomial for more than 2 categories (and the binomial is the sum of many Bernoullis, which you can think of like coin flips, so the multinomial you can think of as rolls of many die). The Dirichlet is the conjugate prior for the multinomial. Note that the c in the exponent of the joint distribution of prior and multinomial is like adding pre-counts in to the distribution (a stronger prior is adding in more pre-counts). An example of this is using last year's baseball season results as a prior for the anticipated number of runs a batter will have for the upcoming season.

Throughout this problem set you will have to convert transcript abundances to "nucleotide" abundances. Transcript abundances are the proportion of transcripts in the transcriptome (10 percent transcript 1, 5 percent transcript 2, etc.). The "nucleotide" abundances are the likelihood given you have a nucleotide that it came from a particular transcript. Both are length 10 arrays. Longer trnascripts will contribute more nucleotides to the overall pool, so to go from transcript abundance to nucleotide abundance, you have to normalize for length (t*L/$\sum$t*L).

The goal here is to create a simulator where given a locus structure and transcript abundances that you define, you simulate read counts for each nucleotide. The final output should be an array of length 10 (since there are 10 possible "nucleotides", a,b,c...j).

The key for this is to look at the generative model and follow the path to get to reads. The following is some pseudocode:

In [1]:

```
# Convert transcript abundances to nucleotide abundances.
# Repeat N times (N=number of reads you want to generate):
# Choose transcript according to probabilities nucfrac (nucfrac=nucleotide abundances). You can use np.random.choice(len(T),p=nucfrac)
# Choose a segment on the transcript with equal probability for all segments (1/L[transcript])
# Account for circularity (% 10)
# Add 1 count to the nucleotide corresponding to the segment chosen
```

The goal here is to create a function that can go from readcounts, L (locus architecture), and t (transcript abundances) to get a log likelihood of the data (readcounts) given L and t. Readcounts here is a length 10 matrix of the number of counts for each nucleotide (a through j). This is P(R|t,L) = P(R|S,G)*P(S|G)*P(G|$\theta$). P(R|S,G)=1 because there is no sequencing error. P(S|G)=1/length of transcript. P(G|$\theta$) is the nucleotide abundance.

In [ ]:

```
# Pseudocode
# Convert transcript abundances to to nucleotide abundances
# For each nucleotide, keep track of the probability for that nucleotide.
# Loop through all nucleotides.
# For the given nucleotide, think: What is the probability it arose from a particular transript and segment?.
# Loop through all possible transcripts
# Loop through all possible segments for a given transcript
# Account for circularity
# If the segment corresponds to the current nucleotide: (equivalent of kronecker delta)
# Add to probability: P(choice of transcript)*P(choice of segment), where P(choice of transcript) is nucleotide abundance, and P(choice of segment) is 1/length of transcript.
# loglikelihood += log (prob^num)=num*log(prob), where num is the number of reads obseved for that nucleotide.
# (note: since you are treating each nucleotide as independent of the others, you can just multiply by the number of reads observed).
```

You should get Watson's estimates by copy and pasting from the given code in the probelm set and using that to get Watson's tau. You can use this Watson estimator function (turn the code into a function) to calculate log likelihoods for simulated reads from a particular L, t combination. You should test your simulator and log likelihood function on Sean's example data at the end to see if it's working.

The hidden variable here is what transcript did each of the reads come from? The E step is the expectation value of the log likelihood of the hidden variable values and the M step is maximizing the likelihood to get parameters that maximize the quantity (nucleotide abundances). For this problem you should just operate with nucleotide abundances during the EM and then convert to transcript abundances after convergence.

To initialize you can set a random nucleotide abundance values.

In [ ]:

```
# The strategy for the E step is similar to likelihood calculation. You have read counts, nucleotide abundances, and L.
#Pseudocode:
# Loop through all nucleotides.
# For the given nucleotide keep track of probabilities
# Loop through all possible transcripts
# Loop through all possible segments for a given transcript
# Account for circularity
# If the segment corresponds to the current nucleotide: (equivalent of kronecker delta)
# Add to probability: P(choice of transcript)*P(choice of segment), where P(choice of transcript) is nucleotide abundance, and P(choice of segment) is 1/length of transcript.
# Normalize probabilities.
# Multiply probabilities by their respective read count number to get expected counts.
# M step:
# Just normalize the counts (counts/np.sum(counts)) to get nucleotide abundances.
```

Go back and forth between E and M 1,000 times then convert the nucleotide abundances to transcript abundances using the formula (nucfrac/length)/($\sum$nucfrac/length).