MCB112: Biological Data Analysis (Fall 2017)

recitation section week 4: Probability and Markov Models

(Nathan Nakatsuka 09/28/2017 adapted from notes by Tim Dunn)

Laplace and Bayes made significant contributions to probability theory by hammering home the idea of inverse probability. This concept is important because it provides a foundation for solving the inverse problem, the typical inductive task we, as scientists, are faced with when analyzing data.

Deductive vs. inductive logic

Deduction is used when a cause (or model) is known, and we would like to know the resulting effects (or the output from the model). This type of reasoning is common in mathematics, where relationships and formulae can be deduced from given axioms. Deductive reasoning also allows us to predict the odds of, say, a blackjack hand, knowing the number of cards in a deck, the number of cards dealt to each player, number of face cards revealed, etc. In this blackjack example, the model is everything you know about the game, and the model is composed of individual model parameters (i.e. number of cards in a deck, number of players, etc.) that together define the output of the model. When a model is nice and parameterized, the parameters themselves can be updated, and a change in model output can be observed (e.g. how do the odds change as fewer and fewer cards remain in the deck?). Deduction is also known as the forward problem.


Induction is used to go the other way, determining the cause from many observations of the effects. This is the problem that biologists are typically trying to solve. Given some experimental data, what was the underlying cause of the data (i.e. the true nature of the underlying system)? This inverse problem is usually harder to tackle because often data are consistent with many different possible causes (or combinations of causes). In the blackjack example, let’s suppose that you see 10 hands dealt in a row. With that information, can you now determine whether the dealer is playing with a standard deck of cards?


Laplace’s inverse problem

Laplace faced a similar inverse problem when confronted with the London and Paris census data. Given the somewhat biased boy/girl ratio in the census data, how could he go about determining whether the true birth rate was really 50:50 or something more like 51:49?

Laplace knew that he could solve the forward problem using the binomial distribution, which describes the probability of a certain number of “successes” in a string of independent binary (i.e. boy/girl) events (i.e. births).

Binomial distribution

To understand the binomial distribution, it often helps to think of many flips of a coin. In this case, each flip is independent of every other flip (independent events) and the outcome of each flip is binary (two possible outcomes: heads or tails; each such flip is a Bernoulli trial following the Bernoulli distribution, and a sequence of them is a Bernoulli process). The binomial distribution describes the probability (P) that a particular number of heads (H) occurs in a sequence of N flips. To illustrate these probabilities, we can build a simple table:

| N | possible outcomes | H=0 | H=1 | H=2 | H=3 | H=4 |
|---|-------------------|-----|-----|-----|-----|-----|
| 1 | {H, T} | 1 | 1 | | | |
| 2 | {HH, TH, HT, TT} | 1 | 2 | 1 | | |
| 3 | {HHH, HHT, THH, THT, TTH, TTT, HTH, HTT} | 1 | 3 | 3 | 1 | |
| 4 | {HHHH, HHHT, ..., TTTT} | 1 | 4 | 6 | 4 | 1 |

| N | possible outcomes | P(H=0) | P(H=1) | P(H=2) | P(H=3) | P(H=4) |
|---|-------------------|--------|--------|--------|--------|--------|
| 1 | {H, T} | 1/2 | 1/2 | | | |
| 2 | {HH, TH, HT, TT} | 1/4 | 2/4 | 1/4 | | |
| 3 | {HHH, HHT, THH, THT, TTH, TTT, HTH, HTT} | 1/8 | 3/8 | 3/8 | 1/8 | |
| 4 | {HHHH, HHHT, ..., TTTT} | 1/16 | 4/16 | 6/16 | 4/16 | 1/16 |
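
As a quick brute-force sanity check of the table, we can enumerate every possible flip sequence with itertools and count how often each value of H appears, reproducing each row of the probability table:

# Enumerate all 2^N equally likely flip sequences and count heads,
# reproducing the probability rows of the table above.
from itertools import product
from collections import Counter

for N in [1, 2, 3, 4]:
    counts = Counter(seq.count('H') for seq in product('HT', repeat=N))
    total = 2 ** N
    row = ['{}/{}'.format(counts[h], total) for h in range(N + 1)]
    print('N =', N, ':', ' '.join(row))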

Thus, you can see how H, and thus P(H), evolves as N changes. If you think a little bit about it, this table can be summarized as follows:

$$P(H \mid N) = \frac{1}{2^N}\,\frac{N!}{H!\,(N-H)!}$$

where $\frac{N!}{H!\,(N-H)!}$, which can be shortened to $\binom{N}{H}$, is the number of different ways that $H$ heads can be ordered in a sequence of $N$ flips.

This probability formula was specific to a fair coin, where the probability of heads (0.5) is equal to the probability of tails (0.5). It can be generalized to any weighted coin (or any binary process for that matter) with heads probability $p$:

$$P(H \mid N, p) = \frac{N!}{H!\,(N-H)!}\, p^H (1-p)^{N-H}$$

To simplify everything, we can also reduce the unwieldy factorial statement inside using the symbol $\binom{N}{H}$:

$$P(H \mid N, p) = \binom{N}{H}\, p^H (1-p)^{N-H}$$

By setting $N$ and $p$ to some value, you can then plot $P(H \mid N, p)$ for all possible values of $H$, generating what is known as a probability mass function, which is like a probability density function but for discrete variables (there is no possible way to get a partial number of heads).

We could write our own function to implement this mass function, but Python has already done it for us:

scipy.stats.binom.pmf(H, N, p)

# import stats and plotting packages
import scipy.stats as stats
import matplotlib.pyplot as plt

%matplotlib inline
# %config InlineBackend.figure_formats = {'svg',}

# Let's have Python plot the last row of our table from above
p = 0.5
N = 4
H = [0, 1, 2, 3, 4]

# Use python's built-in binomial equation to generate a probability for each H
# Note that because we are passing in a list of H, the pmf function returns a list of calculated P for each H
mypmf = stats.binom.pmf(H, N, p)

# Plot the result
plt.plot(H, mypmf, 'ko-')
plt.ylabel(r'$P(H \mid N = 4, p = 0.5)$', fontsize=15)
plt.xlabel('$H$', fontsize=15)


Because Python has streamlined everything for us, we can also generate the same mass function, but over many more flips:

# Let's do 1K flips
p = 0.5
N = 1000
H = list(range(0, N+1))

# Use python's built-in binomial equation to generate a probability for each H
# Note that because we are passing in a list of H, the pmf function returns a list of calculated P for each H
mypmf = stats.binom.pmf(H, N, p)

# Plot the result
plt.plot(H, mypmf, 'ko-')
plt.ylabel(r'$P(H \mid N = 10^{3}, p = 0.5)$', fontsize=15)
plt.xlabel('$H$', fontsize=15)


And we can start changing the value of $p$ to see the mass function for a biased coin, which, most noticeably, changes the position of the curve:

# Let's make the probability of heads p = 0.2
p = 0.2
N = 1000
H = list(range(0, N+1))

# Use python's built-in binomial equation to generate a probability for each H
# Note that because we are passing in a list of H, the pmf function returns a list of calculated P for each H
mypmf = stats.binom.pmf(H, N, p)

# Plot the result
plt.plot(H, mypmf, 'ko-')
plt.ylabel(r'$P(H \mid N = 10^{3}, p = 0.2)$', fontsize=15)
plt.xlabel('$H$', fontsize=15)


These plots illustrate how one can start with a model (the binomial distribution) and known model parameters ($N$ and $p$) and solve the forward problem. Namely, what's the likelihood that I see a certain number of heads $H$ given $N$ and $p$.

Laplace had $N$ (the total number of births) and $H$ (in his case the total number of boys, which we'll call $b$). Because he ultimately wanted to know whether the true birth rate was 50:50 based on these data, the first thing he did was solve the forward problem with $N = 493472$ and $p = 0.5$, checking the probability that $b = 251527$. Here's the pmf for $p = 0.5$:

# Let's use Laplace's parameters
p = 0.5
N = 493472
b = list(range(0, N + 1))

# Use python's built-in binomial equation to generate a probability for each b
# Note that because we are passing in a list of b, the pmf function returns a list of calculated P for each b
mypmf = stats.binom.pmf(b, N, p)

# Plot the result
plt.plot(b, mypmf, 'k-')
plt.ylabel(r'$P(b \mid N = 493472, p = 0.5)$', fontsize=15)
plt.xlabel('$b$', fontsize=15)


# P(b = 251527 | N = 493472, p = 0.5)
stats.binom.pmf(251527, N, p)

With $P(b = 251527 \mid N = 493472, p = 0.5)$ resulting in this tiny number. Naively, one might interpret this tiny probability as an indication that the 50:50 hypothesis is wrong. But as Sean pointed out in lecture, even $P(b = 246736 \mid N = 493472, p = 0.5)$, where you plug in exactly $N/2$ for $b$, has a small value.

# P(b = 246736 | N = 493472, p = 0.5)
stats.binom.pmf(246736, N, p)

The knee-jerk reaction to reject these scenarios based on low likelihoods is flawed: the probabilities are low only because we are asking for an exact match among the many possible outcomes in the mass function, and the number of possible outcomes grows as N grows.

# Let's watch the pmf evolve as N increases
p = 0.5
N = [10, 100, 200, 500]
plt.figure()

for N_ in N:
    b = list(range(0, N_ + 1))
    mypmf = stats.binom.pmf(b, N_, p)
    plt.plot(b, mypmf, '.-', label="$N = {}, p = 0.5$".format(N_))

plt.ylabel(r'$P(b \mid N, p)$', fontsize=15)
plt.xlabel('$b$', fontsize=15)
plt.legend()


The inverse problem

Besides, Laplace was actually interested in the inverse problem, not the forward problem. Namely, $P(p \mid b, N)$. That is, what is the probability that the boy birth probability is equal to $p$, across all values of $p$ (with particular interest in $p \leq 0.5$ versus $p > 0.5$)?

But how do you calculate this inductive inverse probability? As Sean mentioned in lecture, Laplace's big logical leap was realizing that

$$\frac{P(p_1 \mid b, N)}{P(p_2 \mid b, N)} = \frac{P(b \mid N, p_1)}{P(b \mid N, p_2)}$$

which directly relates the forward problem (which you've now seen is easy to solve) to the inverse problem.

Using this relationship, we can pull out $P(p \mid b, N)$ alone (instead of its ratio to another probability) by normalizing by the sum over all possible values in both denominators:

$$\frac{P(p \mid b, N)}{\sum_{p'} P(p' \mid b, N)} = \frac{P(b \mid N, p)}{\sum_{p'} P(b \mid N, p')}$$

Because probabilities sum to one, $\sum_{p'} P(p' \mid b, N) = 1$, and everything simplifies to:

$$P(p \mid b, N) = \frac{P(b \mid N, p)}{\sum_{p'} P(b \mid N, p')}$$

If we now calculate this for every possible $p$ (a continuous value between 0 and 1), the sum becomes an integral and we end up with the continuous probability density function

$$P(p \mid b, N) = \frac{P(b \mid N, p)}{\int_0^1 P(b \mid N, p')\, dp'}$$

which will indicate the probability that $p$ equals any arbitrary value from 0 to 1, given some known values of $b$ and $N$!
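
As a quick numerical sketch of this density, we can evaluate the binomial likelihood on a grid of $p$ values and normalize it (the grid range and spacing here are arbitrary choices of mine, and a uniform prior over $p$ is assumed, as in the derivation above):

# Approximate the posterior density P(p | b, N) on a grid of p values by
# normalizing the binomial likelihood (assumes a uniform prior on p).
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

N = 493472
b = 251527

p_grid = np.linspace(0.50, 0.52, 2001)       # grid spanning the region of interest (arbitrary choice)
likelihood = stats.binom.pmf(b, N, p_grid)   # forward probability P(b | N, p) at each grid point
dp = p_grid[1] - p_grid[0]
posterior = likelihood / (likelihood.sum() * dp)  # normalize so the density integrates to ~1

plt.plot(p_grid, posterior, 'k-')
plt.ylabel(r'$P(p \mid b, N)$', fontsize=15)
plt.xlabel('$p$', fontsize=15)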

Capitalizing on cumulative probability

The problem with a continuous probability distribution, however, is that while the integral over the distribution is 1, the probability of any single exact value is infinitesimally small (think of the graphical illustration, above, of the pmf shrinking as it gets spread over more and more points in the domain). From the probability density function, you can approximate the probability of a specific value in a local neighborhood, but you can also use the function to ask for the total probability over a whole range of values in the distribution. Given that Laplace was interested in determining whether the observed birth proportion was real, he decided to ask what the probability was that $p \leq 0.5$, essentially pitting $p \leq 0.5$, which of course contains $p = 0.5$, against $p > 0.5$.

To determine the winner, Laplace asked for the ratio of the area under the distribution from $p = 0$ to $p = 0.5$ to the total area. That is, what share of the distribution did $p \leq 0.5$ occupy?

$$P(p \leq 0.5 \mid b, N) = \frac{\int_0^{0.5} P(b \mid N, p)\, dp}{\int_0^1 P(b \mid N, p)\, dp} = \frac{\int_0^{0.5} p^b (1-p)^{N-b}\, dp}{\int_0^1 p^b (1-p)^{N-b}\, dp}$$

As Sean pointed out in lecture, this expression is closely related to the beta integral, and Python conveniently gives us a function to solve it:

import scipy.special as special

p = 0.5
N = 493472
b = 251527

special.betainc(b+1, N-b+1, p)

# Now let's plot P(p <= 0.5 | b, N) for different values of b
p = 0.5
N = 493472
b = list(range(240000,260000,10))

share = [None] * len(b)
for idx, point in enumerate(b):
    share[idx] = special.betainc(point+1, N-point+1, p)

# The red circle shows b = N/2 (exactly 50% boys) for reference.
plt.plot(N/2, special.betainc(N/2+1, N-N/2+1, p), 'or', markersize=20)
plt.plot(b, share)
plt.ylabel(r'$P(p \leq 0.5 \mid b, N)$', fontsize=15)
plt.xlabel('$b$', fontsize=15)


Note: the following notes are a supplement to the lecture notes (you should read them side by side).

What is Bayesian Analysis?

Bayesian analysis gives you a formal way to update your beliefs about something. You start with a prior (e.g. P(Sean is a good professor) = P(S_good) = 0.5), and then this gets updated with new data. For example, P(S_good | Sean tells a bad joke) = P(Sean tells a bad joke | S_good) * P(S_good) / P(Sean tells a bad joke). P(Sean tells a bad joke | S_good) is the likelihood: the probability of the data (the bad joke) under the hypothesis that Sean is a good professor. The normalization is P(Sean tells a bad joke), which can be written as P(Sean tells a bad joke | S_good) P(S_good) + P(Sean tells a bad joke | S_bad) P(S_bad).
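
As a rough numerical sketch of that update (the likelihood values below are invented purely for illustration):

# A toy Bayesian update with made-up numbers (the two likelihoods are invented).
prior_good = 0.5                      # P(S_good)
prior_bad = 1 - prior_good            # P(S_bad)
p_badjoke_given_good = 0.3            # P(bad joke | S_good), assumed
p_badjoke_given_bad = 0.8             # P(bad joke | S_bad), assumed

# normalization: P(bad joke) = sum over hypotheses of likelihood * prior
p_badjoke = p_badjoke_given_good * prior_good + p_badjoke_given_bad * prior_bad

# Bayes' rule: posterior = likelihood * prior / normalization
posterior_good = p_badjoke_given_good * prior_good / p_badjoke
print('P(S_good | bad joke) =', posterior_good)   # 0.15 / 0.55 ~= 0.27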

What is a likelihood?

A likelihood is a statistical tool that can be used to estimate the parameters underlying some data. It's generally associated with the frequentist perspective (vs. the Bayesian approach). The likelihood is usually of the form P(Data | p), where p is the parameter of interest. Viewed as a function of the parameter, it shows the relative likelihood of the data for different values of that parameter. You can use this to determine the most likely parameter value by finding the maximum of the function (i.e. for what parameter value is the relative likelihood highest?). To maximize the function you usually take the derivative, set it to 0 (this is where the slope is 0 and changing from positive to negative), and check that the second derivative is negative (so the critical point is a maximum rather than a minimum). To enforce a constraint such as the probabilities summing to 1, you can use Lagrange multipliers, which basically just involve adding a term $\lambda\,(g(x,y) - c)$ to your likelihood function and solving, where the constraint $g(x,y) = c$ here is $\sum P = 1$.
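
As a quick sketch of what a likelihood function looks like, here we plot the binomial likelihood $P(H \mid N, p)$ as a function of $p$ for a small invented data set (7 heads out of 10 flips); the maximum sits at the analytical MLE $\hat{p} = H/N$:

# The binomial likelihood as a function of p, for invented data.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

N, H = 10, 7                                  # invented data: 7 heads in 10 flips
p_grid = np.linspace(0.001, 0.999, 999)
likelihood = stats.binom.pmf(H, N, p_grid)    # likelihood of each candidate p

plt.plot(p_grid, likelihood, 'k-')
plt.axvline(H / N, color='r', linestyle='--') # analytical maximum at p = H/N
plt.ylabel(r'$P(H \mid N, p)$', fontsize=15)
plt.xlabel('$p$', fontsize=15)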

What are log likelihood ratios?

Log likelihood ratios allow you to express the relative likelihoods of different hypotheses. Usually this is expressed as $$\log\left(\frac{P(Data \mid H_1)}{P(Data \mid H_0)}\right)$$ where $H_0$ and $H_1$ are the two hypotheses being compared; note that this is not the same thing as the posterior probability $P(H_1 \mid Data)$, which would also require priors (see lecture notes).
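
As a small sketch with invented data and hypotheses, a log likelihood ratio for a coin could be computed like this:

# Log likelihood ratio comparing two hypotheses about a coin,
# H1: p = 0.7 vs. H0: p = 0.5, for invented data (7 heads in 10 flips).
import scipy.stats as stats

N, H = 10, 7                                   # invented data
logL_H1 = stats.binom.logpmf(H, N, 0.7)        # log P(data | H1)
logL_H0 = stats.binom.logpmf(H, N, 0.5)        # log P(data | H0)
llr = logL_H1 - logL_H0                        # log likelihood ratio

print('log likelihood ratio (H1 vs H0):', llr) # > 0 favors H1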

What are pdfs and cdfs?

Probability density functions (pdfs) are functions that show the relative probability density of a continuous random variable at each possible value (the discrete analogue, as above, is the probability mass function). These must integrate (or sum) to 1. Cumulative distribution functions (cdfs) are essentially the integral of the pdf up to a given value: they give the probability that the random variable takes a value less than or equal to that specific value. The derivative of the cdf at each point is the pdf.
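
As a quick numerical check of that last relationship, we can compare the finite-difference derivative of a cdf to the pdf, using a standard normal distribution just as an example:

# Check that the derivative of the cdf recovers the pdf (standard normal example).
import numpy as np
import scipy.stats as stats

x = np.linspace(-4, 4, 801)
pdf = stats.norm.pdf(x)
cdf = stats.norm.cdf(x)

dcdf_dx = np.gradient(cdf, x)   # finite-difference derivative of the cdf

print('max difference between pdf and d(cdf)/dx:', np.abs(pdf - dcdf_dx).max())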

What are beta integrals?

Beta integrals (or Beta functions) are the normalization of the Beta distribution. The Beta distribution is usually used to obtain a prior for the parameter underlying distributions like the Bernoulli distribution, because the Beta distribution is a distribution over values between 0 and 1 (and since probabilities are between 0 and 1, it works well as the "probability of a certain probability"). The beta integral is the denominator of the Beta distribution that is needed to make it integrate to 1. The Laplace example above used a binomial likelihood, so integrating that likelihood over $p$ gives a beta integral, and the incomplete beta function gives the cumulative probability $P(p \leq 0.5 \mid b, N)$ in Laplace's ratio (the binomial coefficients cancel out).
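
As a sanity check with small made-up numbers, we can verify that scipy.special.betainc matches the ratio of integrals from the Laplace example above:

# Check that the regularized incomplete beta function equals the ratio of
# integrals of p^b (1-p)^(N-b) (the binomial coefficients cancel in the ratio).
import scipy.special as special
from scipy.integrate import quad

N, b = 20, 12   # small invented numbers to keep the integrands well behaved

integrand = lambda p: p**b * (1 - p)**(N - b)
numerator, _ = quad(integrand, 0, 0.5)     # area from p = 0 to p = 0.5
denominator, _ = quad(integrand, 0, 1)     # total area

print('ratio of integrals:', numerator / denominator)
print('betainc           :', special.betainc(b + 1, N - b + 1, 0.5))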

What are Markov Models?

Markov models are useful for more efficient analysis of a system whose parts depend mostly on their local context (and not on things far away). In the simplest case (a 1st-order Markov model), each state of the system depends only on the immediately preceding state. So in the case of a DNA sequence, the identity of a nucleotide (A, G, C, or T) at a particular position depends only on the previous nucleotide, and not on any of the nucleotides before or after that. This is only one step up from the i.i.d. (independent and identically distributed) case, where all nucleotides are independent of all other nucleotides in the sequence. This can be extended to 2nd-order Markov models, where the nucleotide at a particular position depends only on the last two nucleotides, and so on for higher-order Markov models. Markov models allow more efficient computation because you don't need to account for any dependencies except the very local context.
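
As a rough sketch of a 1st-order Markov model for DNA (the initial and transition probabilities below are made up for illustration), we can generate a sequence from the model and compute its log probability:

# Generate a DNA sequence from a made-up 1st-order Markov model and score it.
import numpy as np

alphabet = ['A', 'C', 'G', 'T']
idx = {nt: i for i, nt in enumerate(alphabet)}

p0 = np.array([0.25, 0.25, 0.25, 0.25])    # initial nucleotide probabilities (assumed uniform)
T = np.array([[0.4, 0.2, 0.2, 0.2],        # transition probabilities P(next | A) (made up)
              [0.2, 0.4, 0.2, 0.2],        # P(next | C)
              [0.2, 0.2, 0.4, 0.2],        # P(next | G)
              [0.2, 0.2, 0.2, 0.4]])       # P(next | T)

np.random.seed(0)

# generate a sequence of length L from the model
L = 50
seq = [np.random.choice(alphabet, p=p0)]
for _ in range(L - 1):
    seq.append(np.random.choice(alphabet, p=T[idx[seq[-1]]]))
seq = ''.join(seq)

# log probability of that sequence under the same model
logprob = np.log(p0[idx[seq[0]]])
for prev, curr in zip(seq, seq[1:]):
    logprob += np.log(T[idx[prev], idx[curr]])

print(seq)
print('log P(seq | model) =', logprob)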

Another way to alter the i.i.d. model is to remove the identically-distributed assumption so that the positions of the nucleotides actually matter. For example, in DNA binding motifs, one position might be more likely to be an A while another position might be more likely to be a T, etc. In this week's problem set, however, we won't deal with this.

ROC plots

ROC plots are plots for determining how well your classifier is doing. An ROC plot shows sensitivity (true positive rate) vs. false positive rate (1 − specificity) for all possible thresholds. The diagonal is the equivalent of random chance, while anything above the diagonal is better, because as you relax the threshold (moving to the right along the curve), the true positive rate increases faster than the false positive rate.
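
As a sketch, here is an ROC curve built by sweeping a threshold over scores drawn from two invented score distributions (one for true positives, one for true negatives):

# ROC curve from simulated scores (the score distributions are made up).
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
pos_scores = np.random.normal(1.0, 1.0, 1000)   # scores for true positives
neg_scores = np.random.normal(0.0, 1.0, 1000)   # scores for true negatives

thresholds = np.sort(np.concatenate([pos_scores, neg_scores]))[::-1]
tpr = [(pos_scores >= t).mean() for t in thresholds]   # sensitivity
fpr = [(neg_scores >= t).mean() for t in thresholds]   # 1 - specificity

plt.plot(fpr, tpr, 'k-')
plt.plot([0, 1], [0, 1], 'r--')    # random-chance diagonal
plt.xlabel('false positive rate', fontsize=15)
plt.ylabel('true positive rate', fontsize=15)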

Problems with ROC plots: Positive Predictive Value and False Discovery Rate as Alternatives

In biology there are often difficulties with ROC plots because in reality there are far, far more negatives than there are positives; your prior for seeing a negative is thus much higher than for seeing a positive. False Discovery Rate (FDR) and Positive Predictive Value (PPV) account for this, because they compare the true positives and false positives directly. So if you know that 99 percent of the things you're looking at are supposed to be background noise, then in the formula FDR = FP/(TP + FP) you are essentially weighting the false positive rate by 0.99 and the true positive rate by 0.01, to account for the fact that only 1 percent of the things you are looking at can possibly be true positives.
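
As a toy calculation (the sensitivity and specificity values are made up), here is how a seemingly good classifier still ends up with a high FDR when 99 percent of cases are truly negative:

# Even a classifier with 90% sensitivity and 95% specificity (made-up numbers)
# has a high false discovery rate when 99% of cases are truly negative.
n_total = 100000
n_pos = int(0.01 * n_total)      # 1% true positives
n_neg = n_total - n_pos          # 99% true negatives

sensitivity = 0.90               # P(call positive | truly positive), assumed
specificity = 0.95               # P(call negative | truly negative), assumed

TP = sensitivity * n_pos
FP = (1 - specificity) * n_neg

FDR = FP / (TP + FP)
PPV = TP / (TP + FP)             # PPV = 1 - FDR

print('FDR =', FDR)              # ~0.85: most "positive" calls are false
print('PPV =', PPV)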

Simpson’s Paradox

The FDR is related to Simpson's paradox, which can be shown by a simple illustration. Let's say there are two basketball players, Stephen Curry and Dwight Howard. Let's say (made-up numbers) Stephen Curry shoots 3-pointers at 45 percent accuracy while Dwight Howard shoots 3-pointers at 10 percent accuracy. In addition, Stephen Curry shoots 2-pointers at 75 percent accuracy while Dwight Howard shoots 2-pointers at 65 percent accuracy. However, Stephen Curry's overall shooting percentage is 50 percent while Dwight Howard's overall shooting percentage is 60 percent. How can that be? Isn't Stephen Curry better at shooting from both 3-point and 2-point range? The resolution of the paradox is in the number of attempts. Dwight Howard takes a much larger share of his shots as 2-pointers than Stephen Curry does, so more of his overall percentage comes from his 2-point percentage, and since that (65 percent) is higher than Stephen Curry's 3-point percentage (45 percent), it is possible for Dwight to still have a better overall shooting percentage. In the case of FDR, it could be that at a certain threshold a true pathogen is more likely to be called a pathogen than a non-pathogen is, but if there are vastly more negatives than positives in the dataset, a positive call can still have a greater than 50 percent chance of being a false positive.
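
Here is the arithmetic with hypothetical shot counts chosen to reproduce the made-up percentages above (only the percentages come from the example; the attempt counts are invented):

# Hypothetical shot counts consistent with the made-up percentages in the text.
curry_3pt_made, curry_3pt_att = 45, 100    # 45% on threes
curry_2pt_made, curry_2pt_att = 15, 20     # 75% on twos
howard_3pt_made, howard_3pt_att = 1, 10    # 10% on threes
howard_2pt_made, howard_2pt_att = 65, 100  # 65% on twos

curry_overall = (curry_3pt_made + curry_2pt_made) / (curry_3pt_att + curry_2pt_att)
howard_overall = (howard_3pt_made + howard_2pt_made) / (howard_3pt_att + howard_2pt_att)

print('Curry overall :', curry_overall)    # 60/120 = 0.50
print('Howard overall:', howard_overall)   # 66/110 = 0.60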

Occam’s Razor and Bayesian analysis

Occam’s razor is the thought that simpler models (those with fewer parameters) are usually right. This is because models with too many parameters tend to over-fit to the training data and have poor predictive value for new data that comes in (test data). See lecture notes for details, but the gist of it is that Bayesian analyses automatically penalize models with too many parameters. This is because Bayesian priors end up spreading across all of the parameters so the posterior probability distribution of a model with many parameters will be less concentrated at the true parameter than in models that have fewer parameters.