Section 4: Intro to Probability

Notes by Kate Shulgina [10/5/18], adapted from Tim Dunn [9/23/16]

(You can download these notes as a Jupyter notebook.)

Laplace made significant contributions to probability theory by hammering home the idea of inverse probability. This concept is important because it provides a foundation for solving the inverse problem, the typical inductive task we, as scientists, are faced with when analyzing data.

Deductive vs. inductive logic

Deduction is used when a cause (or model) is known, and we would like to know the resulting effects (or the output from the model). This type of reasoning is common in mathematics, where relationships and formulae can be deduced from given axioms. But deductive reasoning also allows us to predict the odds of, say, a blackjack hand, knowing the number of cards in a deck, the number of cards dealt to each player, number of face cards revealed, etc. In this blackjack example, the model is everything you know about the game, and the model is composed of individual model parameters (i.e. number of cards in a deck, number of players, etc.) that together define the output of the model. When a model is nice and parameterized, the parameters themselves can be updated, and a change in model output can be observed (e.g. how do the odds change as fewer and fewer cards remain in the deck?). Deduction is also known as the forward problem.

Deduction:

$$ \begin{align} cause &\to effect \\ model &\to output \end{align} $$

Induction is used to go the other way, determining the cause from many observations of the effects. This is the problem that biologists are typically trying to solve. Given some experimental data, what was the underlying cause of the data (i.e. the true nature of the underlying system)? This inverse problem is usually harder to tackle because often data are consistent with many different possible causes (or combinations of causes). In the blackjack example, let's suppose that you see 10 hands dealt in a row. With that information, can you now determine whether the dealer is playing with a standard deck of cards?

Induction:

$$ \begin{align} effects &\to cause(s) \\ output &\to model \ parameter(s) \end{align} $$

Laplace's inverse problem

Laplace faced a similar inverse problem when confronted with the London and Paris census data. He observed that the boy/girl ratio in the census data was not quite 50:50—how could he go about determining whether the actual birth rate was biased or whether the ratio he observed was simply due to noise?

Laplace knew that he could solve the forward problem using the binomial distribution, which describes the probability of a certain number of "successes" in a string of independent binary (i.e. boy/girl) events (i.e. births).

Random variables and probability models

Before we dive into binomial distributions, let's review some basic probability concepts.

A random variable is a variable that takes on one of a defined set of values with some probability. Random variables are usually written as uppercase letters, and the values they assume are written lowercase. The set of values that a random variable can assume is called the sample space.

For example, let $W$ be the random variable of the weather tomorrow. Let's say the weather can be either sunny or rainy, so the sample space is $W \in \{sunny, rainy\}$.

The PMF (or probability mass function) describes how to compute the probability of any value in the sample space. So for example, the PMF could look like $$P(W=w) = \left\{ \begin{array}{ll} 0.3 & \text{if } w = sunny \\ 0.7 & \text{if } w = rainy \\ \end{array} \right.$$

We're saying that if there were an infinite number of identical parallel universes, then we think that 70% of them would have a rainy day tomorrow and 30% of them would have a sunny day. The probabilities over the entire sample space always sum to 1.

As scientists, we often use probability models to explicitly write down our hypothesis of how some system works. We can use probability models to quantify how well the observed data fit the model we defined, or whether one model is better than another. When we define a probability model, we typically pick a distribution and values for its parameters.

So for the weather example, the distribution that we picked was the Bernoulli distribution, which describes the probability of a single yes-no event. We write that as $$ W \sim \text{Bernoulli}(p=0.3)$$

which is read verbally as "W is distributed Bernoulli with parameter $p$ = 0.3". The Bernoulli distribution has the PMF that we saw above. Notice that we also have to pick a value for the parameter $p$ if we're going to model something with this probability distribution. When we compare how well different probability models fit our data, we often are comparing different values for the parameters.
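
We can evaluate this PMF in Python as well. A minimal sketch, assuming an arbitrary numerical encoding of the sample space (1 = sunny, 0 = rainy) and using the scipy.stats.bernoulli distribution object:

In [ ]:
import scipy.stats

# Bernoulli PMF for the weather model; 1 = sunny, 0 = rainy (arbitrary encoding choice)
p = 0.3
print('P(W=sunny) = %.1f' % scipy.stats.bernoulli.pmf(1, p))
print('P(W=rainy) = %.1f' % scipy.stats.bernoulli.pmf(0, p))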

Binomial distribution

A binomial distribution can be thought of as the result of $N$ Bernoulli experiments. It describes the number of successes out of $N$ Bernoulli trials, each with probability $p$ of a success. So in Laplace's census data, the birth of each child is an independent Bernoulli trial, where each child has some probability $p$ of being a boy and probability $1-p$ of being a girl. We write that we are modelling the total number of boys $B$ out of a total number of births $N$ with a binomial distribution:

$$B \sim \text{Binomial}(p, N)$$

The sample space of $B$ is $\{0,1,2,...,N\}$.

The binomial distribution has the following PMF:

$$P(B=b) = \binom{N}{b}p^{b}{(1-p)}^{N-b}$$

To make it more concrete, suppose there are only 5 births (so $N = 5$) and 2 of them are boys ($b=2$). If we model births with a binomial distribution with even probability of either gender ($p=0.5$) then we can calculate the probability of getting 2 boys as:

$$P(B=2 \mid N=5, p=0.5) = \binom{5}{2}0.5^{2}{(1-0.5)}^{5-2} = 0.3125$$

The binomial coefficient is shorthand for $\frac{5!}{2!(5-2)!}$ and is read as "five choose two". It gives you the number of ways you can get two boys and three girls.

Why do we need the binomial coefficient in the PMF formula? In our example, the probability of boys and girls is the same, so the ordering {G,G,G,G,G} has the same probability as the ordering {B,G,G,G,G}. However, we are counting the total number of boys, and different totals can be reached by different numbers of orderings. There is only one way to get no boys ({G,G,G,G,G}), five ways to get 1 boy ({B,G,G,G,G}, {G,B,G,G,G}, etc.), ten ways to get 2 boys, and so on. The binomial coefficient counts how many orderings give the same total, so totals that can be reached in more ways get proportionally higher probability.
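
We can double-check those counts directly. A quick sketch using scipy.special.comb, one of several ways to compute binomial coefficients:

In [ ]:
from scipy.special import comb

# number of distinct birth orders (out of 5 births) containing exactly k boys
for k in range(6):
    print('%i boys: %i orderings' % (k, comb(5, k, exact=True)))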

Instead of calculating the binomial PMF by hand, we can have Python do it for us with the handy scipy.stats module:

In [ ]:
import scipy.stats
import numpy as np

b = 2                                 # number of boys
N = 5                                 # total number of children born
p = 0.5                               # probability of a boy birth, in our model
scipy.stats.binom.pmf(b, N, p)        # evaluates PMF at P(B=b)

Let's use Python to compute the probability of Laplace's observations, if the probability of boys and girls is equal. He observed in the Paris census data that there were 493472 births, 251527 of which were boys.

In [ ]:
b = 251527                            # number of boys
N = 493472                            # total number of children born
p = 0.5                               # probability of a boy birth, in our model
scipy.stats.binom.pmf(b, N, p)        # evaluates PMF at P(B=b)

The probability is tiny! Does this mean that $p=0.5$ hypothesis is wrong? Not necessarily. As Sean pointed out in lecture, even $P(b = 246736 \mid N = 493472, p = 0.5)$, where you plug in exactly $N/2$ for $b$, has a small value.

In [ ]:
b = 246736                            # number of boys
N = 493472                            # total number of children born
p = 0.5                               # probability of a boy birth, in our model
scipy.stats.binom.pmf(b, N, p)        # evaluates PMF at P(B=b)

This is because any particular value of $b$ is unlikely when there are so many other similarly probable values. Indeed, for a continuous distribution, any particular value for the random variable has probability 0 because there are infinitely many possible values.

We can use the power of modern-day computing to visualize the entire PMF in Python over all possible values for $b$ to get a better sense of what's going on:

In [ ]:
import matplotlib.pyplot as plt

# Set the parameters
p = 0.5                        # equally probable to get boys and girls
N = 493472                     # total number of births
B = list(range(0, N+1))        # a list of all possible values of b

# Use scipy.stats binomial equation to generate a probability for each b
# Note that because we are passing in a list of b, the pmf function returns an array with the calculated P for each b
mypmf = scipy.stats.binom.pmf(B, N, p)

# Plot the result
plt.plot(B, mypmf)
plt.ylabel('$P(B \mid N = 493472, p = 0.5)$', fontsize=15)
plt.xlabel('$B$', fontsize=15)
plt.show()
In [ ]:
# Plot the result
plt.plot(B, mypmf)
plt.ylabel('$P(B \mid N = 493472, p = 0.5)$', fontsize=15)
plt.xlabel('$B$', fontsize=15)
plt.xlim((240000,255000))
plt.show()
In [ ]:
# Sanity check: does PMF sum to 1?
print('Binomial PMF sums to: %f' % sum(mypmf))
In [ ]:
## other useful functions:
# compute the cumulative probability, P(B<=b)
scipy.stats.binom.cdf(b, N, p)
In [ ]:
# sample 10 from binomial distribution with these parameters
np.random.binomial(N, p, 10)

Likelihoods and maximum likelihood parameter estimation

What Laplace really wanted to figure out was whether the observed frequency of boy births in Paris was inconsistent with a model where boys and girls are born equiprobably. Namely, he wanted to compare two models, one where $p=0.5$ and one where $p=\hat{p}^{ML}$ (the maximum likelihood estimate for $p$).

Maximum likelihood estimation is a way of making a best guess for the value of a model parameter from some data. As the name suggests, it is the value of the parameter that maximizes the likelihood.

Likelihood is a confusing bit of terminology. The likelihood of a model is defined as the probability of the data, given a model. It is the solution to the forward problem that we are familiar with:

$$P(data \mid model) = P(b \mid N, p)$$

However, now we think about the probability of the observation as a function of the model parameter. Some values of $p$ give your data a higher probability under the model (like $p=0.51$) than others (like $p=0.1$, which is pretty bad).

We can use Python to visualize this.

In [ ]:
# Set the parameters
p_vec = np.linspace(0,1,10000)
N = 493472
b = 251527 

# Use scipy.stats binomial equation to calculate each likelihood
# We can pass in a list of p's!
likelihoods = scipy.stats.binom.pmf(b, N, p_vec)

# Plot the result
plt.plot(p_vec, likelihoods)
plt.ylabel('$P(B \mid N = 493472, p)$', fontsize=15)
plt.xlabel('$p$', fontsize=15)
plt.xlim((0.475,0.55))
plt.show()

Notice that now $p$ is the x-axis!

We can find the maximum likelihood estimate of $p$ by just picking off the best value. The numpy argmax function can help us here.

In [ ]:
# find the index that maximizes the likelihood
pml_index = np.argmax(likelihoods)
p_mle = p_vec[pml_index]
print('The maximum likelihood estimate for p is %f' % p_mle)

Don't confuse this with the PMF, which gave a very similar plot earlier. Unlike the PMF, the integral of this function does not have to equal 1.

A cleaner way to calculate $\hat{p}^{ML}$ would be to do it mathematically: take the derivative of the likelihood with respect to $p$, set it to 0, and solve for $p$. If you were to do that, you would find that the ML estimate for $p$ is simply the observed frequency of boys: $$\hat{p}^{ML} = \frac{b}{N}$$ Most of the time, you'll find that the maximum likelihood estimate is also the obvious and intuitive one.
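
In brief, working with the log-likelihood (which peaks at the same $p$ as the likelihood itself), setting the derivative to zero gives:

$$ \begin{align} \log P(b \mid N, p) &= \log\binom{N}{b} + b\log p + (N-b)\log(1-p) \\ \frac{d}{dp}\log P(b \mid N, p) &= \frac{b}{p} - \frac{N-b}{1-p} = 0 \quad\Rightarrow\quad \hat{p}^{ML} = \frac{b}{N} \end{align} $$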

Model comparison

Now we want to compare the two models that Laplace was considering: one where $p=0.5$ and one where $p=\hat{p}^{ML}$. We know how to calculate likelihoods, $P(data\mid model)$, but that's not a value that directly answers our questions.

We wish we could calculate and compare $P(model \mid data) = P(p \mid b, N)$ instead, because then we could directly compare those values and see which $p$ is more likely.

But how do you calculate this inductive inverse probability? As Sean mentioned in lecture, Laplace's big logical leap was realizing that

$$ \begin{align} \frac{P(p_{1} \mid b,N)}{P(p_{2} \mid b,N)} \propto \frac{P(b \mid p_{1},N)}{P(b \mid p_{2},N)} \end{align} $$

Why is this true? Remember Bayes rule: $$P(p \mid b, N) = \frac{P(b \mid p, N) P(p)}{P(b \mid N)}$$

Plugging that into the ratio of the posteriors $$ \begin{align} \frac{P(p_{1} \mid b,N)}{P(p_{2} \mid b,N)} &= \frac{P(b \mid p_1, N) P(p_1)}{P(b \mid N)} \times \frac{P(b \mid N)}{P(b \mid p_2, N) P(p_2)} \\ &= \frac{P(b \mid p_1, N) P(p_1)}{P(b \mid p_2, N) P(p_2)} \\ &\propto \frac{P(b \mid p_{1},N)}{P(b \mid p_{2},N)} \qquad\text{if } P(p_1) = P(p_2) \end{align} $$

So now that we understand Bayes rule, we can see that the ratio of the posteriors is proportional to the ratio of the likelihoods if we assume a uniform prior, i.e. if we give the two hypotheses the same a priori probability of being the correct hypothesis.

So let's make that assumption and calculate the ratio of the likelihoods in Python:

In [ ]:
likelihood_0 = scipy.stats.binom.pmf(b, N, 0.5)    # likelihood of model where p=0.5
likelihood_1 = scipy.stats.binom.pmf(b, N, p_mle)  # likelihood of model where p=MLE

print("The likelihood ratio is %.2E" % (likelihood_1/likelihood_0))

Oftentimes, likelihoods and likelihood ratios become very large or very small, to the point where there is a risk of overflow or underflow. Computers can only represent numbers with a finite number of digits, so a number that gets too small silently underflows to zero (and one that gets too large overflows). For this reason, likelihoods are often computed as log-likelihoods, which keeps the values within a range that a computer can manipulate.

The log-likelihood ratio is just the difference of the log-likelihoods. Another nice property of log-likelihood ratios is that they are positive when supporting model 1 and negative when supporting model 0.

In [ ]:
log_lik_0 = np.log(likelihood_0)
log_lik_1 = np.log(likelihood_1)
log_lik_ratio = log_lik_1 - log_lik_0

print("The log-likelihood ratio is %.2f" % log_lik_ratio)

ROC plots

ROC plots are a visualization tool used to compare the performance of binary classifiers, i.e. anything that tries to sort inputs into positives and negatives: enemy planes vs. friendly planes, pathogen sequences vs. sandmouse sequences.

A classifier scores each input and then decides whether the input is a positive or a negative based on whether the score is greater than some threshold.

For example, suppose we had a system that looks at the radar signature of a plane and assigns it a score meant to separate enemy planes from friendly planes. These are the scores we get for a set of known enemy and friendly planes:

In [ ]:
import matplotlib.pyplot as plt

# scores generated by our radar plane classification tool
enemy_scores = np.array([44,50,77,65,87,56,32,62,63,69])
friendly_scores = np.array([61,51,34,42,45,32,22,29,37,32])

# let's plot the scores to see how the distributions overlap
plt.hist(enemy_scores, alpha=0.3, color='red', bins=np.arange(0,100,5))
plt.hist(friendly_scores, alpha=0.3, color='blue', bins=np.arange(0,100,5))
plt.xlim((0,100))
plt.xlabel('radar scores')
plt.ylabel('counts')
plt.show()

An ROC curve characterizes how good a classifier is by scanning the threshold across all possible values and plotting how the false positive and true positive rates change.

  • True positive: how many of the positives were correct? How many of the planes above the threshold are really enemy planes?
  • False positive: how many of the positives were wrong? How many of the planes above the threshold are really friendly planes? We want to keep this number low... don't want to shoot down friendly planes.
  • True negative: how many of the negatives were correct? How many of the planes below the threshold are really friendly planes?
  • False negative: how many of the negatives were wrong? How many of the planes below the threshold are really enemy planes? We want this value to be low, don't want to let through enemy planes!
  • True positive rate = TP/(TP+FN). Of all the enemy planes, what proportion were above the threshold?
  • False positive rate = FP/(FP+TN). Of all the friendly planes, what proportion were above the threshold?

Suppose we set the threshold to be 0. What is the true positive rate? What is the false positive rate? Threshold at 90? Threshold at 60?

In [ ]:
# Suppose we set the threshold to be 60. 
threshold = 60

# What is the true positive rate?
tp = np.sum(enemy_scores > threshold)
tp_plus_fn = len(enemy_scores)
tp_rate = tp / tp_plus_fn
print('If the threshold is set at %i, then TP=%i, TP+FN=%i, and the true positive rate is %.1f' % 
      (threshold, tp, tp_plus_fn, tp_rate))

# What is the false positive rate?
fp = np.sum(friendly_scores > threshold)
fp_plus_tn = len(friendly_scores)
fp_rate = fp / fp_plus_tn
print('If the threshold is set at %i, then FP=%i, FP+TN=%i, and the false positive rate is %.1f' % 
      (threshold, fp, fp_plus_tn, fp_rate))
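
The same calculation at the extremes (reusing the score arrays above) shows what happens at the two ends of a threshold sweep:

In [ ]:
# at a threshold of 0 every plane is called an enemy; at 90 (above the highest score) none are
for threshold in [0, 90]:
    tp_rate = np.sum(enemy_scores > threshold) / len(enemy_scores)
    fp_rate = np.sum(friendly_scores > threshold) / len(friendly_scores)
    print('threshold %2i: true positive rate = %.1f, false positive rate = %.1f' % (threshold, tp_rate, fp_rate))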

As you can see, all ROC curves are anchored at (0,0) and (1,1), and good classifiers maintain a low false positive rate while having a high true positive rate for some range of thresholds.

To plot the ROC curve, we need to sweep across all values of the threshold and calculate the true positive and false positive rates.

In [ ]:
max_value = np.max([enemy_scores, friendly_scores])

# sweep the threshold from 0 up to (and including) the maximum score,
# so the curve runs all the way from (1,1) down to (0,0)
fp_rates = np.zeros(max_value + 1)        # initialize array to keep track of false positive rates
tp_rates = np.zeros(max_value + 1)        # initialize array to keep track of true positive rates

for threshold in range(max_value + 1):
    # calculating true positive rate
    tp = np.sum(enemy_scores > threshold)
    tp_plus_fn = len(enemy_scores)
    tp_rates[threshold] = tp / tp_plus_fn

    # calculating false positive rate
    fp = np.sum(friendly_scores > threshold)
    fp_plus_tn = len(friendly_scores)
    fp_rates[threshold] = fp / fp_plus_tn

plt.plot(fp_rates, tp_rates)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.xlim((0,1))
plt.ylim((0,1))
plt.show()

Playing around with joint probabilities

In the end of Sean's lecture notes, there is a nice example of how to manipulate joint probability tables. The example is the following:

Suppose we've done a single cell RNA-seq experiment, where we've treated cells with a drug (or left them untreated), and we've measured how often (in how many single cells) genes A and B are on or off. Our data consist of counts of 2000 individual cells:

               T=no                T=yes
          B=ON    B=off       B=ON    B=off
A=ON       180       80         10      360
A=off      720       20         90      540
In [ ]:
# let's represent these tables as 3D-numpy arrays
treat_no_counts = np.array([[180, 80],[720, 20]])
treat_yes_counts = np.array([[10, 360],[90, 540]])
counts = np.array([treat_no_counts, treat_yes_counts])

print('T=no')
print(counts[0])

print('\nT=yes')
print(counts[1])
Joint probability

The joint probabilities P(A,B,T) sum to one over everything. So normalize by dividing everything by 2000 (the total # of cells):

In [ ]:
# double check that the total number of cells is 2000
n_cells = np.sum(counts)
print('Total number of cells is %i' % n_cells)
In [ ]:
# normalize probabilities by dividing by 2000
joint = counts / n_cells

print('T=no')
print(joint[0])

print('\nT=yes')
print(joint[1])

np.sum(joint)
Marginalization

Any marginal distribution is a sum of the joint probabilities over all the variables you don't care about, leaving the ones you do. For example, to get $P(T)$, we sum over $A$,$B$: $P(T)=\sum_{A,B} P(A,B,T)$, which leaves:

In [ ]:
# let's get P(T) by marginalizing (summing) over A and B

print('P(T=no) = %.1f' % np.sum(joint[0]))
print('P(T=yes) = %.1f' % np.sum(joint[1]))
Conditional distributions
In [ ]:
# what if we wanted P(A,B | T=no)? Could get table for T=no and renormalize probabilities
joint_treat_no = joint[0]
joint_treat_no = joint_treat_no / np.sum(joint_treat_no)

print('P(A,B | T=no):')
print(joint_treat_no)

np.sum(joint_treat_no)

Marginalizing a conditional

If we marginalize a conditional distribution, the conditioning stays as it was. For example, $P(B∣T=\text{no})=\sum_{A}P(A,B∣T=\text{no})$

In [ ]:
# what if we wanted P(B | T=no)? Need to marginalize (sum) over A
print('P(B | T=no): ')
np.sum(joint_treat_no, axis=0)