Bayes rule, maximum likelihood, and the t-distribution, briefly

Kevin Mizes - 10/12/18

(You can download these notes as a jupyter notebook page)

My goal for this section is to do 2 things:

  1. Try to provide some basic intuition for the student's t-distribution
  2. Practice some pen and paper math with Bayes rule and maximum likelihood

Some basic practice using Bayes rule

Let's start off simple and work with a classic example problem I found on the internet (citation), similar to our homework from last week.

On our homework last week, we developed a method to tell apart sand mouse transcripts from pathogen transcripts. Let's say we ended up being 90% accurate when testing for pathogen, but had a 5% false positive rate. However, only 1% of reads are actually from the pathogen.

1. What is the probability that the test will come up positive?

$$ P(T=positive) = P(T=positive|R=sandmouse)P(sandmouse) + P(T=positive|R=pathogen)P(pathogen) = .05*.99 + .9 * .01 = 0.0585 $$
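Just to check that arithmetic (a tiny sketch in plain Python, plugging in the numbers from the problem statement):

In [ ]:
# total probability of a positive test, using the rates given above
p_pathogen = 0.01             # prior: fraction of reads from the pathogen
p_pos_given_pathogen = 0.90   # true positive rate
p_pos_given_mouse = 0.05      # false positive rate
p_positive = p_pos_given_mouse * (1 - p_pathogen) + p_pos_given_pathogen * p_pathogen
print(p_positive)             # 0.0585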

2. Given the positive result, what is the probability that we are actually reading from a pathogen? (Try it yourself, but answers are at the bottom...)

$$ P(R=pathogen | T=positive) = \;?$$

What about the probability that we're reading from a mouse if the test is positive?

Priors: Additive smoothing and the sunrise problem

Laplace posed a problem: "what is the probability that the sun will rise tomorrow?" While the answer may seem obvious to you or me, in probability theory we need to consider every possible outcome of an event (the sun could go nova or something overnight, right?). Additionally, say we have no observations at all. How can we compute the posterior then? In Laplace's solution (the rule of succession, if you want to see it derived), he found

$P(\text{sun will rise on day D+1}\; | \; \text{it has risen k times previously}) = \frac{k+1}{D+2} $

This solves a nice problem. On day $D=0$, when I've seen no sunrises, I can still make some statement about the question. Even though I've seen no sunrises, I still know about the possible outcomes (it comes up or it doesn't), and I will say $p = 0.5$. This is thanks to additive smoothing, where we add pseudocounts to each of the outcomes (I pretend I've seen one sunrise and one nova). If I know nothing about sunrises and have no observations, I can still make a probabilistic statement about an outcome. However, since I've been alive 24 years, I've seen 8760 sunrises, and I will say $p = 8761 / 8762 = 0.9998858708$ that the sun comes up tomorrow. So based on past observations, I can write down a probability that the sun will come up.
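Here is the rule of succession as a quick function, just to check the numbers above (a tiny sketch; the function name is my own):

In [ ]:
def rule_of_succession(k, D):
    """Laplace's rule of succession: P(success on day D+1 | k successes in D days)."""
    return (k + 1) / (D + 2)

print(rule_of_succession(0, 0))        # 0.5, with no observations at all
print(rule_of_succession(8760, 8760))  # ~0.99988587, after 24 years of sunrises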

However, Laplace makes another comment on this point: although this probability is high, it still seems too low to someone who understands more about the mechanics of the sun and how rarely suns actually go nova. In other words, we aren't using all of our prior information. A final point is that, with this smoothing and with prior information (I know that going nova is an option), we are never 100% sure of anything, though we can get pretty close (in fact, there's a term for this: almost surely).

The sun going nova and priors: another example of Bayes rule

(Comic omitted: a detector tells us whether the sun has gone nova, but it lies with probability 1/36.)

Well, now that we know something about Bayes' theorem, let's try calculating the posterior!

Find

$$ P(sun=nova \; |\; detector=yes) $$

Maximum likelihood

Last week Kate showed you how to do this in Python, but this week let's derive the MLE analytically. First, what is our problem? Often we can write down the probability of the data $D$ given some parameters $\theta$. In statistical inference, we want to infer the parameters from the observed data, and it is often useful to have an analytical solution.

A numerical example

Let's start with a problem from last week. Suppose there are 5 births ($N=5$), and 2 are boys ($b=2$). Last week we found $P(B=2 | N=5, p=.5)$. But now I want to know: what is the most likely value of $p$ given our observations? To do this, we want to find the value $\hat{p}$ that maximizes the likelihood $P(data|p)$. (If you remember your calculus, this is done by setting the derivative equal to 0.)

$$ \begin{aligned} P(data | p) &= P(boys=2, N=5 | p)\\ &= {5 \choose 2} p^2 (1-p)^{5-2} \\ \frac{d}{dp} P(data | p) &= \frac{d}{dp}{5 \choose 2} p^2 (1-p)^{3} = 0\\ &\rightarrow {5 \choose 2}\left[2p (1-p)^{3} - 3 p^2 (1-p)^2\right] = 0 \\ &\rightarrow 2(1-p) = 3p \rightarrow \hat{p} = 2/5 \end{aligned} $$

So we've found our MLE, which happens to align with what we might have come up with based on intuition alone.
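If you'd rather not take the derivative, a quick numerical check is to evaluate the likelihood on a grid of $p$ values and take the argmax (a minimal sketch, assuming numpy and scipy are available):

In [ ]:
import numpy as np
from scipy.stats import binom

# evaluate the likelihood P(boys=2 | N=5, p) on a grid and take the argmax
p_grid = np.linspace(0, 1, 1001)
likelihood = binom.pmf(2, 5, p_grid)
print(p_grid[np.argmax(likelihood)])  # ~0.4, matching p_hat = 2/5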

MLE for Binomial using log-likelihoods

That was great and all, but can we be more general? Let's say we had $N$ births, and $k$ were boys. What is the value $\hat{p}$ that maximizes our likelihood $P(data|p)$? In this case we're going to find the maximum of the log-likelihood instead of the likelihood. Since the logarithm is a strictly increasing function, the location of the maximum is exactly the same! (Thus we use log-likelihoods not only to avoid underflow errors on computers, but also to make the math nicer when dealing with probability distributions.)

$$ \begin{aligned} P(data| p) &= P(N, k | p) = {N \choose k} p^k (1-p)^{N-k}\\ log P(data|p) &= log({N \choose k}) + k log(p) + (N-k) log(1-p)\\ \frac{d}{dp} log P(data|p) &= \frac{k}{p} - \frac{N-k}{1-p} = 0 \\ &\rightarrow \hat{p} = \frac{k}{N} \end{aligned} $$

Great, this looks like what we'd expect.
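The same answer falls out if we maximize the log-likelihood numerically; here is a sketch using scipy.optimize with some made-up counts:

In [ ]:
import numpy as np
from scipy.optimize import minimize_scalar

N, k = 100, 37   # made-up counts, purely for illustration
negloglik = lambda p: -(k*np.log(p) + (N-k)*np.log(1-p))  # constant log(N choose k) dropped
res = minimize_scalar(negloglik, bounds=(1e-6, 1-1e-6), method='bounded')
print(res.x, k/N)  # both ~0.37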

MLE for Normal distribution

Ok, let's try something slightly harder and see what we get.

$$ \begin{aligned} P(data| p) &= P(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} exp(-\frac{(x-\mu)^2}{2\sigma^2}) \end{aligned} $$

But now we have data points $(x_1, \ldots, x_N)$. Since each point is an independent sample, our likelihood function becomes

$$ \begin{aligned} L(x | \mu, \sigma^2) &= \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} exp(-\frac{(x_i-\mu)^2}{2\sigma^2}) \\ &= \frac{1}{\sqrt{2\pi\sigma^2}^N} exp(- \sum_{i=1}^N \frac{(x_i-\mu)^2}{2\sigma^2}) \\ log L(x| \mu, \sigma^2) &= -\frac{N}{2}log(2\pi\sigma^2) - \sum_{i=1}^N \frac{(x_i-\mu)^2}{2\sigma^2} \end{aligned} $$

This time we have two parameters, $\mu$ and $\sigma^2$, that we are looking for. We'll take the derivative with respect to each parameter separately,

$$ \begin{aligned} \frac{d}{d\mu} log L(x|\mu, \sigma^2) &= \sum_{i=1}^N \frac{(x_i-\mu)}{\sigma^2} = 0\\ &\rightarrow \sum_{i=1}^N x_i = N \mu \\ \hat{\mu} &= \frac{\sum_{i=1}^N x_i}{N} \\ \frac{d}{d\sigma^2} log L(x|\mu, \sigma^2) &= -\frac{N}{2\sigma^2} + \sum_{i=1}^N \frac{(x_i-\mu)^2}{2(\sigma^2)^2} = 0 \\ \hat{\sigma^2} &= \frac{1}{N}\sum_{i=1}^N (x_i-\mu)^2 \end{aligned} $$
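These two estimators are exactly what np.mean and the biased (ddof=0) version of np.var compute, which we can confirm on simulated data (a quick sketch):

In [ ]:
import numpy as np

x = np.random.normal(2.0, 3.0, size=1000)   # simulated data with mu=2, sigma=3
mu_hat = np.sum(x) / len(x)
var_hat = np.sum((x - mu_hat)**2) / len(x)
print(mu_hat, np.mean(x))                   # identical
print(var_hat, np.var(x))                   # identical (np.var defaults to the biased ddof=0 estimate)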

A biological example: Hardy-Weinberg

Let's say a gene has two alleles (A and a), which occur with probability $p$ for A and $(1-p)$ for a. Thus, if alleles pair at random in a population:

|           | A ($p$)       | a ($1-p$)      |
| --------- | ------------- | -------------- |
| A ($p$)   | AA ($p^2$)    | Aa ($p(1-p)$)  |
| a ($1-p$) | Aa ($p(1-p)$) | aa ($(1-p)^2$) |

Now I want to find the probability $p$ of allele A occurring. I've sampled from the population, and found that genotype AA shows up in $k_1$ people, Aa in $k_2$ people, and aa in $k_3$ people. What is the MLE estimate of $p$?

Solution: $$ \hat{p} = \frac{2k_1 + k_2}{2k_1 + 2k_2 + 2k_3} $$
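In code, the estimator is just allele counting; here is a minimal sketch with made-up genotype counts:

In [ ]:
# MLE of the A allele frequency from genotype counts (made-up numbers)
k1, k2, k3 = 50, 40, 10   # counts of AA, Aa, aa individuals
p_hat = (2*k1 + k2) / (2*(k1 + k2 + k3))
print(p_hat)              # 0.7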

The effect of priors: binomial revisited

So far, we've just been looking at the likelihood, which gives us the maximum likelihood estimate. However, with our Bayesian formulation, we can look more explicitly at our parameters through the posterior, where $P(\theta | data) \propto P(data | \theta) P(\theta)$. When we look for the most likely parameter using the posterior, the estimate is called a maximum a posteriori (MAP) estimate. In some cases (e.g., the homework), the prior is simply a uniform distribution, and when we do our maximization the constant prior term just drops away. In other cases, however, the prior matters.

Let's return to the binomial distribution, but this time we want to find the MAP estimate of $p$. We will use a common prior for the binomial distribution, called the beta distribution. Recall that for this problem our data are counts ($k$, $N$) and our parameter is a probability $p$. We are also introducing two hyperparameters, $\alpha$ and $\beta$, which are part of the beta distribution (they'll make intuitive sense in a minute).

$$ \begin{aligned} P(\theta| D) &= P(D| \theta) P(\theta) / P(D) \\ P(p | N,k) &= P(N,k | p) P(p | \alpha, \beta) / P(D) \\ \end{aligned} $$

Let's write out the likelihood and the prior $$ \begin{aligned} P(N,k | p) &= {N \choose k} p^k (1-p)^{(N-k)} \\ P(p | \alpha, \beta) &= \frac{1}{B(\alpha, \beta)} p^{\alpha-1} (1-p)^{\beta-1} \end{aligned} $$

Don't worry about the normalization term $B(\alpha, \beta)$. Since we are taking an argmax, we can ignore terms that don't depend on $p$! This includes the $P(D)$ term in the denominator. From here on, I'm going to ignore these terms and just use a normalization constant Z. On to the posterior and the argmax.

$$ \begin{aligned} P(p | N,k) &= \frac{1}{Z} p^{k+\alpha-1} (1-p)^{N-k +\beta-1} \\ log P(p|N,k) &= (k+\alpha -1) log(p) + (N-k+\beta-1) log(1-p) - log(Z)\\ \frac{d}{dp} log P(p|N,k) &= \frac{k + \alpha -1}{p} - \frac{N-k+\beta-1}{1-p} = 0\\ &\rightarrow \hat{p} = \frac{k+\alpha-1}{N+\alpha+\beta-2} \end{aligned} $$

Looks like when we include the prior with $\alpha = \beta = 2$, Laplace's additive smoothing shows up through the hyperparameters: $\hat{p} = \frac{k+1}{N+2}$.
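Numerically, the prior pulls small-sample estimates toward $1/2$. Here is a quick comparison of the MLE and the MAP estimate under a Beta(2, 2) prior, reusing the 2-boys-in-5-births example (a sketch):

In [ ]:
# MLE vs. MAP with a Beta(alpha, beta) prior, for the 2-boys-in-5-births data
N, k = 5, 2
alpha, beta = 2, 2
p_mle = k / N
p_map = (k + alpha - 1) / (N + alpha + beta - 2)
print(p_mle, p_map)       # 0.4 vs ~0.429 -- the prior pulls the estimate toward 0.5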

Student's t-distribution

Often, we find ourselves sampling data from a Gaussian distribution. However (as you might find on your homework), directly trying to fit a Gaussian distribution to a small number of samples can give us inaccurate parameter estimates!

This occurs because we are taking a finite number of samples from the distribution, and there is some inherent variance in our samples that we are not modeling when we directly fit a Gaussian distribution.

From the MLE example above, for $n$ samples $x_1, \ldots, x_n$ of normally distributed data, we found that:

$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i$

$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2 $

For the t-distribution, we instead estimate the sample mean and sample variance from the data:

$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$

$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 $

Note: where does the $n-1$ come from? Unbiased estimator and explanation here. Short answer: if you calculate the variance using the MLE formula above, its expected value is $\frac{n-1}{n}\sigma^2$, so the estimate is biased low by a factor of $\frac{n-1}{n}$.
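You can see this bias in a quick simulation: with $n=5$ and $\sigma^2 = 1$, the MLE variance averages to about $\frac{n-1}{n} = 0.8$ rather than 1 (a sketch, assuming numpy):

In [ ]:
import numpy as np

n = 5
draws = np.random.normal(0, 1, size=(100000, n))
mle_var = draws.var(axis=1, ddof=0)         # divide by n (the MLE)
unbiased_var = draws.var(axis=1, ddof=1)    # divide by n-1
print(mle_var.mean(), unbiased_var.mean())  # ~0.8 vs ~1.0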

For $n$ i.i.d. samples $x_1, \ldots, x_n$ from $N(\mu, \sigma^2)$, $Z = \frac{\bar{x}-\mu}{\sigma / \sqrt{n}}$ will be a standard normal $N(0, 1)$, but $T = \frac{\bar{x}-\mu}{s / \sqrt{n}}$ follows the t-distribution instead. This is because $s$ is no longer a fixed parameter but a random variable (more specifically, $(n-1)s^2/\sigma^2$ follows a $\chi^2$ distribution with $n-1$ degrees of freedom).

Let's take a look ourselves! We'll generate some random samples from a normal distribution, compute the $Z$ and $T$ statistics, and compare them to the standard normal and Student's t distributions.

In [109]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
In [110]:
# Distribution parameters
mu, sigma = 0,1
n = 5
Z = []
T = []
pvalue = []
S_all = []
xbar_all = []
for iter in range(5000):
    # drawing some samples
    samples = np.random.normal(mu, sigma, n)
    
    # calculate the sample mean and sample variance
    xbar = np.mean(samples)
    S = np.std(samples) # this isn't quite right, we're getting a biased estimate!
    S2 = sum([(x-xbar)**2 for x in samples])/(n-1) # that's better: the unbiased sample variance
    S = np.sqrt(S2) # sample standard deviation, which goes into the T statistic
    
    # Let's calculate our standard random variables
    Z.append( (xbar-mu) / (sigma/np.sqrt(n)) )
    T.append( (xbar-mu) / (S/np.sqrt(n)) )
    
    # and track some other interesting variables of note
    pvalue.append(stats.norm.cdf((xbar-mu) / (sigma/np.sqrt(n)) ))
    S_all.append(S2)
    xbar_all.append(xbar)
    
    
fig, ax = plt.subplots()
ax.hist(Z, alpha=.5, bins=np.linspace(-8,8,100), label='Z', density=True)
ax.hist(T, alpha=.5, bins=np.linspace(-8,8,100), label='T', density=True)
ax.set_ylabel('norm count')
ax.set_xlim(-8,8)
ax.legend()
Out[110]:
<matplotlib.legend.Legend at 0x157b77f0>

Hey! Looks like when we use the sample standard deviation instead of the population standard deviation, we get a different distribution! Why is this? Let's look at the distribution of our sample mean and sample variance.

In [111]:
fig, (ax1,ax2) = plt.subplots(1,2)
b=ax1.hist(xbar_all, density=True, bins=100)
ax1.set_title('sample mean distribution')
ax1.set_ylabel('norm counts')
b=ax2.hist(S_all, density=True, bins=100)
b=ax2.set_title('sample variance distribution')

Ok, so it looks like our sample mean follows a normal distribution. This isn't surprising, since we're sampling from a normal distribution and summing the results, and the sum of independent normal variables is also normal (proof). However, the sample variance follows a different distribution. It turns out that if we square a standard normal random variable, we get a chi-square ($\chi^2$) random variable; summing them gives $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$. And if you work through the math, a standard normal random variable divided by the square root of an independent chi-square random variable (scaled by its degrees of freedom) gives a new random variable that follows what we call the t-distribution.
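If you want to check the $\chi^2$ claim directly, here is a minimal sketch that reuses S_all, n, and sigma from the cells above (so it assumes the simulation loop has already run) and overlays the histogram of $(n-1)s^2/\sigma^2$ on the $\chi^2_{n-1}$ pdf:

In [ ]:
# Overlay the scaled sample variance (n-1)*S^2/sigma^2 on a chi-square pdf with n-1 dof
fig, ax = plt.subplots()
scaled = [(n-1)*s2/sigma**2 for s2 in S_all]
ax.hist(scaled, density=True, bins=100, alpha=.5, label='(n-1)S^2/sigma^2')
xs = np.linspace(0, 20, 200)
ax.plot(xs, stats.chi2.pdf(xs, n-1), lw=2, label='chi2, df = n-1')
ax.set_ylabel('norm count')
ax.legend()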

Finally, it is important to note that as our samples get larger, the t-distribution approaches a Gaussian.

In [112]:
fig, ax = plt.subplots()

x = np.linspace(-5, 5, 100)
ax.plot(x, stats.norm.pdf(x), lw=2, alpha=0.6, label='normal')
dfs = [1, 2, 10, 100]
for df in dfs:
    ax.plot(x, stats.t.pdf(x, df), lw=2, alpha=0.4, label='df = {}'.format(df))
ax.set_ylabel('probability density')
ax.legend()
Out[112]:
<matplotlib.legend.Legend at 0x15796898>

Finally, for fun, let's verify what Sean has been telling us about these p-values being uniformly distributed. Recall that the p-value here is $P(X \le x \mid H)$, the area under the pdf of your distribution to the left of the observed value. Another way to get it is to just look up the value on the CDF, which I have done above as:

stats.norm.cdf((xbar-mu) / (sigma/np.sqrt(n)) )

In [113]:
# make p value plots uniform from above
fig, ax = plt.subplots()
b=ax.hist(pvalue, density=True, bins=100)

What do we use the t-distribution for? Most commonly, it replaces the Gaussian distribution in hypothesis testing when we have a sample variance instead of a population variance.
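For example, a one-sample t-test uses exactly the $T$ statistic above; here is a sketch on simulated data, reusing the stats and numpy imports from earlier:

In [ ]:
# One-sample t-test: is the mean of these (simulated) measurements different from 0?
measurements = np.random.normal(0.5, 1.0, size=10)
t_stat, p_val = stats.ttest_1samp(measurements, popmean=0)
print(t_stat, p_val)

# the same statistic computed by hand, with the sample standard deviation
xbar, s = np.mean(measurements), np.std(measurements, ddof=1)
print((xbar - 0) / (s / np.sqrt(len(measurements))))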

Answer key

Some basic practice using Bayes rule

2.

$$ \begin{aligned} P(R=pathogen | T=positive) &= \frac{P(T=positive | R=pathogen) P(R=pathogen)}{P(T=positive|R=sandmouse)P(sandmouse) + P(T=positive|R=pathogen)P(pathogen)}\\ &= \frac{.9 * .01}{.05*.99 + .9 * .01 } = .1538 \end{aligned} $$
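A quick check of that arithmetic:

In [ ]:
# posterior probability that a positive read is actually pathogen
p_positive = 0.05*0.99 + 0.9*0.01
print(0.9*0.01 / p_positive)   # ~0.1538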

Sun going nova

$$ \begin{aligned} P(sun=nova \; |\; detector=yes) &= \frac{P(detector=yes\; |\; sun=nova) P(sun=nova)}{P(detector=yes)} \\ &= \frac{P(detector=yes\; |\; sun=nova) P(sun=nova)}{P(detector=yes\;|\;sun=nova)P(sun=nova) + P(detector=yes\;|\;sun=normal)P(sun=normal)} \end{aligned} $$

Let's plug in numbers using what I got from Laplace's sunrise problem above

$$ \begin{aligned} P(sun=nova \; |\; detector=yes) &= \frac{(35/36) (1/8762)}{(35/36) (1/8762) + (1/36) (8761/8762)} \\ &= 0.0039790814 \end{aligned} $$
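And a quick check of that arithmetic:

In [ ]:
# posterior probability that the sun has gone nova, given the detector says yes
prior_nova = 1/8762
p = (35/36)*prior_nova / ((35/36)*prior_nova + (1/36)*(1 - prior_nova))
print(p)                       # ~0.00398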

Hardy-Weinberg

$$ \begin{aligned} P(k_1, k_2, k_3 | p) &= \frac{1}{Z} p^{2 k_1} (2p(1-p))^{k_2} (1-p)^{2 k_3}\\ log P(k_1, k_2, k_3 | p) &= -logZ + 2k_1 log(p) + k_2 log(2) + k_2 log(p) + k_2 log(1-p) + 2 k_3log(1-p)\\ \frac{d}{dp}logP &= \frac{2k_1}{p} + \frac{k_2}{p} - \frac{k_2}{1-p} - \frac{2 k_3}{1-p} = 0\\ &\rightarrow \hat{p} = \frac{2k_1 + k_2}{2k_1 + 2k_2 + 2k_3} \end{aligned} $$