Recitation Section Week 5: Probability Review Part 2, by James Xue

(If you're looking at this on the web - you can also download these notes as a Jupyter notebook file.)

Assume that $x_1, \ldots, x_n$ are i.i.d. (independent and identically distributed) draws from an unknown Bernoulli distribution with parameter $\theta$.

Recall that the Bernoulli distribution has the probability mass function $$ \begin{align} P(x \mid \theta) = \theta^{x}{(1-\theta)}^{1-x} \end{align} $$ for $x \in \{0, 1\}$.
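As a quick sanity check, we can evaluate this pmf with scipy (the value $\theta = 0.3$ below is just an arbitrary choice for illustration):

from scipy import stats

theta = 0.3  #arbitrary value, just for illustration
print(stats.bernoulli.pmf(1, theta))  #theta     -> 0.3
print(stats.bernoulli.pmf(0, theta))  #1 - theta -> 0.7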

We will assume that $\theta$ has a beta prior with parameters $\alpha$ and $\beta$. That is, $\theta \sim Beta(\alpha, \beta)$.

The Beta distribution has the density $$Beta(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}, \qquad 0 \le \theta \le 1,$$

where $$B(\alpha, \beta) = \int_0^1 \theta^{\alpha-1} (1-\theta)^{\beta-1} d\theta $$

This is what a Beta distribution looks like for varying values of $\alpha$ and $\beta$ (see the density plots on the Wikipedia page for the Beta distribution; the image is not reproduced here).
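If you want to reproduce similar curves yourself, here is a minimal sketch; the $(\alpha, \beta)$ pairs below are arbitrary choices, not necessarily the ones in the Wikipedia figure:

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

#plot the Beta density over a grid of theta values for a few (alpha, beta) pairs
thetaGrid = np.linspace(0.001, 0.999, 200)
for alphaVal, betaVal in [(0.5, 0.5), (1, 1), (2, 2), (2, 5), (5, 1)]:
    plt.plot(thetaGrid, stats.beta.pdf(thetaGrid, alphaVal, betaVal),
             label="alpha={0}, beta={1}".format(alphaVal, betaVal))
plt.legend()
plt.xlabel("theta")
plt.ylabel("density")
plt.show()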

Here we see that using the Beta distribution gives us a range of different priors that we can play with.

How do we derive $p(\theta|x_1, ..., x_n)$?

$$ \begin{align*} p(\theta \mid x_1, \ldots, x_n) = & \frac{p(x_1, \ldots, x_n \mid \theta)\ p(\theta)}{p(x_1, \ldots, x_n)} && \text{by Bayes' rule} \\ = & \frac{p(x_1, \ldots, x_n \mid \theta)\ p(\theta)}{\int p(x_1, \ldots, x_n, \theta)\, d\theta} && \text{by marginalization}\\ = & \frac{p(x_1, \ldots, x_n \mid \theta)\ p(\theta)}{\int p(x_1, \ldots, x_n \mid \theta)\ p(\theta)\, d\theta} && \text{by the conditional formula } p(A,B)=p(A \mid B)\,p(B)\\ = & \frac{\prod_m p(x_m \mid \theta)\ p(\theta)}{\int \prod_m p(x_m \mid \theta)\ p(\theta)\, d\theta} && \text{by the assumption of i.i.d. draws} \\ \end{align*} $$

Now, we know $p(\theta)$, which is our Beta prior, and we know that $p(x \mid \theta)$ is our Bernoulli likelihood. If you work out the math (I will cheat and assume that I already did), you get $$p(\theta \mid x_1, \ldots, x_n) = Beta\Big(\alpha+\sum_{i=1}^{n} x_i,\ \beta+n-\sum_{i=1}^{n} x_i\Big).$$
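For completeness, here is the step we skipped: the numerator of the last line above is, as a function of $\theta$, $$\prod_{m=1}^{n} \theta^{x_m}(1-\theta)^{1-x_m}\ \cdot\ \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)} \ \propto\ \theta^{\alpha+\sum_i x_i-1}\,(1-\theta)^{\beta+n-\sum_i x_i-1},$$ which is the kernel of a $Beta(\alpha+\sum_i x_i,\ \beta+n-\sum_i x_i)$ density; the denominator is just the normalizing constant.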

Notice that the prior distribution $p(\theta)$ has the same distributional form as the posterior $p(\theta \mid x_1, \ldots, x_n)$ - that is, both are Beta distributions. Formally, in Bayesian statistics this type of relationship is known as conjugacy: the Beta distribution is the conjugate prior for the Bernoulli likelihood.

Note that if instead $\theta$ did not have a continuous prior distribution but a discrete one (say, over a grid of values $\theta_i$), then the integral would become a summation - i.e., for the last line, we would have the following instead:

$$p(\theta_i \mid x_1, \ldots, x_n) = \frac{\prod_m p(x_m \mid \theta_i)\ p(\theta_i)}{\sum_j \prod_m p(x_m \mid \theta_j)\ p(\theta_j)} $$
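As a concrete toy illustration of the discrete case (the grid, prior, and observations below are all made up for this example):

import numpy as np

thetaGrid = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   #discrete candidate values for theta
prior = np.ones_like(thetaGrid) / len(thetaGrid)  #uniform prior over the grid
samples = np.array([0, 1, 0, 0, 1, 0, 0, 0])      #made-up Bernoulli observations

#likelihood of the whole sample under each candidate theta
likelihood = np.array([np.prod(theta**samples * (1 - theta)**(1 - samples))
                       for theta in thetaGrid])

posterior = likelihood * prior
posterior /= posterior.sum()                      #the summation replaces the integral
print(posterior)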

Now, let's play around with our newly derived posterior!

In [103]:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

#we will first draw 100 samples from a Bernoulli with true parameter theta = 0.3
theta = 0.3
n = 100
rand_samples = stats.bernoulli.rvs(theta, size=n)

#lets take a look at our random samples drawn
print("random samples:")
print(rand_samples)

#lets also use a Beta(1,1) prior; derive the posterior parameters from the conjugate update above and graph it
priorAlphaValue, priorBetaValue = 1, 1
newAlphaValue = priorAlphaValue + np.sum(rand_samples)
newBetaValue = priorBetaValue + n - np.sum(rand_samples)

xAxisVals = np.linspace(stats.beta.ppf(0.01, newAlphaValue, newBetaValue), stats.beta.ppf(0.99, newAlphaValue, newBetaValue), 100)
plt.plot(xAxisVals, stats.beta.pdf(xAxisVals, newAlphaValue, newBetaValue), '-', label="Posterior")
plt.plot((0.3, 0.3), (0, 10), '--', label="True Theta")
plt.legend()
random samples:
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0
 1 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1
 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0]

Not bad! We see that the posterior distribution is centered pretty close to the true parameter $\theta$.

In [94]:
#Lets use the specific alpha, beta values shown in the Beta distribution figure above AND use a range of sample sizes from 3 to 500

fig, ax = plt.subplots(5, 5, sharex=True, sharey=True)

priorValues = [[0.5, 0.5], [5, 1], [1, 3], [2, 2], [2, 5]]
sampleSizes = [3, 10, 50, 100, 500]

#draw the random samples once; each row uses the first n of them
rand_samples = stats.bernoulli.rvs(theta, size=sampleSizes[len(sampleSizes)-1])

priorColorMap = {0: 'r', 1: 'b', 2: 'g', 3: 'm', 4: 'orange'}

for ind in range(len(sampleSizes)):
    #given the sample size, extract the samples from the superset
    n = sampleSizes[ind]
    subsetSamples = rand_samples[:n]
    for ind2 in range(len(priorValues)):
        #derive the new alpha and beta values from the posterior form given above
        priorAlphaValue, priorBetaValue = priorValues[ind2]
        newAlphaValue = priorAlphaValue + np.sum(subsetSamples)
        newBetaValue = priorBetaValue + n - np.sum(subsetSamples)
        color = priorColorMap[ind2]
        #plot the subfigure
        xAxisVals = np.linspace(stats.beta.ppf(0.01, newAlphaValue, newBetaValue), stats.beta.ppf(0.99, newAlphaValue, newBetaValue), 100)
        ax[ind, ind2].plot(xAxisVals, stats.beta.pdf(xAxisVals, newAlphaValue, newBetaValue), color=color, linestyle='-')
        ax[ind, ind2].axvline(x=0.3, color='k', linestyle='--')
        #for setting the labels
        if ind2 == 0:
            ax[ind, ind2].set_ylabel("n={0}".format(n))
        if ind == len(ax)-1:
            ax[ind, ind2].set_xlabel("alpha={0}, beta={1}".format(priorAlphaValue, priorBetaValue))

A common question that arises in Bayesian inference is: how important is the prior? The short answer is that, given enough data, the prior isn't very important.

Here, we see that with more and more data, the data "dominates" the prior - that is, under certain mathematical conditions, the posterior distribution converges to one concentrated at the true theta value, independent of the prior distribution used. To see why this holds theoretically, check out the Bernstein-von Mises theorem.

In the last homework we used a 2nd-order Markov model to evaluate the probability of sequences, but what was its link to the full probability model?

Assume that $A_1A_2A_3A_4A_5$ is your sequence of interest (that is, each $A_i$ is one of the bases A, T, C, or G). Now, we want to find the probability of specific sequences, $P(A_1A_2A_3A_4A_5)$, such as $P(ATCGC)$ or $P(AAAAA)$. How do we go about doing that? Let's try to derive the probability equation!

$$ \begin{align*} p( A_1,... A_5) = & p(A_5, A_4, A_3, A_2 \mid A_1)\ p(A_1)\\ = & p(A_5, A_4, A_3 \mid A_1, A_2)\ p(A_2 \mid A_1)\ p(A_1) \\ = & p(A_5, A_4 \mid A_1, A_2, A_3)\ p(A_3 \mid A_1, A_2)\ p(A_2 \mid A_1)\ p(A_1) \\ = & p(A_5 \mid A_1, A_2, A_3, A_4)\ p(A_4 \mid A_1, A_2, A_3) p(A_3 \mid A_1, A_2)\ p(A_2 \mid A_1)\ p(A_1) \\ \end{align*} $$

Note that going from line to line, we just repeatedly apply the product rule $P(X, Y \mid Z)=P(X\mid Y, Z)\ P(Y \mid Z)$ to the first factor to the right of the equals sign. For example, to go from the first to the second line, I used $p(A_5, A_4, A_3, A_2 \mid A_1)=p(A_5, A_4, A_3 \mid A_1, A_2)\ p(A_2 \mid A_1)$.

Woah, we notice a pattern in the above equation! More generally, this is known as the chain rule of probability:

$$p(A_1, A_2, ..., A_n)=\prod_{k=1}^{n}p(A_k|A_1, ..., A_{k-1})$$

The above can be proved by mathematical induction.
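A quick sketch of the inductive step: assuming the rule holds for $n-1$ variables, $$p(A_1, \ldots, A_n) = p(A_n \mid A_1, \ldots, A_{n-1})\ p(A_1, \ldots, A_{n-1}) = p(A_n \mid A_1, \ldots, A_{n-1}) \prod_{k=1}^{n-1} p(A_k \mid A_1, \ldots, A_{k-1}),$$ which is exactly the product up to $n$; the base case $n=1$ is immediate.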

In the previous homework, we assumed a 2nd-order Markov model, i.e., that the current base depends only on the two bases immediately before it; thus the last equation is approximated by the following:

$ p( A_1,... A_5) \approx p(A_5 \mid A_3, A_4)\ p(A_4 \mid A_2, A_3) p(A_3 \mid A_1, A_2)\ p(A_2 \mid A_1)\ p(A_1)$
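To make the approximation concrete, here is a minimal sketch of how one might score a sequence under a 2nd-order Markov model; the probability tables p_init, p_first, and p_second below are uniform placeholders, not the tables estimated in the homework:

import numpy as np

bases = "ACGT"
#made-up, uniform placeholder tables; a real model would estimate these from data
p_init = {b: 0.25 for b in bases}                                           #p(A_1)
p_first = {(a, b): 0.25 for a in bases for b in bases}                      #p(A_2 | A_1)
p_second = {(a, b, c): 0.25 for a in bases for b in bases for c in bases}   #p(A_k | A_{k-2}, A_{k-1})

def markov2_log_prob(seq):
    #accumulate log probabilities to avoid underflow on long sequences
    logp = np.log(p_init[seq[0]])
    if len(seq) > 1:
        logp += np.log(p_first[(seq[0], seq[1])])
    for k in range(2, len(seq)):
        logp += np.log(p_second[(seq[k-2], seq[k-1], seq[k])])
    return logp

print(markov2_log_prob("ATCGC"))   #under the uniform placeholder tables this is 5*log(0.25)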

Let $X_1, ...., X_n$ be independently drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$.

Let $\bar{X}=\frac{1}{n} \sum_{i=1}^n X_i$

Let $S^{2}=\frac{1}{n-1} \sum_{i=1}^n (X_i-\bar{X})^2$

Sean mentioned in class that $\frac{\bar{X}-\mu}{S/\sqrt{n}}$ is t-distributed with $\nu = n-1$ degrees of freedom. The t-distribution has the pdf $$\frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\ \Gamma(\frac{\nu}{2})} \left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}} $$

This looks pretty complicated and, notably, the t-distribution doesn't seem to depend on $\mu$ and $\sigma^2$, even though the original data were generated from a normal with these parameters. You, being the skeptic, decide to test out his claim:

In [107]:
#draw n random samples from a normal distribution with mean mu and variance sigma^2
#mu, sd, and n are arbitrary choices here; as discussed below, the statistic's distribution does not depend on them
mu, sd, n = 0, 1, 10
df = n - 1

#repeat the experiment 1000 times, computing the statistic (Xbar - mu)/(S/sqrt(n)) each time
numIterations = 1000
all_statistics = []

for ind in range(numIterations):
    x_vals = np.random.normal(mu, sd, n)
    #calculate the numerator - Xbar minus mu
    xbar = np.mean(x_vals)
    #calculate the denominator - S/sqrt(n)
    S = np.sqrt(1/(n-1)*sum([(x_val_iter-xbar)**2 for x_val_iter in x_vals]))
    all_statistics.append((xbar - mu)/(S/np.sqrt(n)))

print("What the data looks like:")
print(all_statistics[:10])
#plot a histogram of the statistic just calculated and see how close it is to the t-distribution
plt.figure()
plt.hist(all_statistics, alpha=0.5, bins=50, density=True)
xAxisVals = np.linspace(stats.t.ppf(0.001, df), stats.t.ppf(0.999, df), 100)
plt.plot(xAxisVals, stats.t.pdf(xAxisVals, df), '-', label="True T distribution")
plt.legend()
What the data looks like:
[1.7066295893477978, 3.3342072827578781, -0.4799627556882175, -1.1409465147667479, 1.0693070933117783, 0.84861606588780913, -0.23508053386915861, 0.39006626473583011, -0.13111684191674525, -0.21979336482270478]

OK, this looks pretty similar to a t-distribution, so you believe Sean more now. More importantly, note that as we increase the number of iterations the histogram converges more and more to the t-distribution, and that this holds regardless of how we set $n$ (the number of samples drawn from the normal in each iteration) - try playing around with different values of $n$!

This is one way that the t-distribution arises. In the homework, you will see that the t-distribution also arises from a Bayesian way of thinking: it is the posterior distribution of $\mu$ after marginalizing out the contribution of $\sigma^2$.
