### Recitation Section Week 5: Probability Review Part 2 by James Xue

(If you're looking at this on the web - you can also download these notes as a Jupyter notebook file.)

Assume that $x_1, \ldots, x_n$ are i.i.d. (independent and identically distributed) draws from an unknown Bernoulli distribution with parameter $\theta$.

Recall that the Bernoulli distribution has the form: \begin{align} P(x \mid \theta) = \theta^{x}{(1-\theta)}^{1-x} \end{align}

We will assume that $\theta$ has a beta prior with parameters $\alpha$ and $\beta$. That is, $\theta \sim Beta(\alpha, \beta)$.

The beta distribution has the density $$p(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1} }{B(\alpha, \beta)}$$

where $$B(\alpha, \beta) = \int_0^1 \theta^{\alpha-1} (1-\theta)^{\beta-1} d\theta$$
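As a quick sanity check (not part of the original notes), we can confirm numerically that this integral matches `scipy.special.beta`, which computes $B(\alpha, \beta)$ in closed form:

```python
from scipy import integrate, special

#numerically integrate theta^(a-1) * (1-theta)^(b-1) over [0, 1]
a, b = 2.0, 5.0
numeric, _ = integrate.quad(lambda t: t**(a - 1) * (1 - t)**(b - 1), 0, 1)

#compare against the closed form: B(2, 5) = Gamma(2)Gamma(5)/Gamma(7) = 1/30
print(numeric, special.beta(a, b))
```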

This is what the beta distribution looks like for varying values of $\alpha$ and $\beta$ (figure pulled from the Wikipedia page on the beta distribution):

Here we see that using the Beta distribution gives us a range of different priors that we can play with.

How do we derive $p(\theta|x_1, ..., x_n)$?

\begin{align*} p(\theta \mid x_1, \ldots, x_n) = & \frac{p(x_1, \ldots, x_n \mid \theta)\ p(\theta)}{p(x_1, \ldots, x_n)} && \text{by Bayes' rule} \\ = & \frac{p(x_1, \ldots, x_n \mid \theta)\ p(\theta)}{\int p(x_1, \ldots, x_n, \theta)\, d\theta} && \text{by marginalization}\\ = & \frac{p(x_1, \ldots, x_n \mid \theta)\ p(\theta)}{\int p(x_1, \ldots, x_n \mid \theta)\ p(\theta)\, d\theta} && \text{by the conditional formula } p(A,B)=p(A \mid B)\, p(B)\\ = & \frac{\prod_m p(x_m \mid \theta)\ p(\theta)}{\int\prod_m p(x_m \mid \theta)\ p(\theta)\, d\theta} && \text{by the assumption of i.i.d. draws} \\ \end{align*}

Now, we know $p(\theta)$, which is our Beta prior, AND we know that $p(x \mid \theta)$ is our Bernoulli likelihood. If you work out the math (I will cheat and assume that I already did), you get $p(\theta \mid x_1, \ldots, x_n) = Beta(\alpha+\sum_{i=1}^{n} x_i,\ \beta+n-\sum_{i=1}^{n} x_i)$.

Notice that the prior distribution $p(\theta)$ has the same distributional form as the posterior $p(\theta \mid x_1, \ldots, x_n)$ - that is, both are beta distributions. Formally, in Bayesian statistics this type of relationship is known as conjugacy. https://en.wikipedia.org/wiki/Conjugate_prior

Note that if $\theta$ instead had a discrete distribution (such as over a grid of values) rather than a continuous prior, the integral would become a summation - i.e., the last line would instead be:

$$p(\theta \mid x_1,... x_n) =\frac{\prod_m p(x_m | \ \theta_i)\ p(\theta_i)}{\sum_i\prod_m p(x_m | \ \theta_i)\ p(\theta_i)}$$
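Here is a small sketch (not in the original notebook) of that discrete-grid computation, using hypothetical data of 7 successes out of $n = 10$ Bernoulli draws and a uniform prior over the grid; its posterior mean should land close to the mean of the closed-form $Beta(\alpha+\sum x_i,\ \beta+n-\sum x_i)$ posterior:

```python
import numpy as np

#hypothetical data: 7 successes out of n = 10 Bernoulli draws
n, xsum = 10, 7
priorAlpha, priorBeta = 1, 1

#grid of candidate theta values, with a uniform prior over the grid
thetas = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(thetas) / len(thetas)

#unnormalized posterior: likelihood times prior at each grid point,
#normalized by the sum (the discrete analogue of the integral)
likelihood = thetas**xsum * (1 - thetas)**(n - xsum)
posterior = likelihood * prior
posterior /= posterior.sum()

#the grid posterior mean should be close to the mean of the closed-form
#Beta(priorAlpha + xsum, priorBeta + n - xsum) posterior
grid_mean = (thetas * posterior).sum()
exact_mean = (priorAlpha + xsum) / (priorAlpha + priorBeta + n)
print(grid_mean, exact_mean)
```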

Now, let's play around with our newly derived posterior!

In [103]:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

#we will first draw 100 samples
n=100
theta=0.3

#let's also use a Beta(1, 1) prior; we will derive the posterior and graph it
priorAlphaValue=1
priorBetaValue=1
rand_samples=stats.bernoulli.rvs(theta, size=n)

#let's take a look at the random samples we drew
print("random samples:")
print(rand_samples)

xsum=sum(rand_samples)
newAlphaValue=priorAlphaValue+xsum
newBetaValue=priorBetaValue+n-xsum

xAxisVals = np.linspace(stats.beta.ppf(0.01, newAlphaValue, newBetaValue), stats.beta.ppf(0.99, newAlphaValue, newBetaValue), 100)
plt.plot(xAxisVals, stats.beta.pdf(xAxisVals ,newAlphaValue , newBetaValue), '-' , label="Posterior")
plt.plot((0.3,0.3), (0,10),  '--', label="True Theta")
plt.legend()
plt.show()

random samples:
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0
1 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1
1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0]


Not bad! We see that the distribution is centered pretty close to the true parameter $\theta$.

In [94]:
#Let's use the specific alpha, beta values shown in the graph above AND use a range of sample sizes from 3 to 500

fig, ax=plt.subplots(5,5, sharex=True, sharey=True)

theta=0.3
priorValues=[[0.5,0.5], [5,1], [1,3], [2,2], [2,5]]
sampleSizes=[3, 10, 50, 100, 500]

#draw enough random samples for the largest sample size
rand_samples=stats.bernoulli.rvs(theta, size=sampleSizes[-1])

priorColorMap={0:'r', 1:'b', 2:'g', 3:'m', 4:'orange'}

for ind in range(len(sampleSizes)):

    #given the sample size, extract the samples from the superset
    n=sampleSizes[ind]
    sample=rand_samples[0:n]

    for ind2 in range(len(priorValues)):
        priorAlphaValue=priorValues[ind2][0]
        priorBetaValue=priorValues[ind2][1]

        #derive the new alpha and beta values from the posterior form given above
        xsum=sum(sample)
        newAlphaValue=priorAlphaValue+xsum
        newBetaValue=priorBetaValue+n-xsum

        #plot the subfigure
        color=priorColorMap[ind2]
        xAxisVals = np.linspace(stats.beta.ppf(0.01, newAlphaValue, newBetaValue),
                                stats.beta.ppf(0.99, newAlphaValue, newBetaValue), 100)
        ax[ind, ind2].plot(xAxisVals, stats.beta.pdf(xAxisVals, newAlphaValue, newBetaValue),
                           color=color, linestyle='-')
        ax[ind, ind2].axvline(x=0.3, color='k', linestyle='--')

        #set the row and column labels
        if ind2 == 0:
            ax[ind, ind2].set_ylabel("n={0}".format(n))
        if ind == len(ax)-1:
            ax[ind, ind2].set_xlabel("alpha={0}, beta={1}".format(priorAlphaValue, priorBetaValue))

plt.show()


A common question that arises in Bayesian inference is: how important is the prior? The short answer is that, given enough data, the prior isn't very important.

Here, we see that with more and more data, the data "dominates" the prior - that is, under certain mathematical conditions, the posterior distribution will converge to a result centered on the true theta value independent of the prior distribution used. To see why theoretically, check out the Bernstein-von Mises theorem. https://en.wikipedia.org/wiki/Bernstein%E2%80%93von_Mises_theorem

In the last homework, we used a second-order Markov model to evaluate the probability of sequences, but what was its link to the full probability model?

Assume that $A_1A_2A_3A_4A_5$ is your sequence of interest (that is, each $A_i$ is one of the bases A, T, C, or G). Now, we want to find the probability of specific sequences, $P(A_1A_2A_3A_4A_5)$, such as $P(ATCGC)$ or $P(AAAAA)$. How do we go about doing that? Let's try to derive the probability equation!

\begin{align*} p( A_1,... A_5) = & p(A_5, A_4, A_3, A_2 \mid A_1)\ p(A_1)\\ = & p(A_5, A_4, A_3 \mid A_1, A_2)\ p(A_2 \mid A_1)\ p(A_1) \\ = & p(A_5, A_4 \mid A_1, A_2, A_3)\ p(A_3 \mid A_1, A_2)\ p(A_2 \mid A_1)\ p(A_1) \\ = & p(A_5 \mid A_1, A_2, A_3, A_4)\ p(A_4 \mid A_1, A_2, A_3) p(A_3 \mid A_1, A_2)\ p(A_2 \mid A_1)\ p(A_1) \\ \end{align*}

Note that going from line to line, we just repeatedly use the equation $P(X, Y)=P(X\mid Y)\,P(Y)$ (conditioned on the variables already peeled off) on the term immediately to the right of the equals sign. For example, to go from the first line to the second, I used $p(A_5, A_4, A_3, A_2 \mid A_1)=p(A_5, A_4, A_3 \mid A_1, A_2)\ p(A_2 \mid A_1)$.

Whoa, there's a pattern in the above equation! More generally, this is known as the chain rule of probability:

$$p(A_1, A_2, ..., A_n)=\prod_{k=1}^{n}p(A_k|A_1, ..., A_{k-1})$$

The above can be proved by mathematical induction.
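As a quick numerical sanity check (not in the original notes), we can verify the chain rule on a small random joint distribution over three binary variables:

```python
import numpy as np

rng = np.random.default_rng(0)

#a random joint distribution p(A1, A2, A3) over three binary variables
joint = rng.random((2, 2, 2))
joint /= joint.sum()

#marginals needed for the chain-rule factors
p1 = joint.sum(axis=(1, 2))   #p(A1)
p12 = joint.sum(axis=2)       #p(A1, A2)

#chain rule: p(a1, a2, a3) = p(a1) * p(a2|a1) * p(a3|a1, a2)
for a1 in range(2):
    for a2 in range(2):
        for a3 in range(2):
            factored = p1[a1] * (p12[a1, a2] / p1[a1]) * (joint[a1, a2, a3] / p12[a1, a2])
            assert np.isclose(factored, joint[a1, a2, a3])
print("chain rule verified on all 8 outcomes")
```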

In the previous homework, we assumed a second-order Markov model: the current base depends only on the two bases immediately before it. Thus, the last equation is approximated by the following:

$p( A_1,... A_5) \approx p(A_5 \mid A_3, A_4)\ p(A_4 \mid A_2, A_3) p(A_3 \mid A_1, A_2)\ p(A_2 \mid A_1)\ p(A_1)$
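To make the bookkeeping concrete, here is a sketch using made-up conditional probabilities (purely illustrative, not estimated from any real data); the toy `p_second` table favors repeating the previous base:

```python
import itertools

#hypothetical conditional probabilities: NOT estimated from real data,
#purely to illustrate the second-order factorization above
bases = "ATCG"
p_initial = {b: 0.25 for b in bases}                    #p(A1)
p_first = {(a, b): 0.25 for a in bases for b in bases}  #p(A2 | A1)

#p(Ak | A_{k-2} A_{k-1}): a toy table that favors repeating the last base
p_second = {}
for ctx in itertools.product(bases, repeat=2):
    for b in bases:
        p_second[ctx + (b,)] = 0.4 if b == ctx[1] else 0.2

def markov2_prob(seq):
    """p(seq) under the second-order Markov factorization."""
    prob = p_initial[seq[0]] * p_first[(seq[0], seq[1])]
    for k in range(2, len(seq)):
        prob *= p_second[(seq[k - 2], seq[k - 1], seq[k])]
    return prob

print(markov2_prob("AAAAA"))  #0.25 * 0.25 * 0.4^3 = 0.004
print(markov2_prob("ATCGC"))
```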

Let $X_1, ...., X_n$ be independently drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$.

Let $\bar{X}=\frac{1}{n} \sum_{i=1}^n X_i$

Let $S^{2}=\frac{1}{n-1} \sum_{i=1}^n (X_i-\bar{X})^2$

Sean mentioned in class that $\frac{\bar{X}-\mu}{S/\sqrt{n}}$ is t-distributed with $\nu = n-1$ degrees of freedom. The t-distribution has the pdf $$\frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\ \Gamma(\frac{\nu}{2})} \left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$
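As a quick check (not in the original notes), we can evaluate this formula directly with `scipy.special.gamma` and compare it against `scipy.stats.t.pdf`:

```python
import numpy as np
from scipy import stats
from scipy.special import gamma

#evaluate the t pdf formula at a few points for nu = 3 degrees of freedom
nu = 3.0
x = np.linspace(-4, 4, 9)
pdf_formula = (gamma((nu + 1) / 2) / (np.sqrt(nu * np.pi) * gamma(nu / 2))
               * (1 + x**2 / nu) ** (-(nu + 1) / 2))

#should agree with scipy's built-in t distribution to machine precision
print(np.max(np.abs(pdf_formula - stats.t.pdf(x, nu))))
```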

This looks pretty complicated, and notably, the t-distribution doesn't seem to depend on $\mu$ or $\sigma^2$, even though the original data were generated from a normal with those parameters. You, being the skeptic, decide to test out his claim:

In [107]:
#draw n random samples from a normal distribution with mean mu and standard deviation sd

numIterations=10000
n=4
mu=10
sd=2
df=n-1

#for each iteration, draw n samples from the normal and compute the t statistic
all_statistics=[]
for ind in range(numIterations):
    x_vals = np.random.normal(mu, sd, n)
    xbar = np.mean(x_vals)

    #calculate the numerator: Xbar minus mu
    numerator = xbar - mu

    #calculate the denominator: S / sqrt(n)
    S = np.sqrt(np.sum((x_vals - xbar)**2) / (n - 1))
    denominator = S / np.sqrt(n)

    all_statistics.append(numerator / denominator)

print("What the data looks like:")
print(all_statistics[0:10])
plt.figure()
#plot a histogram of the statistic just calculated and see how close it is to the t-distribution
plt.hist(all_statistics, alpha=0.5, bins=50, density=True)
xAxisVals = np.linspace(stats.t.ppf(0.001, df), stats.t.ppf(0.999, df), 100)
plt.plot(xAxisVals, stats.t.pdf(xAxisVals, df), '-', label="True T distribution")
plt.legend()
plt.show()

What the data looks like:
[1.7066295893477978, 3.3342072827578781, -0.4799627556882175, -1.1409465147667479, 1.0693070933117783, 0.84861606588780913, -0.23508053386915861, 0.39006626473583011, -0.13111684191674525, -0.21979336482270478]


OK, this looks pretty similar to a t-distribution; you believe Sean more now. More importantly, note that as we increase the number of iterations, the histogram converges more and more to the t-distribution, and this holds regardless of how we set $n$ (the number of samples drawn from the normal in each iteration), as long as we compare against a t-distribution with $n-1$ degrees of freedom - try playing around with different values of $n$!

This is one way that the t-distribution arises. In the homework, you will see that the t-distribution also arises from a Bayesian way of thinking: it is the posterior distribution of $\mu$ after marginalizing out the contribution of $\sigma^2$.
