Week 6: Section

Irina Shlosman 2019

with special thanks to Kevin Mizes 2018 for inspiration on Z-score and t-statistic calculations

You can download these notes in Jupyter notebook .ipynb format.

On the agenda:

  • A little bit about p-values: pitfalls and challenges
  • Bayesian statistics revisited: calculating MLE
  • The t-distribution and the intuition behind it
  • Pseudo-coding the advanced game
In [1]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from scipy.stats import binom
from scipy.stats import norm
from scipy.special import comb

Let's talk a little bit about p-values and what information they can - and, more importantly, cannot! - provide.

Here is a little scenario that we can play with to illustrate the point.

Imagine now that you are the editor-in-chief of a fashion magazine ToutChic based in your native Boston. Your heart yearns for Parisian fashion, and you do all you can to bring the latest trends from the City of Light to the City of the Red Sox. Despite all your efforts, the Bostonians are slow to become versed in French, misspell the name of your magazine - too chic? two chicks? - and refuse to stop wearing white pants after Labor Day.

A recent article comes out from your main competitor, the fashion magazine Runway, claiming that, in fact, Bostonians are at the top of their game. It reports two important results:

  1. Only 10% of the population cannot tell the difference between 'turquoise' and 'cerulean' - or blatantly claim that the difference does not exist!

  2. On average, Bostonians fail to switch from sweats to suits when they go out only 17 times a month, with a standard deviation of 5.

You have a much harsher view of reality, so you turn to science to investigate the levels of fashion illiteracy in Boston for yourself.

You carry out an online survey, and use as your criteria the following questions:

  1. "Have you worn white pants in the month of October?". This is your best attempt at a color-blind-friendly version of the question, and you are fairly certain that only someone with a poor sense of the appropriate could wear white pants outside of the Memorial-Labor Day window.

  2. "How many times this month did you wear sweatpants outside of the confines of your home?"

  3. "Would you consider yourself fashionable? Evaluate your style on a 0-50 scale."

You collect your responses and find that:

For Question 1:

Out of $N$ = 978 participants,

$k$ = 129 self-reported as having worn white pants at least once.

You take your null hypothesis to be the previously reported statistic of p = 0.1, and assume that the data are binomially distributed.

For Question 2, you assume that the data are normally distributed, and under the null hypothesis have $\mu = 17$, $\sigma = 5$. From your questionnaire, you obtain the following array of values $\{x_1 \dots x_N\}$, where $N$ = 834:

In [2]:
# simulate the questionnaire responses; these "true" parameters
# (unknown to the surveyor) will be recovered by MLE below
mu = 25
sigma = 7
size = 834
resp = np.random.normal(mu, sigma, size)

For Question 3, unfortunately, your response pool is pretty sparse, because most participants were so offended by the question that they just wrote "Of course!" without bothering to evaluate themselves. Only a few gave numerical responses, but for this highly meticulous population, the answers were unnecessarily precise - 47.8, 32.33, etc.

P-values

To get a quick idea of where your data lie with respect to the null hypothesis, you first quickly plot the null-hypothesis distribution and your result on the same graph.

For your first binary question:

In [3]:
n = 978
p = 0.1
k = 129
ar = np.arange(0,n+1)  # the binomial support is 0..n inclusive
In [4]:
figure, ax = plt.subplots(1,2, sharex=True, figsize=(10,4))
binom_pdf = np.zeros(len(ar))
binom_cdf = np.zeros(len(ar))
for i in ar:
    binom_pdf[i] = comb(n,i) * p**i * (1-p)**(n-i)
    binom_cdf[i] = binom.cdf(i,n,p)
ax[0].plot(ar, binom_pdf, 'k')
ax[0].axvline (x=k, color='r')
ax[0].set_xlim((0,300))
ax[0].set_xlabel('Number of people in white pants')
ax[0].set_ylabel('P under the null hypothesis')
ax[0].set_title('PDF', fontsize=15)
ax[0].legend(['PDF|H0', 'Observed'])


ax[1].plot(ar, binom_cdf,'k')
ax[1].axvline (x=k, color='r')
ax[1].set_xlabel('Number of people in white pants')
ax[1].set_ylabel('Cumulative P under the null hypothesis')
ax[1].set_title('CDF', fontsize=15)
ax[1].legend(['CDF|H0', 'Observed'])
Out[4]:
<matplotlib.legend.Legend at 0x1a1712e908>

Just taking a look at this distribution confirms your ideas - clearly, p cannot be 10%. But you are a careful scientist and don't trust anything but numbers, so you decide to calculate the p-value of your result to assess its significance.

The p-value is the probability of observing a value at least as extreme as the one you got, given the null hypothesis:

p-value = $P(x \geq k | H_0)$

In other words, it is the area under the PDF curve to the right of the red line in the graph above.

Alternatively, if you are looking at the CDF curve, it is the difference between the maximum value of the function (i.e. 1) and the value at the red line.

In Python, you calculate your p-value using the survival function. (Strictly, for a discrete distribution, binom.sf(k,n,p) returns $P(x > k)$; $P(x \geq k)$ would be binom.sf(k-1,n,p), which is slightly larger but leads to the same conclusion here.)

In [5]:
binom.sf(k,n,p)
Out[5]:
0.0005783768321020854
In [6]:
1-binom.cdf(k,n,p)
Out[6]:
0.0005783768321020455
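As an aside, sf is also the numerically safer of the two deep in the tail, where 1 - cdf loses all precision to floating-point cancellation (the cutoff of 200 below is just an illustrative extreme value):

In [ ]:
# sf is computed directly, so it stays meaningful far out in the tail
print(binom.sf(200, n, p))       # a tiny but nonzero tail probability
# here the cdf rounds to exactly 1.0 in double precision, so this underflows to 0.0
print(1 - binom.cdf(200, n, p))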

A p-value of ~0.0005 seems to you very significant, so you decide to publish an article in ToutChic summarizing your results. For that purpose, you'd like to formulate your own hypothesis about the levels of fashion illiteracy, since the null hypothesis is "clearly" wrong.

Rephrasing the question in the language of Bayesian statistics, you ask yourself 'What value of $p$ best explains my data?' and set out to calculate the Maximum Likelihood Estimate for the binomial probability distribution.

Intuitively, you think that the result is probably just:

$\hat{p} = \frac{k}{N}$

But you'd like to go through the math carefully before you jump to any conclusions:

MLE for a binomial distribution

You want to find the value $p$ that maximizes the likelihood of the observed data, i.e.

$P(Data|p) = P(k,N|p) = {N \choose k}p^k (1-p)^{N-k} $

This involves taking a derivative of $P(Data|p)$ with respect to $p$ and setting it to 0:

$\frac{d}{dp}P(k,N|p) = {N \choose k}\left[kp^{k-1}(1-p)^{N-k} - (N-k)p^k(1-p)^{N-k-1}\right] $

This is a hairy mess, so you switch to operating with logarithms instead - maximizing the log-likelihood is the same as maximizing the likelihood, since the logarithm is a monotonically increasing function.

$\log P(k,N|p) = \log{N \choose k}+k\log(p)+(N-k)\log(1-p) $

$\frac{d}{dp}\log P(k,N|p) = \frac{k}{p} - \frac{N-k}{1-p} $

Setting this to zero and solving for $\hat{p}$:

$\frac{k}{p} - \frac{N-k}{1-p} = 0 $

$k-kp = Np - kp$

gives:

$\hat{p} = \frac{k}{N}$
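As a quick numerical sanity check (a minimal sketch; the grid resolution is arbitrary), we can scan the binomial log-likelihood over a grid of candidate $p$ values and confirm that it peaks at $k/N$:

In [ ]:
# scan the log-likelihood over a grid of candidate p values
p_grid = np.linspace(0.01, 0.99, 981)
loglik = binom.logpmf(k, n, p_grid)
print(p_grid[np.argmax(loglik)])  # ~0.132, i.e. k/n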

As you suspected all along. You calculate the reassessed value of $p$, and plot your $H1$ distribution on the same plot as $H0$ for comparison, claiming that the new distribution predicts your data much better, and, therefore, must be true.

In [7]:
p_h1 = k/n
p_h1
Out[7]:
0.13190184049079753
In [8]:
pnew = np.zeros(len(ar))
for i in ar:
    pnew[i] = comb(n,i) * p_h1**i * (1-p_h1)**(n-i)
plt.plot(ar, binom_pdf, 'k')
plt.plot(ar, pnew, 'r')
plt.axvline(x=k, color='k', linestyle='--')
plt.xlabel('Number of people in white pants')
plt.ylabel('PDF')
plt.legend(['H0', 'new H1'])
plt.xlim((0,300))
Out[8]:
(0, 300)

Let's pause here to chat quickly about the flow of logic in these conclusions.

  • Does the fact that our data reject the null hypothesis automagically validate H1?

  • The H1 PDF describes this particular experiment best (because it's the MLE!). Does that mean it is the correct model?

  • How many alternative distributions are potentially consistent with our observed data?

  • What happens if our null hypothesis is poorly formulated or incorrect? Or our experiment does not directly test the null hypothesis?

E.g., you find out that the fall of 2019 was so hot that even people with excellent taste in clothing would wear white pants in October if they were going to a pineapple grilling party - which effectively brought the proportion of white-pant wearers up to p = 0.13, but did not change the proportion of unfashionable people (still p = 0.1).

  • Finally, what if the priors on your two hypotheses - H0 and H1 - were not uniformly distributed?

What about your second result?

In [9]:
mu_h0 = 17
sigma_h0 = 5
x_range = np.arange(0,50,0.5)
gauss = norm.pdf(x_range, mu_h0, sigma_h0)
plt.plot(x_range, gauss, 'k')
plt.hist(resp, bins=25, density=True)  # the responses collected in Question 2
plt.legend(['H0 PDF','Data'])
Out[9]:
<matplotlib.legend.Legend at 0x1a1aaf6cc0>

Your histogram is clearly distinct from the distribution built from the values reported by Runway.

Again, you set out to calculate the MLE of the parameters ($\mu$ and $\sigma$) that would explain your data best.

MLE for a normal distribution

Just like with the binomial distribution, you would like to maximize the likelihood of the observed data with respect to the parameters that describe it.

For a single point:

$P(x_i|par) = P(x_i | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} exp(-\frac{(x_i-\mu)^2}{2\sigma^2})$

However, we observed $N$ points, generated independently, so the total probability of the data is:

$P(x | \mu, \sigma^2) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(- \sum_{i=1}^N \frac{(x_i-\mu)^2}{2\sigma^2}\right) $

Converting everything to log form:

$\log P(x| \mu, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \sum_{i=1}^N \frac{(x_i-\mu)^2}{2\sigma^2} $

  • Finding MLE of $\mu$:

$\frac{d}{d\mu} \log P(x|\mu, \sigma^2) = \sum_{i=1}^N \frac{(x_i-\mu)}{\sigma^2} = 0 $ $\rightarrow \sum_{i=1}^N x_i = N \mu$

$\hat{\mu} = \frac{\sum_{i=1}^N x_i}{N}$

  • Finding MLE of $\sigma^2$:

$\frac{d}{d\sigma^2} \log P(x|\mu, \sigma^2) = -\frac{N}{2\sigma^2} + \sum_{i=1}^N \frac{(x_i-\mu)^2}{2(\sigma^2)^2} = 0 $

$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N (x_i-\mu)^2$

Again, as expected! You then calculate your MLE parameters and get:

In [10]:
mu_h1 = np.mean(resp)
# sigma MLE computed from the known true mu (see the note below)
sigma_h1 = np.sqrt(np.mean((resp - mu)**2))

Note that in the calculation above we compute sigma from the known true $\mu$, not from the MLE $\hat{\mu}$.

Since I cheated and generated the samples from a Gaussian with known $\mu$ and $\sigma$, we can directly compare the accuracy of our MLE estimates:

In [11]:
print('MLE estimate of mu: ', mu_h1, '\n','True mu: ', mu)
print('MLE estimate of sigma: ', sigma_h1, '\n','True sigma: ', sigma)
MLE estimate of mu:  24.82987125541501 
 True mu:  25
MLE estimate of sigma:  6.63440351220052 
 True sigma:  7

What would you do, though, if you didn't have the true sigma or mu?

You'd just have to calculate them both directly from the data.

Your $\mu$ should be the same, but the estimated $\sigma^2$ now becomes:

$\hat{\sigma^2} = \frac{1}{N}\sum_{i=1}^N (x_i-\bar x)^2$

In [12]:
sigma_est = np.sqrt(np.mean((resp - mu_h1)**2))  # uses the MLE mean, not the true mu
In [13]:
print('Estimate of sigma: ', sigma_est, '\n','True sigma: ', sigma)
Estimate of sigma:  6.632221812708357 
 True sigma:  7
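As a cross-check, scipy can produce both MLEs in a single call - norm.fit returns exactly the sample mean and the 1/N-normalized standard deviation:

In [ ]:
mu_fit, sigma_fit = norm.fit(resp)
print(mu_fit, sigma_fit)  # should match mu_h1 and sigma_est above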

Both are pretty good (because we are looking at a large, population-size sample, so $\bar x \approx \mu$).

Having completed all your laborious calculations, you finally get to Question 3. You wonder if it is even useful to consider the responses that you got, because there are only 5 of them.

In [14]:
resp_q3 = np.random.normal(25, 12,5)
print(resp_q3)
np.savetxt('resp_q3.txt', resp_q3)
[ 5.66182443  8.87480914 19.99917747 19.95596925  9.96964368]

You suspect that these values come from a Gaussian distribution, so you attempt to calculate its parameters from the samples directly:

In [15]:
print(np.mean(resp_q3))
print(np.std(resp_q3))  # note: np.std defaults to the 1/n normalization (ddof=0)
12.892284794853419
5.955922436228044

The t-distribution

However, directly fitting a Gaussian distribution to a small number of samples can give you inaccurate parameter estimates! In fact, these values are quite a bit off - recall that the true parameters were 25 and 12...

You are aware of this, and after a brief perusal of the web, zero in on the wiki page that seems highly relevant for your analysis:

Student's t-distribution

From there you learn that, when the population standard deviation has to be estimated from a small sample, the standardized sample mean follows a t-distribution rather than a Gaussian: the t-distribution accounts for the extra uncertainty of estimating the variance from the sample, which the Gaussian does not model.

The t-statistic is built from the same sample mean as the Gaussian MLE:

$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$

But from a different estimate of $\sigma^2$:

$\bar{s}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 $

Two things to notice here:

  1. the normalization factor $\frac{1}{n} \rightarrow \frac{1}{n-1}$. For an explanation of where the $n-1$ comes from, see Bessel's correction (and the short simulation right after this list)
  2. the calculation of the variance now involves the sample-determined $\bar{x}$, not the true $\mu$
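Here is that simulation of Bessel's correction (a minimal sketch; the seed, the true variance of 4, and the 100,000 repetitions are arbitrary choices). Averaged over many small samples, the 1/n estimator is biased low, while the 1/(n-1) estimator recovers the true variance:

In [ ]:
rng = np.random.default_rng(0)             # arbitrary seed for reproducibility
sigma2_true = 4.0
samples = rng.normal(0, np.sqrt(sigma2_true), size=(100000, 5))

xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar)**2).sum(axis=1)     # sum of squared deviations per sample

print('1/n estimate:    ', (ss / 5).mean())   # biased low, ~ (n-1)/n * 4 = 3.2
print('1/(n-1) estimate:', (ss / 4).mean())   # unbiased, ~ 4.0
# np.var(x, ddof=1) and np.std(x, ddof=1) use the same 1/(n-1) normalization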

Following wiki's very helpful section How Student's distribution arises from sampling, let's look at the difference between the distribution of the sample variance and the population variance.

In pseudo-code:

  • Generate a small random sample (n=5) from a Gaussian with parameters $\mu$ and $\sigma^2$
  • Calculate the mean of the sample by averaging over all points: $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$

  • Calculate the variance of the sample using the Bessel-corrected formula (with the sample-determined $\bar x$): $\bar{S}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 $

  • Calculate the Z-score and the t-statistic:

For normal: $\frac{\bar x - \mu}{\sigma/\sqrt n} $

For t: $\frac{\bar x - \mu}{\bar S/\sqrt n} $

  • Repeat 10,000 times.
In [16]:
# Distribution parameters
mu, sigma = 0, 1
n = 5
Z = []
T = []
S2_all = []
xbar_all = []
for i in range(10000):
    # draw a small sample
    samples = np.random.normal(mu, sigma, n)

    # calculate the sample mean and the Bessel-corrected sample variance
    xbar = np.mean(samples)
    S2 = np.sum((samples - xbar)**2) / (n - 1)
    S = np.sqrt(S2)    # the sample standard deviation

    # calculate our standardized random variables
    # (note that the denominators use standard deviations, not variances)
    Z.append( (xbar - mu) / (sigma / np.sqrt(n)) )
    T.append( (xbar - mu) / (S / np.sqrt(n)) )

    S2_all.append(S2)
    xbar_all.append(xbar)
    
    
fig, ax = plt.subplots()
ax.hist(Z, alpha=.3, bins=np.linspace(-10,10,100), label='Z', density=True)
ax.hist(T, alpha=.3, bins=np.linspace(-10,10,100), label='T', density=True)
ax.set_ylabel('Probability')
ax.set_xlim(-8,8)
ax.legend()
Out[16]:
<matplotlib.legend.Legend at 0x1a1ab91668>

A standard score is a measure of "the number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured" (wiki, standard score). More simply put, both the Z-score and the t-statistic tell you how far the sample mean falls from the hypothesized population mean, measured in units of the (estimated) standard error.

We notice that the distributions of the Z-score and the t-statistic have different shapes: the t-distribution is lower around the mean and "fatter" in the tails than a Gaussian.

How does the t-distribution arise? Let's develop a little bit of a mathematical intuition about it.

First, let's take a closer look at what we are calculating for the Z-score and the t-statistic.

Z-score:

$\frac{\bar x - \mu}{\sigma/\sqrt n} $

Involves calculating the mean ($\bar x$) of a sample of normally distributed variables, and then shifting and scaling that value by the constants $\mu$ and $\sigma/\sqrt n$.

t-statistic:

$\frac{\bar x - \mu}{\bar S/\sqrt n} $

Involves calculating the mean of a sample of normally distributed variables, and then dividing that value by $\bar S/\sqrt n$, where $\bar S$ is not a constant.

Let's take a look at the distributions of $\bar S^2$ and $\bar x$:

In [17]:
fig, ax = plt.subplots(1,2, sharey=True)
ax[0].hist(xbar_all, density=True, bins=np.linspace(-3,3,100))
ax[0].set_title('Sample mean distribution')
ax[0].set_ylabel('P')
ax[1].hist(S2_all, density=True, bins=np.linspace(-3,3,100))
ax[1].set_title('Sample variance distribution')
Out[17]:
Text(0.5,1,'Sample variance distribution')

Interesting... The sample mean is gaussian distributed, but the sample variance is not. Why is that - we're drawing the samples from a gaussian, after all...

To get the sample mean, we essentially sum up Gaussian-distributed values and then divide by a constant. A sum of Gaussian random variables is itself a Gaussian variable (like all good things, you can find the proof on wiki), from which it follows that both the sample mean and the Z-score should be Gaussian-distributed. And they are!

Calculating $\bar S^2$ involves summing the squares of normal random variables, which gives a $\chi^2$-distributed quantity instead (Chi-squared distribution). Dividing a Gaussian-distributed variable by the square root of an independent $\chi^2$-distributed variable, scaled by its degrees of freedom, yields a variable that follows a t-distribution.
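We can verify this recipe directly (a minimal sketch; the seed and the choice of df = 4, i.e. n - 1 for our n = 5, are assumptions): building Z / sqrt(chi2/df) by hand reproduces scipy's t density:

In [ ]:
rng = np.random.default_rng(1)    # arbitrary seed
df = 4                            # degrees of freedom: n - 1 for n = 5

# build t-distributed samples by hand: Z / sqrt(chi2 / df)
z = rng.normal(size=100000)
chi2 = rng.chisquare(df, size=100000)
t_manual = z / np.sqrt(chi2 / df)

xs = np.linspace(-6, 6, 200)
plt.hist(t_manual, bins=np.linspace(-6, 6, 100), density=True, alpha=0.3, label='Z / sqrt(chi2/df)')
plt.plot(xs, stats.t.pdf(xs, df), 'k', label='stats.t.pdf, df=4')
plt.legend()
plt.show()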

Phew! We're finally there. All these mathematical subtleties aside, what we can now clearly see is that the t-distribution gives more weight to extreme values than a Gaussian, and thus better captures the variability we actually observe with small sample sizes. Keep in mind, though, that the t-distribution describes the standardized sample mean, not the population itself: if we read its heavy tails as the spread of the underlying Gaussian-distributed population, we would overestimate that spread.

Pseudo-coding the advanced game

Ok, with these considerations in the back of our minds, let's pseudo-code our approach to finding the parameters that best describe our data using Bayesian statistics (...this might come in handy for Problem 2 on the pset):

  • With n observations $x_1...x_n$ (for us, it's n=5)
  • Calculate the posterior probability of each set of parameters $\mu_i$ and $\sigma_j$, given the samples observed

For one point and one set of parameters:

$P(\mu_i, \sigma_j|x_k) = \frac{P(x_k| \mu_i, \sigma_j)\,p(\mu_i, \sigma_j)}{\sum_{i,j}P(x_{k}|\mu_{i},\sigma_{j})\,p(\mu_{i},\sigma_{j})}$

Let's assume that the prior is uniform over all parameter values, so we can just cancel it out - great!

Since the points are generated independently, to calculate $P(\mu_i, \sigma_j|X)$ we will need to compute the product of individual $x_k$ probabilities. Or, more conveniently, if we operate with logs, the sum of their individual log probabilities.

$\log P(\mu_{i},\sigma_{j}|X) = \log \frac{P(X|\mu_{i},\sigma_{j})}{\sum_{i,j}P(X|\mu_{i},\sigma_{j})} = \log P(X|\mu_{i},\sigma_{j}) - \log \sum_{i,j}P(X|\mu_{i},\sigma_{j})$

  • To calculate the likelihood (1st) term: $\log P(X|\mu_{i},\sigma_{j})$

Loop through our mean-variance grid, calculating for each cell in the grid $\sum_{k}\log P(x_k|\mu_{i},\sigma_{j})$

(hint: something like np.sum(stats.norm.logpdf(X,loc=mu,scale=sigma)))

  • To calculate the normalization (2nd) term: $\log \sum_{i,j}P(X|\mu_{i},\sigma_{j})$

Well, we already have all the individual $\log P(X|\mu_{i},\sigma_{j})$

So, we can exponentiate them out to get $P(X|\mu_{i},\sigma_{j}) = e^{\log P(X|\mu_{i},\sigma_{j})}$

Sum these probabilities and take the logarithm again:

$\log \sum_{i,j}P(X|\mu_{i},\sigma_{j}) = \log \sum_{i,j}e^{\log P(X|\mu_{i},\sigma_{j})}$

(hint: a convenient python function for this is scipy.special.logsumexp)

So, the posterior for each cell of the grid is the exponential of (1st term - 2nd term).
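Putting the pieces together, here is a minimal sketch of how such a probability grid could be computed (reading the five responses back from resp_q3.txt and the exact grid ranges are assumptions; the notebook itself loads a precomputed grid below):

In [ ]:
from scipy.special import logsumexp

X = np.loadtxt('resp_q3.txt')        # the five numerical responses saved earlier
mu_grid = np.linspace(1, 50, 20)
sigma_grid = np.linspace(1, 50, 20)

# likelihood term: sum of log probabilities for every (mu, sigma) cell
loglik = np.array([[np.sum(stats.norm.logpdf(X, loc=m, scale=s))
                    for s in sigma_grid] for m in mu_grid])

# normalization term via logsumexp, then exponentiate to get the posteriors
log_post = loglik - logsumexp(loglik)
model_prob = np.exp(log_post)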

In [18]:
# load the calculated probability grid
model_prob = np.loadtxt('model_prob.txt')

# for seaborn heatmap
mu_grid = np.linspace(1,50,20)
sigma_grid = np.linspace(1,50,20)[::-1]
In [19]:
import seaborn as sns
sns.heatmap(data=model_prob.T,xticklabels=mu_grid.astype(int),\
            yticklabels=sigma_grid.astype(int),cmap="Blues")
plt.xlabel(r"$\mu$")
plt.ylabel(r"$\sigma$")
plt.show()

You will notice that our prediction is now much better! Yay!
