# Section 06: pvalues

*Notes & Jupyter notebook by Aoyue Mao (2021), adapted from past years' notes by Danylo Lavrentovich, Irina Shlosman, and Kevin Mizes*

# Week 06 section

In this week's lecture, we learned what p-values actually mean: given a null hypothesis about how the data are generated, a p-value tells us how surprising a particular observation is.

This section will not focus on p-values. Instead, it will cover general features of parameter estimation, review Bayes rule, introduce the T distribution, and provide a small hint for the p-set.

There is also a Jupyter notebook that covers all of the above more densely, along with practice problems, an example p value calculation, and some interactive functions that you can run/modify to explore null vs. alternative hypotheses, visualize estimation procedures, etc. Here's the notebook: w06_section.ipynb. To run the pre-written functions you need to download this helper file: w06_section_utils.py.

## Parameter Estimation/Inference

Given some data \(D=\{X_1,X_2,\ldots,X_n\}\), we'd like to describe the process that produced the data. Here are two example scenarios with different probability distributions.

### Example 1: Weights of Siberian huskies (Normal)

Weight is a continuous random variable. Based on our experience, it follows a normal distribution. Thus, the parameter estimation question is to ask: what particular \(\mu, \sigma\) describe the data \(D\)?

### Example 2: Number of heads out of \(N\) flips of a coin (Binomial)

Every time we flip a coin, we get a heads with probability \(p\), or a tails with probability \(1-p\). We flipped the coin \(N=30\) times. #heads is a discrete random variable and it follows the binomial distribution.

We got 25 heads. We didn't know whether the coin was fair or not. So we can ask: what particular \(p\) describes the number of heads?

### Notation

We use \(\theta\) to denote a hypothesis on how the data is generated. It is a set of particular values of the parameters of the underlying probability distribution. It may contain one or more parameters.

For dog weights (normal): \(\rm \theta=\{\mu=20\ kg,\sigma=4\ kg\}\);

For an unbiased coin (binomial): \(\theta=\{p=0.5\}\).

The big question we'd like to answer:

**Given data \(D\), what is the most probable \(\theta\)?**

### Reviewing Bayes rule

To pick out a most probable \(\theta\), let's first define a set of hypotheses. Suppose we have a set of \(M\) hypotheses, \(\theta_1, \dots, \theta_M\).

Given data, how probable is a particular hypothesis \(\theta_k\)?

This sounds like a conditional probability... which means it's time to stare at Bayes rule:

\[
P(\theta_k \mid D) = \frac{P(D \mid \theta_k)\, P(\theta_k)}{P(D)}
\]

Let's review the terms:

- \(P(\theta_k \mid D)\) is the **posterior**. How probable is the hypothesis, *post* seeing the data?
- \(P(D\mid \theta_k)\) is the **likelihood**. How probable is the data, given that the hypothesis is true?
- \(P(\theta_k)\) is the **prior**. How probable is the hypothesis, *prior* to seeing the data?
- \(P(D)\) is the **marginal**, also called a normalization factor. How probable is the data under all possible hypotheses?

### Reviewing marginalization

Here we've assumed there's a finite number of hypotheses, so the marginal is a sum: \(P(D) = \sum_{j=1}^M P(D \mid \theta_j)\, P(\theta_j)\). If we have an infinite number, you could see the marginal written as an integral instead: \(P(D) = \int P(D \mid \theta')\, P(\theta')\, d\theta'\).

### Practice Problem: Mosquito's Blood Preference

ABO blood types exist in different frequencies in the human population. (At least some say that) they also attract mosquitos differently.

| Blood type | Frequency in population | Mosquito landing chance |
|---|---|---|
| A | 0.42 | 0.453 |
| B | 0.10 | 0.569 |
| AB | 0.04 | 0.480 |
| O | 0.44 | 0.785 |

Compute the following terms.

- \(P(\text{O type})\)
- \(P(\text{AB type})\)
- \(P(\text{Landing}\mid \text{B type})\)
- \(P(\text{Not landing}\mid \text{A type})\)
- \(P(\text{Landing})\)
- \(P(\text{O type}\mid\text{Landing})\)
- \(P(\text{AB type}\mid\text{Not landing})\)
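As a quick check on these computations, here's a minimal Python sketch of the marginalization and Bayes-rule steps (the dictionaries just transcribe the table above):

```python
# Priors P(type) and likelihoods P(Landing | type), from the table above
prior = {"A": 0.42, "B": 0.10, "AB": 0.04, "O": 0.44}
p_land = {"A": 0.453, "B": 0.569, "AB": 0.480, "O": 0.785}

# Marginal: P(Landing) = sum over types of P(Landing | type) * P(type)
p_landing = sum(p_land[t] * prior[t] for t in prior)

# Posteriors via Bayes rule
p_O_given_landing = p_land["O"] * prior["O"] / p_landing
p_AB_given_not = (1 - p_land["AB"]) * prior["AB"] / (1 - p_landing)

print(f"P(Landing)               = {p_landing:.3f}")
print(f"P(O type | Landing)      = {p_O_given_landing:.3f}")
print(f"P(AB type | Not landing) = {p_AB_given_not:.3f}")
```

The remaining terms in the problem follow the same pattern: read a prior or likelihood straight off the table, or combine them with Bayes rule.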

### Comparing two hypotheses

I forgot my exact blood type, but I'm sure it is either A or O. And I was bitten by a mosquito today. How is this reflected in the posteriors?

To compare these two hypotheses, I can take the ratio of their posterior probabilities: \(\frac{P(\text{A type} \mid \text{Landing})}{P(\text{O type} \mid \text{Landing})} = \frac{P(\text{Landing} \mid \text{A type})\, P(\text{A type})}{P(\text{Landing} \mid \text{O type})\, P(\text{O type})}\). The denominator \(P(\text{Landing})\) cancels out.

Generally speaking, when we look for the hypothesis with the highest \(P(\theta \mid D)\), we can ignore the denominator, since it is the same for every hypothesis.

- If we have some prior beliefs \(P(\theta)\), the \(\theta\) that maximizes \(P(D\mid \theta)P(\theta)\) is the **maximum a posteriori (MAP)** estimate.
- If we further assume a uniform prior \(P(\theta)\) over all \(\theta\), then \(P(\theta \mid D) \propto P(D\mid \theta)\). Now, the \(\theta\) that maximizes \(P(D\mid \theta)\) (a.k.a. the likelihood) is the **maximum likelihood estimate (MLE)**.

## Concepts behind likelihood \(P(D \mid \theta)\) with binomial

Binomial process: we run \(n\) trials, with each trial having a success probability \(p\), and we count how many successes we get out of the \(n\) trials.

The probability mass function, i.e. the probability of getting \(k\) successes:

\[
P(k \mid n, p) = \binom{n}{k} p^k (1-p)^{n-k}
\]

The likelihood \(P(k=25 \mid n=30,p)\) depends on \(p\). We can do MLE analytically in this binomial case. A convenient trick that comes up frequently with this kind of problem is to maximize the log likelihood instead of the likelihood. This is okay to do because the logarithm is a monotonically increasing function (the \(x\) that maximizes \(f(x)\) also maximizes \(\log f(x)\)). Exponents come up a lot in statistics formulas, and taking the log helps to separate them out.

The log likelihood:

\[
\log L(p) = \log \binom{n}{k} + k \log p + (n-k) \log(1-p)
\]

Now it's a little easier to take the derivative with respect to \(p\):

\[
\frac{d \log L}{dp} = \frac{k}{p} - \frac{n-k}{1-p}
\]

Setting it to zero to find our maximizer, \(\hat{p}\) (you should check it's a maximum by taking the second derivative too):

\[
\frac{k}{\hat{p}} - \frac{n-k}{1-\hat{p}} = 0 \quad \Rightarrow \quad \hat{p} = \frac{k}{n}
\]

Rather unsurprisingly, if we observe \(k\) heads out of \(n\) flips, the maximum likelihood estimate of the probability of heads is just \(k/n\).

We can also plot \(\log L(p)\) against \(p\). Clearly, there exists a maximum.

If we then plot the probability mass function using this MLE \(\hat{p}\), the curve indeed overlaps best with the observed data.
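A minimal numerical sketch of this maximization in NumPy (the grid resolution here is an arbitrary choice, and the additive constant \(\log \binom{n}{k}\) is dropped since it doesn't affect the argmax):

```python
import numpy as np

# k = 25 heads out of n = 30 flips, as in the example above
n, k = 30, 25

# Grid of candidate p values (endpoints avoid log(0); resolution 0.001)
p_grid = np.linspace(0.01, 0.99, 981)

# Log likelihood up to the constant log C(n, k)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)

# The grid maximizer lands right next to the analytic answer k/n
p_hat = p_grid[np.argmax(log_lik)]
print(p_hat, k / n)
```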

A few things to consider:

- Does this mean the null hypothesis (fair coin) is invalidated?
- Does this mean \(p=k/n\) has to be the correct model?
- What if we had (non-uniform) prior beliefs \(P(\theta)\)? Maybe we really tend to trust that the coin is fair...?

## MLE for the normal distribution

Now let's switch gears to something a little more involved. Suppose we observe \(n\) data points \(x_1, \dots, x_n \sim N(\mu, \sigma^2)\). Just as we did for the binomial distribution, let's try to maximize the likelihood of the observed data with respect to the parameters that describe it.

For a single point \(x\), the normal **probability density function** (pdf for continuous distributions; the analog of the pmf for discrete distributions) is

\[
f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\]

The joint probability of \(n\) points, assuming they are all independently, identically distributed, is the product:

\[
L(\mu, \sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)
\]

At this point it will be convenient to take the log again:

\[
\log L(\mu, \sigma) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2
\]

#### MLE of \(\mu\)

Taking the derivative with respect to \(\mu\) and setting it to zero:

\[
\frac{\partial \log L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) = 0
\]

Multiplying both sides by \(\sigma^2\) and rearranging gives us our likelihood-maximizing \(\mu\):

\[
\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}
\]

The maximum likelihood estimate of the mean given some observed \(x_i\)'s is the sample mean.

#### MLE of \(\sigma^2\)

Taking the derivative with respect to \(\sigma^2\):

\[
\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i-\mu)^2
\]

Skipping some steps here (it is a good exercise to do these calculations by hand), we arrive at our estimate:

\[
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2
\]

This expression is not too surprising -- it's the average squared distance from the mean over all data points.
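A small simulation sketch of these two estimators (the \(N(20, 4^2)\) "husky weight" parameters are made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical husky weights: 10,000 draws from N(mu=20 kg, sigma=4 kg)
x = rng.normal(loc=20.0, scale=4.0, size=10_000)

mu_hat = x.mean()                     # MLE of mu: the sample mean
var_hat = np.mean((x - mu_hat) ** 2)  # MLE of sigma^2: average squared deviation

print(mu_hat, var_hat)  # both should land close to 20 and 16
```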

### Estimation of Gaussian parameters only given data... the road to the T-distribution

Let's say we observe some small number of data points \(x_1, \dots, x_n\) (around \(n=5\)) that we assume are generated by a Gaussian process, \(x_1, \dots, x_n \sim N(\mu, \sigma^2)\), and we'd like to use the data to estimate \(\mu\) and \(\sigma\).

From our analysis above,

- the maximum likelihood estimator for the mean is the sample mean: \(\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\)
- the maximum likelihood estimator for the variance is: \(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2\)

Wait a second. The MLE for the variance includes \(\mu\), something we're trying to estimate with \(\bar{x}\)!

If we plug in \(\bar{x}\) for \(\mu\) to get a new estimator for the variance, \(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2\), it turns out that this estimator is **biased**. An estimator \(\hat{\theta}\) of a parameter \(\theta\) is a function of the data, and it is biased if the expected value of \(\hat{\theta}\) over all possible data is not exactly \(\theta\). A small correction is needed to yield an unbiased estimator of the population variance: \(\frac{1}{n-1}\) rather than \(\frac{1}{n}\). This is the Bessel correction -- you can read more about it here and here.

So, given some data, an unbiased estimator of the population variance is the (Bessel-corrected) sample variance, \(S^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\).
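The bias is easy to see by simulation. In this sketch we repeatedly draw small samples from a normal with known \(\sigma^2\) (the parameter values are arbitrary) and average the two estimators:

```python
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0   # sigma^2 of the generating normal (arbitrary choice)
n = 5            # small samples, where the bias is most visible

biased, unbiased = [], []
for _ in range(200_000):
    x = rng.normal(0.0, np.sqrt(true_var), size=n)
    ss = np.sum((x - x.mean()) ** 2)
    biased.append(ss / n)          # plug-in MLE: divides by n
    unbiased.append(ss / (n - 1))  # Bessel-corrected sample variance

# The plug-in estimator averages to roughly (n-1)/n * sigma^2 = 3.2,
# while the corrected one averages to roughly sigma^2 = 4.0
print(np.mean(biased), np.mean(unbiased))
```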

## Hypothesis testing on \(\bar{x}\)

Q: How likely is it that we'd get this particular \(\bar{x}\) value?

\(\bar{x}\) has a variance of \(\frac{\sigma^2}{n}\) (refer to the detailed Jupyter notebook for a proof), so the "standard error of the mean" is \(\frac{\sigma}{\sqrt{n}}\). \(\bar{x}\) is then turned into a standard score:

- If \(\sigma\) is known, this is a **Z score**:

\[
Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}
\]

The Z score also follows a normal distribution.

- If \(\sigma\) is unknown, we estimate it with \(s\) (the sample standard deviation). This becomes a **T score**:

\[
T = \frac{\bar{x} - \mu}{s/\sqrt{n}}
\]

with \(n-1\) degrees of freedom, and it no longer follows the normal distribution. Instead, it follows a Student's t distribution with \(n-1\) degrees of freedom.

### Distributions of Z and T scores

(Figure comparing the Z and T distributions: https://www.scribbr.com/statistics/t-distribution/)

- T distribution has fatter tails. It allows for \(\bar{x}\) to be "far away" from \(\mu\) because we had to estimate \(\sigma^2\) from the data.
- We could have estimated too small a \(\sigma^2\), which means the T score would be larger than it should be. Hence, more probability of big T scores.
- At large \(n\), we estimate \(\sigma^2\) well, so T distribution converges to Z distribution.

So T distributions can arise from estimating an unknown \(\sigma^2\) from data. In the pset, we will marginalize over many potential values of \(\sigma^2\).
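The fat tails are easy to demonstrate by simulation. This sketch draws many size-\(n=5\) samples (the parameter values are arbitrary), computes a T score for each using the true \(\mu\), and counts how often \(|T| > 3\); for a Z score this would happen only about 0.27% of the time:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 0.0, 1.0, 5  # arbitrary true parameters; n is small on purpose
reps = 100_000

t_scores = np.empty(reps)
for i in range(reps):
    x = rng.normal(mu, sigma, size=n)
    s = x.std(ddof=1)  # Bessel-corrected estimate of sigma
    t_scores[i] = (x.mean() - mu) / (s / np.sqrt(n))

# With n - 1 = 4 degrees of freedom, |T| > 3 happens roughly 4% of the
# time -- an order of magnitude more often than for a standard normal
print(np.mean(np.abs(t_scores) > 3))
```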

## Concepts for pset

Inferring \((\mu,\sigma)\) of a normal distribution, where the true values come from a finite set of choices. For each candidate \((\mu,\sigma)\), calculate the posterior \(P(\mu,\sigma \mid x_1, \ldots, x_n)\).
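As a rough sketch of this kind of grid posterior (the candidate values, uniform prior, and simulated data below are all made up for illustration, not the pset's actual setup):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
data = rng.normal(10.0, 2.0, size=50)  # pretend data with true (mu, sigma) = (10, 2)

# Finite set of candidate hypotheses theta = (mu, sigma)
candidates = [(mu, sigma) for mu in (8.0, 10.0, 12.0) for sigma in (1.0, 2.0, 4.0)]

# With a uniform prior, the posterior is proportional to the likelihood;
# work in log space for numerical stability
log_post = np.array([norm.logpdf(data, mu, sigma).sum() for mu, sigma in candidates])
post = np.exp(log_post - log_post.max())
post /= post.sum()  # normalizing by the sum plays the role of P(D)

for (mu, sigma), p in zip(candidates, post):
    print(f"mu={mu:4.1f}, sigma={sigma:3.1f}: posterior {p:.3f}")
```

With enough data, the posterior mass concentrates on the candidate closest to the true parameters.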