MCB112: Biological Data Analysis (Fall 2017)

week 05:

the trouble with p-values

it all started so well

Given some observed data, it would be useful to know how surprising they are. Are these data consistent with what I’d expect by chance? If not, something more interesting might be going on.

Take Laplace’s question of the birth rate of boy vs. girls in Paris, for example. He observed 251,527 boys in 493,472 births. Is this 0.51 frequency surprisingly different from what we’d expect by chance?

To be quantitative, we have to specify exactly what chance means. We formulate a hypothesis called the “null hypothesis”, , so that we can calculate , the probability distribution over different data outcomes, given the null hypothesis.

In the Laplace example, the obvious null hypothesis is that boys and girls are equiprobable: . The possible data outcomes, in total births, are boys. The probability of any given count of boys is a binomial distribution .

As we’ve already noted last week, the probability of any specific outcome might be absurdly small, especially as gets large. A specific outcome can be unlikely (in the sense that you wouldn’t have bet on exactly that outcome beforehand), but unsurprising (in the sense that it’s one of many outcomes that are consistent with the null hypothesis). If , the probability of getting exactly 50% boys () is tiny, 0.001. But the probability of getting a number within 1000 of 246736 is more than 99%. (Which I just calculated using the binomial’s cumulative distribution function in Python, as you’re about to see below.)

Indeed, we can calculate a range of data that are consistent with the null hypothesis, even if any particular one outcome is unlikely, and then ask if our observed data is outside that plausible range. This requires that the data can be arranged in some sort of linear order, so that it makes sense to talk about a range, and about data outside that range. That’s true for our counts of boys , and it’s also true of a wide range of “statistics” that we might calculate to summarize a dataset (for example, the mean of a bunch of observations ).

For example, what’s the probability that we would observe boys or more in Laplace’s problem? For this, we use a cumulative probability function (CDF), the probability of getting a result of or less:

Our boy count is discrete (and only defined on ) so , which we can get in Python’s SciPy stats.binom module:

   c = 251527
   n = 493472
   p = 0.5
   1 - binom.cdf(c-1,n,p)

which gives us 1.1e-16.

computers are annoying

Ha! This number is wrong! The math is right, but computers are annoying. The number is so close to zero, we get garbage, from an immense floating-point rounding error. When numbers get very small, we have to worry about pesky details.

On a machine, in floating point math, for some small threshold . In double-precision floating-point math (what Python uses internally), the machine is 1.1e-16. This is the smallest relative unit of magnitude that two floating point numbers can differ by. The result of binom.cdf() is so close to 1 that the machine can’t keep track of the precision; it just leaves its return value at one epsilon less than 1, and gives us .

We have to make sure that we never try to represent if we know might be small; we need to use instead. Here that means we want SciPy to tell us 1 - CDF instead of the CDF. That’s got a name: the survival function, .sf() in SciPy. Let’s try again:

   c = 251527
   n = 493472
   p = 0.5

Now we get 1.2e-42, which is right. There’s a tiny probability that we’d observe 251,527 boys or more, if the boy-girl ratio is 50:50.

definition of a p-value

A p-value is the probability that we would have gotten a result at least this extreme, if the null hypothesis is true.

We get the p-value from a cumulative probability function , so it has to make sense to calculate a CDF. There has to be an order to the data, so “more extreme” is defined. Usually this means we’re representing the data as a single number: either the data is itself a number (, in the Laplace example), or a summary statistic like a mean.

For example, it wouldn’t make sense to talk about the p-value of the result of rolling a die times. The observed data are six values , and it’s not obvious how to order them. We could calculate the p-value of observing sixes out of rolls, though. Similarly, it wouldn’t make sense to talk about the p-value of a specific poker hand, but you could talk about the p-value of drawing a pair or better, because the value of a poker hand is orderable.

p-values are uniformly distributed on (0,1)

If the data were actually generated by the null hypothesis, and you did repeated experiments, calculating a p-value for each observed data sample, you would see that the p-value is uniformly distributed. By construction – simply because it’s a cumulative distribution! 5% of the time, if the null hypothesis is true, we’ll get a p-value of ; 50% of the time, you’ll get a p-value .

Understanding this uniform distribution is important because sometimes people say that a result with a p-value of 0.7 is “less significant” than a result with a p-value of 0.3, but in repeated samples from the null hypothesis, you will obtain the full range of possible p-values from 0..1 in a uniform distribution. Seeing a p-value of 0.7 is literally equally probable as seeing a p-value of 0.3, or 0.999, or 0.001. Indeed, seeing a uniform distribution is a good check that you’re calculating p-values correctly.

null hypothesis significance testing

P-values were introduced in the 1920’s by the biologist and statistician Ronald Fisher. He intended them to be used as a tool for detecting unusual results:

“Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”

There are three important things in this passage. First, it introduced as a standard of scientific evidence. Second, Fisher recognized that this was a “low standard”. Third, by saying “rarely fails”, Fisher meant it to be used in the context of repeated experiments: a true effect should reproducibly be distinguishable from chance.

Many fields of science promptly forgot about the second two points, while adopting as a hard standard of scientific evidence. A result is said to be “statistically significant” if it achieves . Sometimes, contrary to everything Fisher intended, and contrary to logic, a single result with is publishable in some fields (cough cough, psychology TED talks, cough cough). How this travesty happened, nobody quite seems to know.

Nowadays there’s a backlash. Some people want to change 0.05 to 0.005, which rather misses the point. (Here, have a band-aid for your sucking chest wound.) Some people want to ban P-values altogether.

P-values are useful, as long as you’re using them the way Fisher intended. It’s useful to know when the observed data aren’t matching well to an expected null hypothesis, alerting you to the possibility that something else is going on. But 5% is a low standard – even if the null hypothesis is true, 5% of the time you’re going to get results with . You need to see your unusual result reproduce consistently before you’re going to believe in it.

Where you get into trouble is when you try to use a p-value as more than just a rule-of-thumb filter for potentially interesting results:

Using a p-value to test whether your favorite hypothesis is supported by the data is fundamentally illogical. A p-value test never even considers ; it only looks at the null hypothesis . “Your model is unlikely; therefore my model is right!” is not the way logic works.

multiple testing correction

Suppose you do test one million things. What do you need your p-value to be (per test), to decide that any positive result you get in tests is statistically significant?

Well, you expect false positives. The probability of obtaining one or more false positives is (by Poisson) . This is still a p-value, but with a different meaning, conditioned on the fact that we did tests: now we’re asking, what is the probability that we get result at least this extreme (at least one positive prediction), given the null hypothesis, when we do independent experiments? For small , , so the multiple-test-corrected p-value is approximately . That is, multiply your per-test p-value by the number of tests you did to get a “corrected p-value”. This simple idea is called a Bonferroni correction. It’s very conservative.

probability of what now?

Rather than the names of things, I’d rather that you remembered the principles. A lot of confusion stems from not knowing what probability we’re talking about. Someone will say “with a p of < 0.05”. Someone like me will say, probability of what? It’s actually sort of crazy that we just write down “p” like everyone’s supposed to know what it means. It’s important for me to understand what your null hypothesis is, for one thing. Also, there’s at least two major different probabilities in play, because of the multiple testing issue.

The per-test p value is (something like) : the probability that your test statistic would have been at least as extreme as , if the null hypothesis were true.

The “multiply corrected” p-value is (something like) : the probability that, if you ran independent samples from the null hypothesis and counted the number of times a sample gave a score exceeding threshold , you get at least one. Now the “test statistic” that we’re testing for extreme-ness is the count of tests being declared positives.

If you don’t explain this in your work, your reader doesn’t know what you’re talking about when you say “p < 0.05”.

One way to make it clear without a lot of jargon is just to explain where your expected false positives are. If you say you ran a screen on samples at a threshold of , I’m looking for where in your paper you put the 5% false positive predictions that you expected: there should’ve been about of them. I’ll feel better that you know what you’re doing if you acknowledge that they’re in your results somewhere. That is, don’t talk in terms of “statistical significance”; talk in terms of the number of expected false positives. If you made 100 positive predictions, you can say “under null hypothesis (whatever it is), we would expect xxx of these to be false.”

the false discovery rate (FDR)

One reason that the Bonferroni correction is conservative is the following. Suppose you run a genome-wide screen and you make 80,000 predictions. Do you really need all of them to be “statistically significant” on their own? That is, do you really need to know that the probability of even one false positive in that search is or whatever? More reasonably, you might say you’d like to know that 99% of your 80,000 results are true positives, and 1% or less of them are false positives.

Suppose you tested a million samples to get your 80,000 positives, at a per-test p-value threshold of . By the definition of the p-value you expected up to 50,000 false positives: maybe all the samples are in fact from the null hypothesis, and at a significance threshold , you expect 5% false positives. So if you trust your numbers, the remaining 30,000 samples are expected to be true positives. You could say that the expected fraction of false positives in your 80,000 positives is 50000/80000 = 62.5%.

This is called a false discovery rate calculation – specifically, it is called the Benjamini-Hochberg FDR.

The false discovery rate (FDR) is the proportion of your called “positives” that are expected to be false positives, given your p-value threshold, the number of samples you tested, and the number of positives that were “statistically significant”.

what Bayes says about p-values

A good way to see the issues with using p-values for hypothesis testing is to look at a Bayesian posterior probability calculation. Suppose we’re testing our favorite hypothesis against a null hypothesis , and we’ve collected some data . What’s the probability that is true? That’s its posterior:

To put numbers into this, we need to be able to calculate the likelihoods and , and we need to know how the priors and – how likely and were before the data arrived.

What does p-value testing give us? It gives us : the cumulative probability that some statistic of the data has a value at least as extreme as , under the null hypothesis.

We don’t know anything about how likely the data are under our hypothesis . We don’t know how likely or were in the first place. And we don’t even know , really, because all we know is a related cumulative probability function of and the data.

Therefore it is utterly impossible (in general) to calculate a Bayesian posterior probability, given a p-value – which means, a p-value simply cannot tell you how much evidence your data give in support of your hypothesis .

(This was the fancy way of saying that just because the data are unlikely given does not logically mean that must be true.)

what Nature (2014) said about p-values

The journal Nature ran a commentary article in 2014 called “Statistical errors”, about the fallacies of p-value testing. The article showed one figure, reproduced to the right. The figure shows how a p-value corresponds to a Bayesian posterior probability, under three different assumptions of the prior odds, for or . It shows, for example, a result of might be “very significant”, but the posterior probability of the null hypothesis might still be quite high: 30%, if the null hypothesis was pretty likely to be true to begin with. The commentary was trying to illustrate the point that the p-value is not a posterior probability, and that a “significant” p-value does not move the evidence as much as you might guess.

Figure from Nuzzo, "Statistical errors", Nature (2014), showing how a P-value affects a posterior probability. ('Open image in new tab' to embiggen.)

But hang on - I just finished telling you that it is utterly impossible (in general) to calculate a posterior probability from a p-value, and here we have Nature doing exactly that. What’s going on?

(The p-value literature is like this: a thousand righteous voices declaiming what’s wrong with p-values, while making things even more confusing by leaving out important details and/or making subtle errors. And I’m probably doing the same thing, even as I try not to!)

The key detail in the Nature commentary flashes by in a phrase – “According to one widely used calculation…”, and references a 2001 paper from statistican Steven Goodman. Let’s look at that calculation and see if we can understand it, and if we agree with its premises.

First we need some background information on statistical tests involving Gaussian distributions.

differences of means of Gaussian distributions

Suppose I’ve collected a dataset with a mean value of , and suppose I have reason to expect, in replicate experiments, that the mean is normally (Gaussian) distributed with a standard error SE. (I’ll explain “standard error” sometime soon - for now, it’s just the standard deviation of our observed means, in repeated experiments.) My null hypothesis is that the true mean is . I want to know if my observed is surprisingly far from – that is: what is the probability that I would have observed an absolute difference at least this large, if I were sampling from a Gaussian of mean and standard deviation ?

A Gaussian probability density function is defined as:

(And where’s this come from? Magic? Turns out that we can derive the Gaussian from first principles with a little calculus, if we assume we seek the least informative – i.e. maximum entropy – distribution that is constrained to have some mean and variance . If instead we only constrain to a mean , we derive the exponential distribution. We’ll leave this aside, and maybe get back to it later, since it’s fun to see.)

A useful thing to notice about Gaussian distributions is that they’re identical under a translation of and , and under a multiplicative rescaling of . The probability only depends on the ratio : that is, on the number of standard deviations away from the parametric mean. So, if I calculate a so-called Z-score:

then I can talk in terms of a simplified standard normal distribution:

This is a very useful simplification - among other things, it’ll be easier to remember things in units of “number of standard deviations away from the mean”, and will help you develop general, quantitative intuition for Gaussian-distributed quantities.

A p-value is then the probability that we get a Z score at least as extreme as :

We might be interested not just in whether our mean is surprisingly larger than , but smaller too. That’s the difference between what statisticians call a “one-tailed” test versus a “two-tailed” test. In a one-tailed test, I’m specifically testing whether , for example; in a two-tailed test, I’m testing the absolute value, . The Gaussian is symmetric, so .

We get from the Gaussian cumulative distribution function (CDF):

(Because this is a continuous function, asymptotically; there’s asymptotically zero mass exactly at .)

There’s no analytical expression for a Gaussian CDF, but it can be computed numerically. In Python, the scipy.stats.norm module includes a CDF method and more:

   from scipy.stats import norm
   z = 1.96
   1 - norm.cdf(z)      # gives one-tailed p-value P(Z >= 1.96) = 0.025
   norm.sf(z)           # 1-CDF(x) is the "survival function" 

   p = 0.05             # you can invert the survival function with `.isf()`:
   norm.isf(p)          # given a 1-tailed p-value P(Z > z), what Z do you need? (gives 1.64)
   norm.isf(p/2)        #    or for a two-tailed P(|Z| > z)  (gives 1.96)

Now we’ve got enough background to get back to the calculation that Goodman makes, that the Nature commentary is based on.

back to Goodman’s calculation

Crucially, for a Gaussian-distributed statistic, if I tell you the p-value, you can calculate the Z-score; and given the Z-score, you can calculate the likelihood . Thus in this case we can convert a P-value to a likelihood of the null hypothesis for our mean , in terms of :

For example, a p-value of for a two tailed test implies Z = 1.96. Roughly speaking, 5% of the time, we expect a result more than 2 standard deviations away from the mean on either side.

So now we’ve got . That’s one of our missing terms dealt with, in the Bayesian posterior probability calculation. How about ? That’s the tricksy one.

Observe that the best possible hypothesis is that its mean is exactly the observed mean of the data . If so, then , thus , so:

(In part it’s an upper bound because it’s cheating to choose this as our after we’ve looked at the data; we’d actually have to look at a range of possible values for , with a prior distribution. We’ll come back to this.)

Now we can calculate a bound on the likelihood odds ratio, the “Bayes factor” for the support for the null hypothesis:

This represents how much the relative odds in favor of the null hypothesis changes after we observed the data.

We still need to deal with the prior probabilities and . Recall that we can rearrange the posterior in terms of the Bayes factor and the prior odds:

Since we have a lower bound on the Bayes factor, we’re also going to get a lower bound on the posterior probability of the null hypothesis.

Now we’re ready to plug numbers in, to see what Goodman is doing.

Suppose and are 50:50 equiprobable a priori, and you observe a mean that is standard deviations away from the null hypothesis’ . The two-tailed P-value is 0.05. The minimum Bayes factor is . The posterior probability of the null hypothesis is no less than %. Obtaining our “significant” P-value of 0.05 only moved our confidence in the null hypothesis from 50% to 13%.

Now suppose that the null hypothesis is more likely a priori, with prior probability 95%. (In many biological experiments, we’re usually going to observe unsurprising, expected results.) Now the posterior probability of the null is %. Our “significant” P-value hardly means a thing – it’s still 74% probable that the null hypothesis is true. If this is how we did science – if we published all our “statistically significant” results, with only the p-value as evidence – 74% of our papers would be wrong.

This is the gist of it, though the numbers in Nuzzo’s 2014 Nature commentary are actually based on a second calculation that’s one step more sophisticated than this. Instead of saying that has exactly (which, as we said, is somewhat bogus, choosing your hypothesis to test after looking at the data), you can do a version of the calculation where you say that any is possible for , with a prior distribution symmetrically decreasing around . Under this calculation, the probability is lower, so the bound on the Bayes factor is higher – so the posterior probability of the null hypothesis is decreased even less, for a given p-value. Thus Nuzzo’s figure says “29% chance of no real effect” for the p=0.05/50:50 case and “89% chance of no real effect” for the p=0.05/95:5 case, instead of the 13% and 74% I just calculated. Table 1 in the Goodman paper shows both variants of the calculation, for a range of p-values.


Thus the exact numbers in Nuzzo’s 2014 Nature commentary turn out to depend on very specific assumptions – primarily, that the p-value comes from a hypothesis test where we’re asking if a Gaussian-distributed mean is different from an expected value. There are many other situations in which we would use p-values and hypothesis testing, and it remains true that in general, you cannot convert a p-value to a Bayesian posterior probability.

But the general direction of Goodman’s and Nuzzo’s arguments is correct, and it is useful to see a worked example. A “significant” or even a “highly significant” p-value does does not mean that the null hypothesis has been disproven – it can easily remain more probable than the alternative hypothesis.

further reading