2018, by Kevin Mizes

You can also download this page in jupyter notebook format.

Let's go more in depth into some of these p-value corrections and alternative statistics, and try to understand how they are calculated and why we might use them, while avoiding some of the scarier math notation that shows up in these papers and on Wikipedia.

As Sean has gone through in his lectures, we need some way to account for errors in our tests. With the false discovery rate (FDR), we can take into account the fact that a certain number of our tests will be incorrectly discovered. The FDR is defined as the proportion (not a rate, despite the name) of false discoveries among the total number of discoveries, where a discovery is a test that rejects the null hypothesis.

We collect the data below (reproduced from lecture)

| | Condition 1 - Null hypothesis true | Condition 2 - Alternative hypothesis true | Total |
|---|---|---|---|
| Test is significant (discovery) | $x$ | $r-x$ | $r$ |
| Test is not significant (no discovery) | $n-x$ | $m-(r-x)$ | $m+n-r$ |
| Total | $n$ | $m$ | $m+n$ |

By definition, we have:

$$\text{p-value} = \frac{x}{n}$$ (the probability of rejecting the null hypothesis when it is true). We also have $$FDR = \frac{x}{r}$$

Recall that these variables can also go under the names true positive (TP), false positive (FP), true negative (TN), and false negative (FN).

| | Condition 1 - Null hypothesis true | Condition 2 - Alternative hypothesis true |
|---|---|---|
| Test is significant (discovery) | FP | TP |
| Test is not significant (no discovery) | TN | FN |

Which then tells us $FDR = \frac{FP}{FP+TP}$

For example, let's say we have some heart rate data and want to detect whether someone has atrial fibrillation (Afib). Say our test is 95% accurate, but the lifetime risk of Afib is ~25%. We have data from 1000 heartbeats, and our table looks something like:

| | Healthy heartbeat | Afib | Total |
|---|---|---|---|
| Test detects Afib | 38 | 225 | 263 |
| Test detects healthy | 712 | 25 | 737 |
| Total | 750 | 250 | 1000 |

The FDR is $38/263 \approx 0.144$.

In the Bayesian interpretation, we simply add a conditional dependence: what is the posterior probability that the null hypothesis is true ($H_0=1$), given the test returns a discovery? We just use Bayes' rule:

$$P(H_0=1 | T=+) = \frac{P(T=+ | H_0=1) P(H_0=1)}{P(T=+)} = \frac{p_i \; P(H_0=1)}{p_i \; P(H_0=1) + (1-p_i) (1-P(H_0=1))}$$

By definition, our p-value is the probability our test is positive given the null hypothesis is true. Let's plug in numbers for heart disease.

$$ \begin{aligned} P(Healthy \; |\; Test = Afib) &= \frac{P(Test = Afib \;|\; Healthy) P(Healthy)}{P(Test = Afib \;|\; Healthy) P(Healthy) + P(Test = Afib \;|\; Not \;healthy) P(Not\; healthy)}\\ &= \frac{38/750 * 750/1000}{38/750 * 750/1000 + 225/250 * 250/1000} = 0.144 \end{aligned} $$
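We can sanity-check this arithmetic in a couple of lines (a minimal sketch using the counts from the Afib table above):

```python
# Confusion-matrix counts from the Afib table above
fp, tp = 38, 225                    # healthy beats flagged as Afib; Afib beats flagged as Afib
fdr = fp / (fp + tp)                # FDR = FP / (FP + TP)

# Same number via Bayes' rule: P(healthy | test says Afib)
p_pos_given_healthy = 38 / 750      # false positive rate of the test
p_pos_given_afib    = 225 / 250     # sensitivity
p_healthy           = 750 / 1000    # prior probability of a healthy heartbeat
posterior = (p_pos_given_healthy * p_healthy) / (
    p_pos_given_healthy * p_healthy + p_pos_given_afib * (1 - p_healthy))

print(round(fdr, 3), round(posterior, 3))  # 0.144 0.144
```

The two routes give the same number because the table proportions are exactly the conditional probabilities Bayes' rule combines.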

Benjamini and Hochberg formalized their false discovery rate as the expected value of $x/R$, the proportion of false discoveries among total discoveries. However, $x/R$ is undefined when there are no discoveries ($R=0$), so more formally they condition on $R>0$ in the expectation:

$$ FDR = E[\frac{x}{R} \; | \; R > 0] P(R > 0) $$

However, this is distinct from what another researcher, John Storey, defines as the positive FDR:

$$ pFDR = E\left[\frac{x}{R} \; \middle| \; R > 0\right]$$

The first quantity is the overall rate at which false discoveries occur; the second is the proportion of discoveries that are false, given that at least one discovery is made.

For example, you can set FDR = 0.1, so that only 10% of the discoveries that occur are false; but if the probability of even having a discovery is 0.5, then the corresponding pFDR is $0.1/0.5 = 0.2$ (the proportion of discoveries that are false, given some discovery occurred).

With most genomics data, $P(R>0)\approx 1$, so the two estimates are approximately equal, and I won't discuss the distinction further.

An issue arises when we have multiple hypotheses and a number of p-values. What if we tested for 1000 different heart diseases with this heart rate data? Some p-values (by definition) should be less than our significance threshold, so we need to control for the fact that we will see a number of false positives (sometimes called the look-elsewhere effect, which can lead to misleading news stories about how chocolate is good for you).

For multiple tests, we control the familywise error rate (FWER). For some significance threshold $\alpha$ (Fisher set it at 0.05), the Bonferroni correction for $m$ tests accepts or rejects each test at significance level $\frac{\alpha}{m}$. Equivalently, our adjusted p-value becomes:

$$\widetilde{p} = m * p$$
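A minimal sketch of this adjustment (the `bonferroni` helper is hypothetical; adjusted p-values are capped at 1, since $m \cdot p$ can exceed it):

```python
def bonferroni(pvals):
    """Bonferroni-adjust p-values: p~ = m * p, capped at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

print([round(q, 3) for q in bonferroni([0.001, 0.02, 0.4])])  # [0.003, 0.06, 1.0]
```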

For example, if we test 20 heart rates, what is the probability that, due to chance, we'll observe at least one case of Afib from healthy data? We've calibrated our test so that it returns a false positive 5% of the time (95% accurate).

$$ \begin{aligned} P(\geq 1 \; discovery \;|\; healthy_{1:20}) &= 1-P(no \; discoveries \;|\; healthy_{1:20})\\ &= 1 - (0.95)^{20} = 0.641 \end{aligned} $$

With the Bonferroni correction, we must recalibrate our test so that the false positive only occurs 5%/20 = .25% of the time. (99.75% accurate!)

$$ \begin{aligned} P(\geq 1 \; discovery \;|\; healthy_{1:20}) &= 1-P(no \; discoveries \;|\; healthy_{1:20})\\ &= 1 - (0.9975)^{20} = 0.0488 \end{aligned} $$

We can look at this more generally too. If we want to control for the fact that out of $n$ tests, there is only a 5% chance that there will be an accidental false discovery, we would write down:

$$ \begin{aligned} P(\geq 1 \; discovery \;|\; healthy_{1:n}) &= 1-P(no \; discoveries \;|\; healthy_{1:n})\\ &= 1 - (1 - P(discovery \;|\; healthy))^n \\ &= 1 - \left(1 - \frac{0.05}{n}\right)^n \end{aligned} $$

Which we can plot below.

In [31]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

In [33]:

```
%matplotlib inline
n = np.arange(1,10)
alpha = 0.05
P = 1-(1-alpha)**n
PBonferroni = 1 - (1-alpha/n)**n
plt.plot(n,P,'r',label='no correction')
plt.plot(n,PBonferroni, 'b', label='Bonferroni')
plt.xlabel('n')
plt.ylabel('Prob')
plt.title('Prob of at least 1 discovery among healthy patients')
plt.show()
```

The Bonferroni correction, while it protects against rejecting the null hypothesis when it really is true, is conservative, and leads us to accept the null hypothesis more often when it really is false. In Holm's method, instead of scaling our significance level by the total number of tests $m$, we scale by the number of tests whose p-value is at least as large as the current one. Thus, for sorted p-values and $j=1:m$, $\tilde{p}_j = (m-j+1) \, p_j$.

What makes this a step-down procedure is how we reject the null hypotheses. Starting at $j=1$, we adjust the p-value and compare it to our significance level $\alpha$. If $\tilde{p}_j \leq \alpha$, we continue; otherwise we stop, rejecting the null hypotheses $1 \ldots j-1$ and accepting the rest.

Let's say our significance threshold is 0.05, and we run four tests, giving p-values of 0.0121, 0.0142, 0.0391, and 0.1986. Using Holm's method, we first test $p_1$: $0.0121 < 0.05/4 = 0.0125$, so we reject the null hypothesis. Since we have now rejected the hypothesis for observation 1, we only consider the remaining 3 hypotheses, and for the next comparison we check $0.0142 < 0.05/3 = 0.0167$, which is also rejected. Next we compare $0.0391$ against $0.05/2 = 0.025$; it fails, so we stop, rejecting the null for our first two tests but not for the last two.
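The four-test example can be checked directly (a sketch of the step-down loop described above, not a full implementation):

```python
pvals = [0.0121, 0.0142, 0.0391, 0.1986]   # already sorted ascending
alpha = 0.05
m = len(pvals)

rejected = []
for j, p in enumerate(pvals):              # Holm: compare p_j to alpha/(m-j)
    if p <= alpha / (m - j):
        rejected.append(j + 1)             # record the 1-indexed test number
    else:
        break                              # stop at the first failure
print(rejected)                            # [1, 2]
```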

Hochberg's method is another correction, very similar to Holm's, which basically runs the same process in the other direction.

From $j=m:1$, $\tilde{p}_j = (m-j+1) \, p_j$. If $\tilde{p}_j < \alpha$, reject hypotheses $1:j$ and stop; otherwise continue stepping down.

Going in this direction, Hochberg's method will reject at least as many null hypotheses as Holm's.

In [2]:

```
# Let's code up Holm's step-down and Hochberg's step-up procedures
p = [0.0021, 0.0044, 0.0093, 0.0134, 0.0157, 0.0389, 0.0758]  # sorted p-values
n = len(p)
alpha = 0.05

# Holm's first: step down from the smallest p-value
j = 0
while p[j] <= alpha/(n-j):
    print('Holm\'s rejects null hypothesis {} with adjusted p-value {}'.format(j+1, p[j]*(n-j)))
    j += 1
    if j >= n:
        break

# Hochberg: step up from the largest p-value
j = n
while p[j-1] > alpha/(n-j+1):
    print('Hochberg\'s accepts null hypothesis {} with adjusted p-value {}'.format(j, p[j-1]*(n-j+1)))
    j -= 1
    if j == 0:
        break

# print all adjusted p-values
padj = [p[j] * (n-j) for j in range(n)]
print('Adjusted p-values: {}'.format(padj))
```

In 1995, work began on using the FDR as a more useful metric for significance, which Sean covered in his lecture Wednesday. This results in fewer false positives, and we are looking at a fundamentally different number: the p-value conditions on the null hypothesis being true, while the FDR conditions on the test being significant.

Sean explains how we can estimate the expected value of the FDR in the lecture 5:

> Suppose you tested a million samples to get your 80,000 positives, at a per-test p-value threshold of <0.05. By the definition of the p-value you expected up to 50,000 false positives: maybe all the samples are in fact from the null hypothesis, and at a significance threshold α=0.05, you expect 5% false positives. So if you trust your numbers, the remaining 30,000 samples are expected to be true positives. You could say that the expected fraction of false positives in your 80,000 positives is 50000/80000 = 62.5%.

In this example we get a table like this, with a lot of unknowns:

| | Null hypothesis | Alternative hypothesis | Total |
|---|---|---|---|
| Positive test | Expect 50000 | | 80000 |
| Negative test | | | |
| Total | | | 1000000 |

In the controlling procedure, we make an estimate similar to Holm's and Hochberg's procedures above. Instead of scaling by the total number of tests, as Bonferroni does, we scale by the number of tests with p-values at most our $j$th smallest p-value, $p_j$. This is because if we choose $p_j$ as our significance cutoff, we'd expect $(m+n) \, p_j$ false positives out of the total number of tests, and the number of tests with p-values at most $p_j$ is just the number of positive tests, $r = j$. Thus $$FDR = x/r = p_j \, (m+n) / j$$

To make this a true step-up procedure, we stop and reject once our adjusted FDR values fall below some threshold $\alpha$.
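A sketch of this step-up rule (the `bh_reject` helper is hypothetical; it returns how many of the sorted p-values are declared discoveries, writing the total test count $m+n$ as `N`):

```python
import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k
    is the largest j (1-indexed, p-values sorted ascending) with
    p_j <= j * alpha / N."""
    p = np.sort(np.asarray(pvals))
    N = len(p)
    j = np.arange(1, N + 1)
    below = np.nonzero(p <= j * alpha / N)[0]   # indices passing the threshold
    return below[-1] + 1 if below.size else 0   # number of discoveries

print(bh_reject([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074]))  # 2
```

Note the contrast with Holm's on the same p-values: BH controls the FDR rather than the FWER, so the cutoff line $j \alpha / N$ grows with rank $j$.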

In [13]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
```

In [4]:

```
!C:\Users\Kevin\Documents\Classes\MCB112\hw6\R-3.3.2\bin\Rscript analyze_W.r
```

In [5]:

```
data=pd.read_table('myresult.out',sep=' ')
```

In [6]:

```
data.head(10)
```

Out[6]:

In [7]:

```
pvals = data['PValue'].values
```

In [8]:

```
data.loc['tomato']
```

Out[8]:

In [9]:

```
n = len(pvals)
pr = data.loc['tomato'].PValue
r = np.where(pvals <= pr)   # count p-values at most p_r (the rank of p_r)
r = len(r[0])
```

In [10]:

```
FDR = n*data.loc['tomato'].PValue / r
FDR
```

Out[10]:

In [11]:

```
# Let's make a plot of P_r vs. r
pvals = data['PValue'].values
rvals = []
for pr in pvals:
    r = np.where(pvals <= pr)   # rank of p_r among all p-values
    rvals.append(len(r[0]))
```

In [34]:

```
%matplotlib notebook
plt.scatter(rvals,pvals,label='pvalues')
plt.ylabel('p-value')
plt.xlabel('r')
# plot fit line for our FDR cutoff
alpha = 0.05
n = len(pvals)
x2 = n
y2 = alpha
plt.plot([0, x2], [0, y2], color='k', linestyle='-', linewidth=2,label='BH significance cutoff')
plt.legend()
plt.show()
```

The above method assumes everything comes from the null hypothesis (that is, we estimate the false positives $x$ as the *total* number of tests times the p-value). However, we might be pretty sure the alternative hypothesis is true in some cases. Thus, to find $FDR = x/r$, we want $x = p \cdot n$ (instead of $p \cdot (m+n)$), where $x$ is the number of false discoveries, $p$ is the p-value, and $n$ is the number of tests from the null distribution.

How do we estimate $n$? One cool trick is to use the distribution of all our p-values. Under the null hypothesis we'd expect all of our p-values to be uniformly distributed. Under the alternative hypothesis, we should have a skewed distribution. Together we get a mixture that might look something like the graph below (taken from Storey and Tibshirani 2003).

Here, the dashed line is the density histogram of the uniform distribution if all the genes were from the null distribution. The dotted line is the estimated height of uniform distribution of the null p-values. The blue line ($\lambda=0.5$) is an example cutoff of where to begin estimating the truly null distribution.

For small p-values, we can assume that the mixture comes more strongly from the alternative hypothesis, but for larger p-values we can assume it comes only from the null. To estimate $n$, the number of tests from the null distribution, we simply count tests in the null distribution, high p-value regime and normalize.

$$ \hat{n} = \frac{\sum_j I(p_j > \lambda)}{(1-\lambda)} $$

Where $p_j$ is the $jth$ p-value, $I()$ is the indicator function (1 if True), and $\lambda$ is a threshold we set to assume p-values are coming from only the null hypothesis. This is often represented as the proportion of null p-values, $\hat{\pi_0} = n / (n+m) = \sum_j I(p_j > \lambda) / ((n+m)(1-\lambda))$.
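A minimal sketch of the $\hat{n}$ estimate (the `estimate_n_null` helper is hypothetical; $\lambda = 0.5$ as in the figure description above):

```python
import numpy as np

def estimate_n_null(pvals, lam=0.5):
    """Estimate the number of tests drawn from the null: count the p-values
    above lam (assumed to be purely null) and scale up by 1/(1 - lam)."""
    pvals = np.asarray(pvals)
    return np.sum(pvals > lam) / (1.0 - lam)

# Purely-null p-values are uniform on [0, 1], so on simulated null data
# the estimate should land near the true count (1000 here)
rng = np.random.default_rng(0)
null_p = rng.uniform(size=1000)
print(estimate_n_null(null_p))
```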

Then, for some threshold $\alpha$,

$$ \hat{FDR}(\alpha) = \frac{\hat{\pi}_0 (m+n) \alpha}{\sum_j I(p_j < \alpha)} = \frac{\hat{n} \alpha}{\sum_j I(p_j < \alpha)} $$

If we pick $\alpha = p_i$ for each of our $i$ p-values, and ordered $p_1 \leq p_2 \leq ... \leq p_{m+n}$, this is the same as above, just replacing our estimate $\hat{n}$ for $m+n$.

Storey defines the estimated q-value now as

$$ \hat{q}(p_j) = \min_{\alpha \geq p_j} \hat{FDR}(\alpha) $$

Where we tune our $\alpha$ threshold to set the best FDR for each point. Calculating the q-value in a step-up procedure is easy. For sorted p-values $p_1 ... p_m$, $m$ total tests, $n$ tests that are from the null distribution, and threshold $\alpha$:

$$ \hat{q}(p_m) = \min_{\alpha \geq p_m} \frac{\hat{n} \alpha}{\sum_j I(p_j < \alpha)} = \frac{\hat{n} \, p_m}{m} $$

And for stepping up from $i=(m-1):1$,

$$ \hat{q}(p_i) = \min_{\alpha \geq p_i} \frac{\hat{n} \alpha}{\sum_j I(p_j < \alpha)} = \min\left( \frac{\hat{n} \, p_i}{i}, \; \hat{q}(p_{i+1})\right) $$
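These two recurrences translate directly into code (a sketch; the `qvalues` helper is hypothetical and takes $\hat{n}$ as given, e.g. from the estimate above):

```python
import numpy as np

def qvalues(pvals, n_hat):
    """q-values for sorted p-values p_1 <= ... <= p_m, given an estimate
    n_hat of the number of null tests. Start from the largest p-value
    (base case) and step up, keeping the running minimum."""
    p = np.sort(np.asarray(pvals))
    m = len(p)
    q = np.empty(m)
    q[-1] = n_hat * p[-1] / m                        # base case: q(p_m)
    for i in range(m - 2, -1, -1):                   # i = m-1 ... 1, 0-indexed
        q[i] = min(n_hat * p[i] / (i + 1), q[i + 1])
    return q

print(qvalues([0.01, 0.02, 0.03, 0.5], n_hat=4))
```

The running minimum guarantees the q-values are monotone in the p-values, so thresholding q-values at some level gives a well-defined rejection set.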