### section week 8: RNA-Seq modeling/edgeR by James Xue¶

Available as a Jupyter notebook page or in html.

In this section we will attempt to take a closer look at the Poisson/negative binomial model and see how they can be used to model RNA-seq data.

Specifically, we can think of the poisson distribution as capturing "technical" noise and the negative binomial distribution as capturing "biological" and "technical" noise together.

After learning the negative binomial distribution, we will simulate our own RNA-seq dataset and run edgeR to test the program.

First, we will go back to the binomial distribution, recall that the binomial distribution is as follows:

$$P(x \mid p, N) = { N \choose x } p^x (1-p)^{N-x}$$

$N$ is the number of trials - $p$ can be interepreted as the probability of success.

Recall that the poisson distribution is as follows:

$P(x|\mu)=e^{\lambda}\frac{\lambda^x}{x!}$

Let $N\to \infty$ and $p \to 0$, and $Np=\lambda$, then it can be shown that:

${ N \choose x } p^b (1-p)^{N-x}\approx e^{\lambda}\frac{\lambda^x}{x!}$

For a proof of this (x is equivalent to k below),

The proof is pulled from here: https://en.wikipedia.org/wiki/Poisson_limit_theorem

Thus, the Poisson distribution is a very good approximation to the binomial distribution for large $N$ and small $p$.

Lets see this in simulations!

In [22]:
#Lets use the specific alpha, beta vlaues shown in the graph above AND use a range of sample sizes from 10 to 10000
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

nValues=[  50, 100]
probValues=[ 0.05, 0.2, 0.8]

fig, ax=plt.subplots(len(nValues), len(probValues),sharex=True, sharey=True)

colorMap={0:'r', 1:'b', 2:'g', 3:'m', 4:'orange'}

for ind in range(len(nValues)):

#given sample size, extract the samples from the superset
nValue=nValues[ind]

for ind2 in range(len(probValues)):
probValue=probValues[ind2]

lambdaVal=nValue*probValue

#plot the subfigure
color=colorMap[ind2]
xAxisVals = np.arange(stats.poisson.ppf(0.01, lambdaVal), stats.poisson.ppf(0.99, lambdaVal) )

ax[ind, ind2].plot(xAxisVals, stats.poisson.pmf(xAxisVals ,lambdaVal) ,color=str(color), linestyle='-', alpha=0.35 , label="Poisson")

ax[ind, ind2].plot(xAxisVals, stats.binom.pmf(xAxisVals ,nValue, probValue) ,color=str(color), linestyle=':',linewidth=1, label="Binomial" )

ax[ind, ind2].legend()
#ax[ind, ind2].xaxis.set_visible(False)
#ax[ind, ind2].yaxis.set_visible(False)
#for setting the labels
if ind2 == 0:
ax[ind, ind2].set_ylabel("n={0}".format(nValue))
if ind == len(ax)-1:
ax[ind, ind2].set_xlabel("p={0}".format(probValue))

plt.show()

Now that we can think of the poisson distribution as being related to the binomial, lets try to relate the binomial distribution with RNA-Seq data.

$N$ in our case would be the total number of RNA-seq reads. $p$ in our case would be the probability of sampling a particular gene.

In the RNA-Seq scenario, we see that we have a very large $N$ and a very small $p$, thus it may be safe to use a poisson distribution to model the sampling variability of a particular gene.

In a typlical RNA-Seq experiment, we may have 30 million reads, and have $p=1/20000$, where 20000 is the number of genes, so lets set $N=30*10^6$ and $p=1/25000$ (thus $\lambda=N*p=1500$ and see how well the poisson approximates the binomial.

In [77]:
n=30*(10**6)

p=1/20000

mu=n*p

xAxisVals = np.arange(stats.poisson.ppf(0.01, mu), stats.poisson.ppf(0.99, mu) )

plt.plot(xAxisVals, stats.poisson.pmf(xAxisVals ,mu) , linestyle='-', alpha=0.35, label="Poisson")

plt.plot(xAxisVals, stats.binom.pmf(xAxisVals ,n, p) , linestyle=':' , linewidth=3, label="Binomial")

plt.legend()

plt.show()

Wow! looks pretty good huh? so it may be safe for us to use the poisson distribution to model RNA-Seq count data.

Sean mentioned how a poisson distribution is good to model techinical noise. That is imagine that we have one biologial sample, and we repeatedly do RNA-Seq for the same sample, then the counts for a gene that we acquire for ALL RNA-Seq experiments on the single sample will be poisson distributed.

This line of thinking is analagous to the simulation that we just did, in which the true proportion of the gene in the RNA-Seq sample stays the same (at 1/20000), but if we repeatedly get counts of that gene from each RNA-Seq experiment (each one 30 million reads), then our distribution will be approximately poisson.

Now, we just described how to model technical variability, but what about biological variability?

Well in class, aside from the poisson distribution, RNA-Seq has also been extensively modeled using the negative binomial distribution. Recall that the negative binomial distribution is as follows:

$$P(x \mid \mu, \phi) = \frac{\Gamma \left( x + \phi^{-1} \right)} { \Gamma (\phi^{-1}) \; \Gamma (x+1)} \left( \frac{1}{1 + \mu \phi} \right)^{\phi^{-1}} \left( \frac{\mu }{\phi^{-1} + \mu} \right)^x$$

The mean of the NB is $\mu$. Its variance is $\mu + \phi \mu^2$. If $\phi\to0$, then the negative binomial distribution collapses to a poisson distribution. The negative binomial distribution can then be visualized as a more "flexible" distribution with more "allowance" for different variances. This is one way to understand it.. but is there another way?

Also, remember that if we reparametrize our earlier negative binomial distrubtion by $p = \frac{1}{1 + \mu \phi}$ and $n=\frac{1}{\phi}$, recall that the negative binomial distribution can also be thought of as:

$$P(x \mid n,p) = { x+n-1 \choose n-1 } p^n (1-p)^x$$

Sean mentioned way back briefly that the negative binomial can be modeled as a mixture of poisson distributions, with the parameter $\lambda$, being distributed as a Gamma distribution. Lets try to see it out:

Specifically, the negative binomial distribution is derived from the following:

$$P(X=x)=\int_{0}^{\infty} P(X=x|\lambda)p(\lambda)d\lambda$$

, where $P(X=x|\lambda)$ is the poisson distribution with parameter $\lambda$ and $p(\lambda)$ is the gamma distribution with parameters $\alpha,\beta$.

The gamma distribution is as follows:

$$P(\lambda)=\frac{\beta^{\alpha} }{\Gamma(\alpha) }\lambda^{\alpha-1} e^{-\beta\lambda}$$

To see a proof of this, check out this: https://probabilityandstats.wordpress.com/tag/poisson-gamma-mixture/ (note that in the proof, $\alpha$ and $\beta$ are interchanged compared to what I defined above).

Knowing the above, the following can be used to generate a negative binomial distriubution.

First, draw $\lambda$ from $Gamma(\alpha,\beta)$. Then, draw a count from $Pois(\lambda)$ If we repeat this process a lot and plot the distribution of counts, then our counts will look to be negative binomially distributed!

To relate back to our RNA-Seq example, $\lambda$ represents the biological replicate that we sample from a specific condition (i.e. wildtype), then after we sample our biological replicate, we perform RNA-Seq on the replicate, and our counts will have poisson variability relating to our earlier discussion.

Again, lets simulate to see it happen.

Imagine a scenario where the true $\lambda$ has a $Gamma(1.5, 0.001)$. Note also that the expected value of a Gamma distribution is $\frac{\alpha}{\beta}$, thus the expected value of the distribution is 1500.

That is, our $\lambda$ has the following distribution:

In [82]:
alpha=1.5
beta=0.001
xAxisVals = np.linspace(stats.gamma.ppf(0.01, alpha, scale=1/beta), stats.gamma.ppf(0.99, alpha, scale=1/beta), 100)
plt.plot(xAxisVals, stats.gamma.pdf(xAxisVals ,alpha, scale=1/beta), '-' )
plt.show()

Now, we will draw lambda values from the gamma distribution, and for each lambda value drawn, we will draw a count from a poisson distribution with that parameter.

We will then compare this distribution with the true negative binomial disribution.

In [25]:
#number of lambda values to draw from the Gamma distribution shown - "i.e. number of biological replicates"

#for each lambda value drawn, we will draw counts from a poisson disribution with parameter set to the lambda value drawn
#"i.e. number of techinical replicates from the biological replicate"
numberCountDraws=100

allCounts=[]
#draw a lambda value

for countInd in range(numberCountDraws):
allCounts.append(count)

#now we will plot the distribution of counts from this simulation, and plot the theoretical negative binomial curve

#plot true negative binomial distribution
plt.hist(allCounts, alpha=0.5, bins=50, normed=True, label="Poisson-Gamma Simulation")

#this is to get it in the right parameteriztion for the NB distribution
#for reasoning see equation 18 of this:  https://probabilityandstats.wordpress.com/tag/poisson-gamma-mixture/
#remember that in above proof, I interchanged alpha and beta
#and it with the scipy parameterization:  https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.nbinom.html

p=beta/(beta+1)
n=alpha

xAxisVals = np.arange(stats.nbinom.ppf(0.01, n, p), stats.nbinom.ppf(0.99, n, p) )

plt.plot(xAxisVals, stats.nbinom.pmf(xAxisVals ,n, p) , linestyle='-', alpha=0.35, label="Negative Binomial")

plt.legend()

plt.show()

### Simulating RNA-Seq data using the negative binomial (NB) model and testing edgeR¶

Now, in the spirit of the class, lets try running edgeR on simulated data and see how good it is.

In [70]:
#pctDiffExp is the percent of genes that will be differentially expressed
pctDiffExp=0.01

#diffExp are the different magnitudes that can be given to a gene that is differentially expressed
diffExp=[1,1.5,2]

numGenes=20000

phi=0.1

#first simulate which genes are going to be differentially expressed
diffExpInd=np.random.choice(range(0, 10000), int(pctDiffExp*numGenes) )

#the default log fold change is 0

#diffExpGeneMagnitude is a list holding the magnitudes of the differentially expressed genes
diffExpGeneMagnitude=[0]*numGenes
for geneInd in range(numGenes):
if geneInd in diffExpInd:

#randomly pick magnitude and direction
diffExpIter=np.random.choice(diffExp,1)[0]
diffExpGeneMagnitude[geneInd]=diffExpIter*float(np.random.choice([1,-1], 1)[0])

#now, knowing which genes are differentially expressed, we simulate each sample's counts using the negative binomial distribution
#note that these genes will all have different mu's, but the same phi - which is the same assumption in edgeR

initialMus=stats.gamma.rvs( alpha, scale=1/beta, size=numGenes)
numSamples=6

#0 will represent wild type
#1 will represent treatment
sampleConditionID=[0,0,0,1,1,1]

allSampleCounts=[]
for sampleInd in range(numSamples):

#generate the counts for each sample
sampleIterGeneCounts=[0]*numGenes

sampleCond=sampleConditionID[sampleInd]

#wildtype
if sampleCond==0:
for geneInd in range(numGenes):

muVal=initialMus[geneInd]

p=1/(1+muVal*phi)
r=1/phi

simCount=stats.nbinom.rvs(r,p, size=1)[0]
sampleIterGeneCounts[geneInd]=simCount

#treatment
else:
for geneInd in range(numGenes):

#take the gene magnitude difference into account
muVal=(2**(diffExpGeneMagnitude[geneInd]))*initialMus[geneInd]

p=1/(1+muVal*phi)
r=1/phi

simCount=stats.nbinom.rvs(r,p, size=1)[0]
sampleIterGeneCounts[geneInd]=simCount

allSampleCounts.append(sampleIterGeneCounts)

#now we will write out the simulated count file
outFileName="mydata.tbl"
with open(outFileName, "w") as outFile:

for geneInd in range(numGenes):
outFile.write("Sim_Gene_"+str(geneInd)+"\t")
for sampleInd in range(numSamples-1):
outFile.write(str(allSampleCounts[sampleInd][geneInd])+"\t")
outFile.write(str(allSampleCounts[sampleInd+1][geneInd])+"\n")
outFile.close()

#write out the true gene magnitude
outFileName="true_gene_magnitudes.txt"
with open(outFileName, "w") as outFile:

for geneInd in range(len(diffExpGeneMagnitude)):
outFile.write("Sim_Gene_"+str(geneInd)+"\t"+str(diffExpGeneMagnitude[geneInd])+"\n")
outFile.close()

Now that we generated our simulated data, lets take a look at what it looks like

In [71]:
Sim_Gene_0	4019	2672	1982	1494	3720	4895
Sim_Gene_1	413	558	231	173	483	220
Sim_Gene_2	558	363	277	263	522	537
Sim_Gene_3	1420	1850	762	2336	1956	1513
Sim_Gene_4	1274	1370	609	1777	765	783
Sim_Gene_5	1557	1703	1765	1167	1511	1153
Sim_Gene_6	2638	1817	2349	1204	1583	2227
Sim_Gene_7	606	688	925	640	1181	530
Sim_Gene_8	366	847	579	387	627	464
Sim_Gene_9	2412	3909	2077	5787	3188	3858
In [72]:
Sim_Gene_0	0
Sim_Gene_1	0
Sim_Gene_2	0
Sim_Gene_3	0
Sim_Gene_4	0
Sim_Gene_5	0
Sim_Gene_6	0
Sim_Gene_7	0
Sim_Gene_8	0
Sim_Gene_9	0

Lets only take a look from our simulations at the genes that are differentially expressed.

The last column of the file is the log fold change

In [73]:
!paste mydata.tbl true_gene_magnitudes.txt | awk '$NF!=0{print}' | cut -f1-7,9 Sim_Gene_12 1189 707 916 2822 1818 2171 2.0 Sim_Gene_161 308 215 470 1776 2545 1476 2.0 Sim_Gene_166 802 1354 1578 6835 5277 6978 2.0 Sim_Gene_225 1022 940 595 325 261 454 -1.0 Sim_Gene_273 6158 5045 3251 5788 6266 4344 1.0 Sim_Gene_317 933 1948 1019 4783 3746 3739 2.0 Sim_Gene_342 251 490 133 132 70 49 -1.5 Sim_Gene_469 2280 2682 2009 4984 5218 4423 2.0 Sim_Gene_499 1962 2179 2082 1996 678 726 -1.0 Sim_Gene_541 979 713 814 2533 3133 3024 1.5 Sim_Gene_553 824 496 754 168 332 290 -1.5 Sim_Gene_600 3596 3983 1563 1917 1269 1699 -1.0 Sim_Gene_628 647 1106 977 430 336 237 -1.5 Sim_Gene_660 2308 1440 816 2250 4929 3209 1.0 Sim_Gene_794 1589 1975 1436 379 997 755 -1.0 Sim_Gene_803 400 469 388 3643 1934 3252 2.0 Sim_Gene_872 473 377 322 1001 1782 3200 2.0 Sim_Gene_901 475 438 414 797 295 634 1.0 Sim_Gene_958 208 383 196 91 206 160 -1.0 Sim_Gene_959 1362 718 831 2686 2670 3742 1.5 Sim_Gene_973 244 427 329 103 72 124 -1.5 Sim_Gene_994 737 545 744 2210 2647 2483 2.0 Sim_Gene_1128 1035 981 1164 579 1017 700 -1.0 Sim_Gene_1143 1050 1219 1533 1158 787 450 -1.0 Sim_Gene_1155 1179 694 1093 3522 3428 1813 2.0 Sim_Gene_1200 975 1029 607 268 342 397 -1.0 Sim_Gene_1225 3584 2354 3559 8995 8011 5284 1.0 Sim_Gene_1226 2303 1857 2763 583 1083 491 -1.0 Sim_Gene_1265 1995 1676 818 578 468 454 -1.0 Sim_Gene_1304 365 400 457 1203 2055 1052 2.0 Sim_Gene_1367 121 118 145 34 55 46 -1.5 Sim_Gene_1384 2081 1471 1582 295 545 738 -1.5 Sim_Gene_1399 1012 351 724 468 262 291 -1.5 Sim_Gene_1654 2046 3033 7271 8601 6916 6645 1.0 Sim_Gene_1684 593 842 703 379 243 247 -1.0 Sim_Gene_1693 464 760 868 321 145 312 -2.0 Sim_Gene_1791 2651 4441 2203 511 1867 590 -1.5 Sim_Gene_1893 557 1049 1151 284 501 355 -1.5 Sim_Gene_1958 797 729 884 4690 3017 3989 2.0 Sim_Gene_2010 1940 2586 1521 534 545 614 -1.5 Sim_Gene_2055 1400 969 734 424 414 333 -1.5 Sim_Gene_2195 1145 1317 2135 491 643 432 -2.0 Sim_Gene_2275 4404 4955 2952 7963 11568 17938 1.5 Sim_Gene_2364 1077 1296 984 2562 3846 1724 2.0 Sim_Gene_2452 671 1409 1558 5046 2840 3006 2.0 Sim_Gene_2480 81 339 100 888 703 793 2.0 Sim_Gene_2488 111 293 78 375 355 292 1.0 Sim_Gene_2525 2172 2281 2190 7350 8118 8532 1.5 Sim_Gene_2526 819 873 690 1182 2334 2398 1.5 Sim_Gene_2713 1450 1279 1851 1952 3472 4005 1.0 Sim_Gene_2859 508 908 878 3494 2320 3143 1.5 Sim_Gene_2894 528 814 570 1930 1769 2083 1.5 Sim_Gene_3069 1581 980 2332 283 417 505 -2.0 Sim_Gene_3126 841 499 501 262 75 127 -2.0 Sim_Gene_3167 186 112 103 54 42 43 -2.0 Sim_Gene_3242 588 528 681 1694 1637 1186 1.0 Sim_Gene_3244 1061 716 901 86 228 277 -2.0 Sim_Gene_3254 851 458 428 185 282 116 -1.0 Sim_Gene_3275 143 122 117 365 283 206 1.0 Sim_Gene_3371 4726 4291 3040 1487 1381 1231 -1.5 Sim_Gene_3471 1728 2638 1318 796 1524 1101 -1.0 Sim_Gene_3491 1492 1346 1777 940 539 510 -1.5 Sim_Gene_3587 365 451 738 1295 1709 1882 1.5 Sim_Gene_3666 4011 2491 3903 7085 6445 4668 1.0 Sim_Gene_3835 199 369 362 239 161 268 -1.0 Sim_Gene_3949 2875 1959 3308 9289 19692 9128 2.0 Sim_Gene_4029 2224 2331 1834 7166 9350 3842 1.5 Sim_Gene_4081 2951 5278 2226 866 222 881 -2.0 Sim_Gene_4109 174 86 295 59 76 97 -1.0 Sim_Gene_4115 53 46 46 285 78 224 1.5 Sim_Gene_4132 740 982 809 967 1700 1833 1.0 Sim_Gene_4134 2782 2731 3050 10611 15884 8984 2.0 Sim_Gene_4159 1347 577 891 1685 1518 1035 1.0 Sim_Gene_4174 933 704 529 2015 1868 4934 2.0 Sim_Gene_4248 203 369 114 746 800 462 2.0 Sim_Gene_4332 541 675 575 271 147 227 -1.5 Sim_Gene_4335 1975 2445 4626 825 1668 1415 -1.0 Sim_Gene_4347 210 237 451 98 55 102 -2.0 Sim_Gene_4474 464 438 410 135 200 252 -1.0 Sim_Gene_4563 133 229 134 433 733 669 1.5 Sim_Gene_4577 1216 1910 2034 6968 10474 5131 2.0 Sim_Gene_4721 1230 1182 933 292 133 132 -2.0 Sim_Gene_4753 132 272 134 389 321 324 1.0 Sim_Gene_4773 1095 1308 1742 2784 2371 3295 1.5 Sim_Gene_4799 2616 2739 3008 1520 2066 858 -1.0 Sim_Gene_4931 4796 3277 3280 386 985 908 -2.0 Sim_Gene_4976 171 275 199 1095 1281 1392 2.0 Sim_Gene_5040 571 842 990 756 994 2097 1.5 Sim_Gene_5041 365 469 792 2116 1173 974 1.0 Sim_Gene_5072 1622 1059 2251 4139 4235 4867 1.5 Sim_Gene_5127 4 12 11 3 1 5 -1.0 Sim_Gene_5212 1137 881 930 359 365 894 -1.0 Sim_Gene_5244 1011 1055 1482 2059 2493 2412 1.5 Sim_Gene_5281 787 1357 1770 1797 1472 1321 1.0 Sim_Gene_5285 2330 1468 4016 4033 5857 2563 1.0 Sim_Gene_5386 534 295 284 104 83 47 -2.0 Sim_Gene_5437 1002 734 1165 171 221 310 -2.0 Sim_Gene_5451 239 262 294 541 510 534 1.5 Sim_Gene_5489 1982 1663 1803 464 229 427 -2.0 Sim_Gene_5496 2498 1312 2702 782 748 1116 -1.5 Sim_Gene_5501 709 1049 1708 441 401 902 -1.0 Sim_Gene_5533 1527 1455 1268 691 548 317 -1.5 Sim_Gene_5567 1932 2925 2605 606 783 337 -2.0 Sim_Gene_5572 538 346 597 175 204 215 -2.0 Sim_Gene_5606 1672 924 2467 497 707 409 -1.5 Sim_Gene_5707 3095 3668 2433 4057 8542 15531 2.0 Sim_Gene_5718 1065 972 956 6536 3354 7819 2.0 Sim_Gene_5742 2462 2564 3497 4225 9169 7249 1.5 Sim_Gene_5743 200 182 297 84 144 149 -1.0 Sim_Gene_5810 978 1034 942 304 360 258 -2.0 Sim_Gene_5830 3876 2910 2822 5858 7824 6287 1.5 Sim_Gene_5834 617 585 391 1614 1467 1225 1.0 Sim_Gene_5854 1617 873 1124 1466 962 1126 -1.0 Sim_Gene_5921 496 695 937 250 176 366 -2.0 Sim_Gene_6058 801 924 613 197 265 141 -2.0 Sim_Gene_6113 2210 2150 1752 443 566 502 -2.0 Sim_Gene_6114 521 363 755 122 111 181 -1.5 Sim_Gene_6173 1562 2182 1718 2898 2414 5309 1.5 Sim_Gene_6178 615 443 675 138 183 249 -2.0 Sim_Gene_6192 4829 3843 3708 7626 9337 13665 1.0 Sim_Gene_6225 1449 1661 1125 6568 6399 7312 2.0 Sim_Gene_6253 1992 1486 1496 4839 2759 3853 1.0 Sim_Gene_6307 515 1470 1366 296 230 351 -2.0 Sim_Gene_6365 790 1440 1961 7417 6032 3366 2.0 Sim_Gene_6576 3273 4304 3986 5373 8950 4945 1.0 Sim_Gene_6583 710 435 677 1447 2248 1502 1.5 Sim_Gene_6597 4242 2878 5029 14253 16941 16275 2.0 Sim_Gene_6607 401 295 234 702 654 820 1.0 Sim_Gene_6631 1060 1232 1650 1463 1817 2991 1.5 Sim_Gene_6654 2155 1194 933 7056 4857 5770 2.0 Sim_Gene_6664 570 420 476 116 157 172 -1.5 Sim_Gene_6668 4181 7890 6340 1104 1287 779 -2.0 Sim_Gene_6673 977 1169 1084 6710 5338 3150 1.5 Sim_Gene_6687 280 273 480 257 288 104 -1.5 Sim_Gene_6693 3536 3857 5154 2700 1692 1961 -1.0 Sim_Gene_6799 404 560 361 3080 2508 2490 2.0 Sim_Gene_6860 3249 3097 4336 1363 1954 1607 -1.0 Sim_Gene_6880 2301 1708 3691 4085 5176 1749 1.0 Sim_Gene_6895 2081 2027 3268 5865 9109 9923 1.5 Sim_Gene_6896 2954 2098 1653 731 1001 868 -1.5 Sim_Gene_6921 2550 2887 3201 3424 4809 4300 1.0 Sim_Gene_6929 3299 2968 1529 20076 9189 18282 1.5 Sim_Gene_6947 4868 3067 3772 8460 3814 5225 1.0 Sim_Gene_7020 797 601 783 2266 2697 2576 2.0 Sim_Gene_7082 3911 3479 5209 11430 7710 7247 1.0 Sim_Gene_7144 952 633 569 1805 1342 1285 1.0 Sim_Gene_7163 2346 2294 3905 11152 11562 9905 1.5 Sim_Gene_7262 1461 3938 4844 12986 10416 10731 1.5 Sim_Gene_7333 2084 2626 4054 1314 1666 1659 -1.5 Sim_Gene_7355 176 105 163 27 10 27 -2.0 Sim_Gene_7500 2764 2965 3768 21665 14391 12453 2.0 Sim_Gene_7516 2204 1310 1397 5725 2586 2706 1.5 Sim_Gene_7602 1158 1243 1525 12672 11370 8631 2.0 Sim_Gene_7658 656 1154 1026 1592 2556 2773 1.5 Sim_Gene_7689 764 409 766 161 120 203 -2.0 Sim_Gene_7710 1019 903 698 233 193 221 -2.0 Sim_Gene_7790 4827 2470 3602 3813 5296 3698 1.0 Sim_Gene_7826 1678 4135 3738 8405 6933 8887 1.0 Sim_Gene_7845 1556 2538 2081 5719 5711 7840 2.0 Sim_Gene_7849 1173 634 887 1503 1190 590 1.0 Sim_Gene_7921 2284 4169 3132 8647 14505 21930 2.0 Sim_Gene_7970 781 627 1148 985 855 733 -1.0 Sim_Gene_7984 22 26 18 5 7 9 -2.0 Sim_Gene_8100 2079 1545 1799 1072 1663 1047 -1.0 Sim_Gene_8216 451 388 468 47 77 115 -2.0 Sim_Gene_8259 1688 976 1341 2230 2426 3367 1.0 Sim_Gene_8274 2090 1913 1744 7523 6684 5394 2.0 Sim_Gene_8314 666 1029 1458 216 190 229 -2.0 Sim_Gene_8319 1127 919 742 5282 4975 1969 2.0 Sim_Gene_8446 321 318 240 24 102 61 -2.0 Sim_Gene_8539 809 736 469 245 286 178 -2.0 Sim_Gene_8759 1619 3585 3230 417 737 296 -2.0 Sim_Gene_8764 2873 2087 2029 7970 13497 20910 2.0 Sim_Gene_8877 575 1244 799 805 385 586 -1.0 Sim_Gene_8933 1288 1471 2148 2548 4176 4450 1.5 Sim_Gene_9030 1509 787 329 2803 3708 5338 2.0 Sim_Gene_9097 86 127 173 37 30 24 -2.0 Sim_Gene_9100 497 534 488 149 217 61 -2.0 Sim_Gene_9116 1966 1460 1266 385 652 498 -2.0 Sim_Gene_9119 1149 568 836 1412 1213 2091 1.0 Sim_Gene_9153 1390 1193 1444 313 285 194 -1.5 Sim_Gene_9233 1057 628 803 402 385 251 -1.5 Sim_Gene_9275 741 807 386 194 340 141 -1.0 Sim_Gene_9278 405 183 352 104 111 99 -1.5 Sim_Gene_9332 2161 2781 2358 15685 10043 14874 2.0 Sim_Gene_9350 388 466 321 121 131 113 -1.5 Sim_Gene_9358 971 883 998 657 944 735 -1.0 Sim_Gene_9455 1652 1680 1625 845 982 938 -1.0 Sim_Gene_9560 451 287 397 85 88 134 -1.5 Sim_Gene_9653 1785 3707 1715 6191 10282 6874 1.5 Sim_Gene_9872 470 582 410 205 161 159 -1.0 Sim_Gene_9897 1219 1314 702 535 434 392 -1.5 Sim_Gene_9912 665 602 572 193 202 131 -1.5 Sim_Gene_9948 1764 905 1682 611 535 500 -1.5 Sim_Gene_9991 560 718 586 2232 2251 1624 2.0 In [74]: #see below for the exact code in analyze_GL_JX_section.r !Rscript analyze_GL_JX_section.r Loading required package: limma Warning message: package ‘limma’ was built under R version 3.3.3 Design matrix not provided. Switch to the classic mode. [1] "The dispersion estimate is:" [1] 0.09968116 We see that the dispersion estimate is pretty close to our simulated data, 0.1, not bad! Now, lets take a look to see how well edgeR does in capturing our true differentially expressed genes!/ the true FDR and compare it to the reported FDR If we set a FDR cutoff of 0.1, then edgeR picks up the following: In [76]: FDRCutoff=0.1 trueGeneMagnitudes=[] trueGeneMagnitudeFileName="true_gene_magnitudes.txt" trueGeneMagnitudesDict=dict() totalNumPositives=0 totalNumPositivesAllArr=[] with open(trueGeneMagnitudeFileName, "r") as trueGeneMagnitudeFile: for line in trueGeneMagnitudeFile: line=line.split() geneId=line[0] trueMagnitude=float(line[1]) trueGeneMagnitudesDict[geneId]=trueMagnitude if trueMagnitude!=0: totalNumPositives+=1 TP=0 FP=0 TPAllArr=[] edgeROutputFileName="myresult.out" with open(edgeROutputFileName, "r") as edgeROutputFile: #skip header line edgeROutputFile.readline() for line in edgeROutputFile: line=line.split() #use [1:] to ignore quotations geneId=line[0][1:len(line[0])-1] #note that edgeR scrambles the gene order that was originally inputted in edgeRFDR=float(line[4]) if(edgeRFDR<FDRCutoff and trueGeneMagnitudesDict[geneId]!=0 ): TP+=1 if(edgeRFDR<FDRCutoff and trueGeneMagnitudesDict[geneId]==0 ): FP+=1 string="edgeR was able to detect {} of all genes with fold changes (sensitivity)".format(TP/float(totalNumPositives)) print(string) print("\n") string="the true FDR is {}".format(FP/float(TP+FP)) print(string) edgeR was able to detect 0.6307692307692307 of all genes with fold changes (sensitivity) the true FDR is 0.15172413793103448 Above, we see that the advertised FDR matches pretty closely with the true FDR, so edgeR also gives a good approximation to the true FDR. The modified Rscript code (from analyze_GL_JX_section.r ) I used is as follows: Remember that the below is R code, so you need to copy it to a file, and run it with Rscript to see it work In [ ]: # G. Lestrade's workmanlike RNA-seq analysis script # # To run: # % Rscript analyze_GL.r # # You need to have your data in a file called "mydata.tbl" # This is a tab-delimited file with a header line, followed # by one line per gene: # <genename> <counts1> <counts2> ... <counts6> # Script assumes that there are six samples, three from one # condition (e.g. wild type), three from another (e.g. mutant). # # The script generates an output file "myresult.out". # library(edgeR) infile <- "mydata.tbl" group <- factor(c(1,1,1,2,2,2)) outfile <- "myresult.out" x <- read.table(infile, sep='\t', row.names=1) y <- DGEList(counts=x,group=group) y <- estimateDisp(y) et <- exactTest(y) tab <- topTags(et, nrow(x)) write.table(tab, file=outfile) ######## look at dispersion estimate ############ print("The dispersion estimate is:") print(y$common.dispersion)
#################################################