from: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data
edgeR假设RNA-seq的reads计数对于每个基因来说是符合负二项式分布(Negative Binomial distribution)的。其公式可以表示为, 在这个公式中,p和r都是与平均值及方差相关的参数:
所以可以简写为
edgeR再假设
其中,a是由可以由实验中相关reads数来估计,于是
在计算中,就只有一个参数需要确定了。
我们来看看在edgeR中是如何一步一步来完成关键数据的。
> library(edgeR) > x <- dat ## > head(x) rep1.wt rep2.wt rep3.wt rep1.mu rep2.mu rep3.mu 128up 140 63 76 64 96 102 14-3-3epsilon 2558 1468 2351 643 3366 3423 14-3-3zeta 7109 4275 7319 2047 9293 10198 140up 101 52 49 37 102 103 18w 202 128 216 38 365 428 26-29-p 589 322 352 198 868 748 > dim(x) [1] 15340 6 > group <- factor(c(1,1,1,2,2,2)) > y <- DGEList(counts=x, group=group) > y An object of class "DGEList" $counts rep1.wt rep2.wt rep3.wt rep1.mu rep2.mu rep3.mu 128up 140 63 76 64 96 102 14-3-3epsilon 2558 1468 2351 643 3366 3423 14-3-3zeta 7109 4275 7319 2047 9293 10198 140up 101 52 49 37 102 103 18w 202 128 216 38 365 428 15335 more rows ... $samples group lib.size norm.factors rep1.wt 1 4392624 1 rep2.wt 1 2551036 1 rep3.wt 1 3769017 1 rep1.mu 2 1313816 1 rep2.mu 2 4734075 1 rep3.mu 2 5265483 1 > y <- calcNormFactors(y) > y An object of class "DGEList" $counts rep1.wt rep2.wt rep3.wt rep1.mu rep2.mu rep3.mu 128up 140 63 76 64 96 102 14-3-3epsilon 2558 1468 2351 643 3366 3423 14-3-3zeta 7109 4275 7319 2047 9293 10198 140up 101 52 49 37 102 103 18w 202 128 216 38 365 428 15335 more rows ... $samples group lib.size norm.factors rep1.wt 1 4392624 1.0033780 rep2.wt 1 2551036 0.9949814 rep3.wt 1 3769017 1.0104988 rep1.mu 2 1313816 0.9490503 rep2.mu 2 4734075 1.0372123 rep3.mu 2 5265483 1.0069962 > y <- estimateCommonDisp(y) > y An object of class "DGEList" $counts rep1.wt rep2.wt rep3.wt rep1.mu rep2.mu rep3.mu 128up 140 63 76 64 96 102 14-3-3epsilon 2558 1468 2351 643 3366 3423 14-3-3zeta 7109 4275 7319 2047 9293 10198 140up 101 52 49 37 102 103 18w 202 128 216 38 365 428 15335 more rows ... $samples group lib.size norm.factors rep1.wt 1 4392624 1.0033780 rep2.wt 1 2551036 0.9949814 rep3.wt 1 3769017 1.0104988 rep1.mu 2 1313816 0.9490503 rep2.mu 2 4734075 1.0372123 rep3.mu 2 5265483 1.0069962 $common.dispersion [1] 0.1061981 $pseudo.counts rep1.wt rep2.wt rep3.wt rep1.mu rep2.mu rep3.mu 128up 106.29550 82.91595 66.47710 165.48084 64.77507 63.63193 14-3-3epsilon 1937.23047 1930.58220 2060.49602 1722.83854 2288.34657 2154.92784 14-3-3zeta 5383.81850 5621.98965 6414.62415 5480.88398 6317.35617 6419.98148 140up 76.72546 68.28217 42.77138 97.10852 69.17872 64.55668 18w 152.84329 168.37148 189.36722 105.41837 248.39971 269.95767 15335 more rows ... $pseudo.lib.size [1] 3337947 $AveLogCPM [1] 4.805579 9.239636 10.797894 4.407722 5.849622 15335 more elements ... > y <- estimateTagwiseDisp(y) > y An object of class "DGEList" $counts rep1.wt rep2.wt rep3.wt rep1.mu rep2.mu rep3.mu 128up 140 63 76 64 96 102 14-3-3epsilon 2558 1468 2351 643 3366 3423 14-3-3zeta 7109 4275 7319 2047 9293 10198 140up 101 52 49 37 102 103 18w 202 128 216 38 365 428 15335 more rows ... $samples group lib.size norm.factors rep1.wt 1 4392624 1.0033780 rep2.wt 1 2551036 0.9949814 rep3.wt 1 3769017 1.0104988 rep1.mu 2 1313816 0.9490503 rep2.mu 2 4734075 1.0372123 rep3.mu 2 5265483 1.0069962 $common.dispersion [1] 0.1061981 $pseudo.counts rep1.wt rep2.wt rep3.wt rep1.mu rep2.mu rep3.mu 128up 106.29550 82.91595 66.47710 165.48084 64.77507 63.63193 14-3-3epsilon 1937.23047 1930.58220 2060.49602 1722.83854 2288.34657 2154.92784 14-3-3zeta 5383.81850 5621.98965 6414.62415 5480.88398 6317.35617 6419.98148 140up 76.72546 68.28217 42.77138 97.10852 69.17872 64.55668 18w 152.84329 168.37148 189.36722 105.41837 248.39971 269.95767 15335 more rows ... $pseudo.lib.size [1] 3337947 $AveLogCPM [1] 4.805579 9.239636 10.797894 4.407722 5.849622 15335 more elements ... $prior.n [1] 2.5 $tagwise.dispersion [1] 0.11460553 0.07781749 0.07824818 0.08253584 0.09628774 15335 more elements ... > et <- exactTest(y) > topTags(et) Comparison of groups: 2-1 logFC logCPM PValue FDR mthl8 -8.593569 5.367184 2.534465e-44 3.887870e-40 TrpA1 4.953899 5.430934 1.097770e-31 8.419893e-28 Lasp -6.033451 4.257209 1.327186e-27 6.786344e-24 Cyp6a2 -2.694373 7.226899 3.933563e-14 1.508521e-10 Map205 -3.603284 5.280693 1.648458e-12 5.057468e-09 CR44030 4.029722 3.974972 3.762876e-11 9.620420e-08 PGRP-SD -2.561549 5.640441 5.369299e-11 1.176643e-07 Tep4 -2.558393 7.375954 3.643185e-10 6.985807e-07 CR43767 -2.197766 4.986564 8.653845e-10 1.475000e-06 Ady43A 1.965976 6.086523 1.394387e-09 2.138990e-06 |
通过比较每一步之后y的元素,我们发现,calcNormFactors为y的每个样品增加了normalization factors;estimateCommonDisp为y增加了common.dispersion,并且由这个common.dispersion重新计算了每个基因的计数;estimateTagwiseDisp为y的每个基因计算出了tagwise.dispersion,这个值就是所谓的alpha值了。