理解edgeR

from: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data

edgeR假设RNA-seq的reads计数对于每个基因来说是符合负二项式分布(Negative Binomial distribution)的。其公式可以表示为Pr(K=k) = {k+r-1 \choose k}\cdot (1-p)^r p^k, 在这个公式中,p和r都是与平均值及方差相关的参数:p=\frac{\mu}{\sigma ^2} \text{ and } r=\frac{\mu ^2}{\sigma ^2 - \mu}
所以可以简写为Pr(K=k) = NB(\mu , \sigma^2)
edgeR再假设\sigma^2 = \mu + \alpha \mu^2
其中,a是由可以由实验中相关reads数来估计,于是Pr(K=k) = f(\mu)
在计算中,就只有一个参数需要确定了。

我们来看看在edgeR中是如何一步一步来完成关键数据的。

> library(edgeR)
> x <- dat ##
> head(x)
              rep1.wt rep2.wt rep3.wt rep1.mu rep2.mu rep3.mu
128up             140      63      76      64      96     102
14-3-3epsilon    2558    1468    2351     643    3366    3423
14-3-3zeta       7109    4275    7319    2047    9293   10198
140up             101      52      49      37     102     103
18w               202     128     216      38     365     428
26-29-p           589     322     352     198     868     748
> dim(x)
[1] 15340     6
> group <- factor(c(1,1,1,2,2,2))
> y <- DGEList(counts=x, group=group)
> y
An object of class "DGEList"
$counts
              rep1.wt rep2.wt rep3.wt rep1.mu rep2.mu rep3.mu
128up             140      63      76      64      96     102
14-3-3epsilon    2558    1468    2351     643    3366    3423
14-3-3zeta       7109    4275    7319    2047    9293   10198
140up             101      52      49      37     102     103
18w               202     128     216      38     365     428
15335 more rows ...
 
$samples
        group lib.size norm.factors
rep1.wt     1  4392624            1
rep2.wt     1  2551036            1
rep3.wt     1  3769017            1
rep1.mu     2  1313816            1
rep2.mu     2  4734075            1
rep3.mu     2  5265483            1
 
> y <- calcNormFactors(y)
> y
An object of class "DGEList"
$counts
              rep1.wt rep2.wt rep3.wt rep1.mu rep2.mu rep3.mu
128up             140      63      76      64      96     102
14-3-3epsilon    2558    1468    2351     643    3366    3423
14-3-3zeta       7109    4275    7319    2047    9293   10198
140up             101      52      49      37     102     103
18w               202     128     216      38     365     428
15335 more rows ...
 
$samples
        group lib.size norm.factors
rep1.wt     1  4392624    1.0033780
rep2.wt     1  2551036    0.9949814
rep3.wt     1  3769017    1.0104988
rep1.mu     2  1313816    0.9490503
rep2.mu     2  4734075    1.0372123
rep3.mu     2  5265483    1.0069962
 
> y <- estimateCommonDisp(y)
> y
An object of class "DGEList"
$counts
              rep1.wt rep2.wt rep3.wt rep1.mu rep2.mu rep3.mu
128up             140      63      76      64      96     102
14-3-3epsilon    2558    1468    2351     643    3366    3423
14-3-3zeta       7109    4275    7319    2047    9293   10198
140up             101      52      49      37     102     103
18w               202     128     216      38     365     428
15335 more rows ...
 
$samples
        group lib.size norm.factors
rep1.wt     1  4392624    1.0033780
rep2.wt     1  2551036    0.9949814
rep3.wt     1  3769017    1.0104988
rep1.mu     2  1313816    0.9490503
rep2.mu     2  4734075    1.0372123
rep3.mu     2  5265483    1.0069962
 
$common.dispersion
[1] 0.1061981
 
$pseudo.counts
                 rep1.wt    rep2.wt    rep3.wt    rep1.mu    rep2.mu    rep3.mu
128up          106.29550   82.91595   66.47710  165.48084   64.77507   63.63193
14-3-3epsilon 1937.23047 1930.58220 2060.49602 1722.83854 2288.34657 2154.92784
14-3-3zeta    5383.81850 5621.98965 6414.62415 5480.88398 6317.35617 6419.98148
140up           76.72546   68.28217   42.77138   97.10852   69.17872   64.55668
18w            152.84329  168.37148  189.36722  105.41837  248.39971  269.95767
15335 more rows ...
 
$pseudo.lib.size
[1] 3337947
 
$AveLogCPM
[1]  4.805579  9.239636 10.797894  4.407722  5.849622
15335 more elements ...
 
> y <- estimateTagwiseDisp(y)
> y
An object of class "DGEList"
$counts
              rep1.wt rep2.wt rep3.wt rep1.mu rep2.mu rep3.mu
128up             140      63      76      64      96     102
14-3-3epsilon    2558    1468    2351     643    3366    3423
14-3-3zeta       7109    4275    7319    2047    9293   10198
140up             101      52      49      37     102     103
18w               202     128     216      38     365     428
15335 more rows ...
 
$samples
        group lib.size norm.factors
rep1.wt     1  4392624    1.0033780
rep2.wt     1  2551036    0.9949814
rep3.wt     1  3769017    1.0104988
rep1.mu     2  1313816    0.9490503
rep2.mu     2  4734075    1.0372123
rep3.mu     2  5265483    1.0069962
 
$common.dispersion
[1] 0.1061981
 
$pseudo.counts
                 rep1.wt    rep2.wt    rep3.wt    rep1.mu    rep2.mu    rep3.mu
128up          106.29550   82.91595   66.47710  165.48084   64.77507   63.63193
14-3-3epsilon 1937.23047 1930.58220 2060.49602 1722.83854 2288.34657 2154.92784
14-3-3zeta    5383.81850 5621.98965 6414.62415 5480.88398 6317.35617 6419.98148
140up           76.72546   68.28217   42.77138   97.10852   69.17872   64.55668
18w            152.84329  168.37148  189.36722  105.41837  248.39971  269.95767
15335 more rows ...
 
$pseudo.lib.size
[1] 3337947
 
$AveLogCPM
[1]  4.805579  9.239636 10.797894  4.407722  5.849622
15335 more elements ...
 
$prior.n
[1] 2.5
 
$tagwise.dispersion
[1] 0.11460553 0.07781749 0.07824818 0.08253584 0.09628774
15335 more elements ...
 
> et <- exactTest(y)
> topTags(et)
Comparison of groups:  2-1 
            logFC   logCPM       PValue          FDR
mthl8   -8.593569 5.367184 2.534465e-44 3.887870e-40
TrpA1    4.953899 5.430934 1.097770e-31 8.419893e-28
Lasp    -6.033451 4.257209 1.327186e-27 6.786344e-24
Cyp6a2  -2.694373 7.226899 3.933563e-14 1.508521e-10
Map205  -3.603284 5.280693 1.648458e-12 5.057468e-09
CR44030  4.029722 3.974972 3.762876e-11 9.620420e-08
PGRP-SD -2.561549 5.640441 5.369299e-11 1.176643e-07
Tep4    -2.558393 7.375954 3.643185e-10 6.985807e-07
CR43767 -2.197766 4.986564 8.653845e-10 1.475000e-06
Ady43A   1.965976 6.086523 1.394387e-09 2.138990e-06

通过比较每一步之后y的元素,我们发现,calcNormFactors为y的每个样品增加了normalization factors;estimateCommonDisp为y增加了common.dispersion,并且由这个common.dispersion重新计算了每个基因的计数;estimateTagwiseDisp为y的每个基因计算出了tagwise.dispersion,这个值就是所谓的alpha值了。

发表评论

电子邮件地址不会被公开。 必填项已用*标注