bioconductor系列教程之二简单的Affymetrix 外显子(exon)分析流程(上)

从映射表获取注释(mapping to annotation)

首先加载xmapcore库。

> library(xmapcore)
Loading required package: IRanges

Attaching package: 'IRanges'

The following object(s) are masked from 'package:Biobase':

    updateObject

The following object(s) are masked from 'package:base':

    cbind, eval, Map, mapply, order, paste, pmax, pmax.int, pmin,
    pmin.int, rbind, rep.int, table


Attaching package: 'xmapcore'

The following object(s) are masked from 'package:exonmap':

    exon.details, exon.to.gene, exon.to.probeset, exon.to.transcript,
    exonic, exons.in.range, gene.details, gene.to.exon,
    gene.to.probeset, gene.to.transcript, genes.in.range, intergenic,
    intronic, is.exonic, is.intergenic, is.intronic, probes.in.range,
    probeset.to.exon, probeset.to.gene, probeset.to.probe,
    probeset.to.transcript, probesets.in.range, symbol.to.gene,
    transcript.details, transcript.to.exon, transcript.to.gene,
    transcript.to.probeset, transcripts.in.range

然后连接数据库。数据库的名称就是之前我们自己在databases.txt中定义的name

>xmap.connect("human")
Using /Users/jianhongou/.xmapcore as our configuration directory
Connected to xmapcore_homo_sapiens_60 (localhost)
Selected array 'HuEx-1_0' as a default.

我们可以先把探针当中属于exon的筛出来。

> ab.sigs.exon <- exonic(ab.sigs)
> cd.sigs.exon <- exonic(cd.sigs)
> length(ab.sigs.exon)
[1] 10161
> length(cd.sigs.exon)
[1] 4855
> length(ab.sigs)
[1] 13262
> length(cd.sigs)
[1] 6501

xmapcore发展出来很多可以用于映射的函数,下面这个列表就是其说明文件当中的;

xmapcore可以用于映射的函数
xmapcore可以用于映射的函数

我们实际来示范其中一例,因为是数据库查寻,所以会耗费很长时间,所以示例当中只取前几个来示范一下:

> ab.sigs.exon.transcript<-probeset.to.transcript(ab.sigs.exon[1:5])
> ab.sigs.exon.transcript
[1] "ENST00000377541" "ENST00000377532" "ENST00000361923" "ENST00000473653"
[5] "ENST00000463106"
> ab.sigs.exon.transcript<-probeset.to.transcript(ab.sigs.exon[1:5],as.vector=F) 
> ab.sigs.exon.transcript

我们可以得到以下的结果:

我们还可以使用split来轻松提取上图中的一些信息,比如我们想提取probeset的id号相对应的transcript的Ensembl的id号,以便我们和探针组的表达量捆绑成一个表。
我们先来看一下split如何使用:

> split(ab.sigs.exon.transcript[["stable_id"]],ab.sigs.exon.transcript[["IN1"]])
$`2318663`
[1] "ENST00000377541" "ENST00000377532" "ENST00000361923" "ENST00000473653"

$`2318673`
[1] "ENST00000377541" "ENST00000377532" "ENST00000361923"

$`2318675`
[1] "ENST00000377541" "ENST00000377532" "ENST00000361923" "ENST00000463106"

$`2318693`
[1] "ENST00000377532" "ENST00000361923"

$`2318696`
[1] "ENST00000377532" "ENST00000361923"

> split(ab.sigs.exon.transcript[["IN1"]],ab.sigs.exon.transcript[["stable_id"]])
$ENST00000361923
[1] "2318663" "2318673" "2318675" "2318693" "2318696"

$ENST00000377532
[1] "2318663" "2318673" "2318675" "2318693" "2318696"

$ENST00000377541
[1] "2318663" "2318673" "2318675"

$ENST00000463106
[1] "2318675"

$ENST00000473653
[1] "2318663"

实际操作一下:

> limma.sigs<-merge(as.matrix(limma.ab.sigs),as.matrix(limma.cd.sigs),by.x="V1",by.y="V1")
> dim(limma.sigs)
[1] 31687     1
> ab.limma.result<-topTable(fit2,coef=1,n=nrow(fit2))
> cd.limma.result<-topTable(fit2,coef=2,n=nrow(fit2))
> ab.limma.result[1:5,]
             ID     logFC  AveExpr         t      P.Value    adj.P.Val        B
921120  3455224 -6.196771 9.881612 -55.66352 3.135890e-18 4.200518e-12 28.52506
921302  3455431 -6.118445 9.687390 -51.94952 8.442633e-18 4.200518e-12 28.00089
1167748 3757166 -6.421536 8.650709 -51.74667 8.929743e-18 4.200518e-12 27.97012
825072  3337188 -6.387290 9.063700 -50.57334 1.240597e-17 4.376793e-12 27.78739
29034   2350522  6.027372 8.193881  49.06193 1.916356e-17 4.462122e-12 27.53959
> x.pli.eset<-exprs(x.pli)
> x.pli.eset[1:5,]
        0306_YH148_EX_exMCF7_r1.CEL 0406_YH156_EX_exMCF7_r2.CEL
2315101                    6.734836                    6.210889
2315102                    6.906976                    6.468061
2315103                    7.044645                    7.306846
2315104                    6.854274                    6.458891
2315105                    6.122687                    6.248681
        0406_YH157_EX_exMCF7_r3.CEL 0306_YH149_EX_exMCF10A_r1.CEL
2315101                    6.596643                      6.812924
2315102                    7.078068                      6.882564
2315103                    7.139287                      7.019080
2315104                    6.905860                      7.296486
2315105                    5.909402                      5.993334
        0406_YH158_EX_exMCF10A_r2.CEL 0406_YH159_EX_exMCF10A_r3.CEL
2315101                      6.309658                      6.217530
2315102                      6.710062                      6.671573
2315103                      7.313474                      7.352729
2315104                      6.884769                      6.835728
2315105                      6.105691                      6.321194
        0806_YH178_EX_100MCF7_r1.CEL 0806_YH179_EX_100MCF7_r2.CEL
2315101                     6.599960                     6.011707
2315102                     6.856438                     6.883725
2315103                     6.904066                     7.127221
2315104                     6.682055                     6.352926
2315105                     6.505144                     6.228677
        0806_YH180_EX_100MCF7_r3.CEL 0806_YH181_EX_100MCF10A_r1.CEL
2315101                     6.468076                       6.436096
2315102                     6.484725                       6.235714
2315103                     7.418980                       7.433090
2315104                     6.605767                       6.831317
2315105                     6.552830                       6.066675
        0806_YH182_EX_100MCF10A_r2.CEL 0806_YH183_EX_100MCF10A_r3.CEL
2315101                       6.116550                       6.205399
2315102                       6.785608                       6.868030
2315103                       7.248153                       7.581015
2315104                       6.362065                       6.551861
2315105                       6.484628                       6.269624
> ab.limma.result$AveExpr<-NULL
> cd.limma.result$AveExpr<-NULL
> x.pli.eset<-cbind(rownames(x.pli.eset),x.pli.eset)
> colnames(ab.limma.result)
[1] "ID"        "logFC"     "t"         "P.Value"   "adj.P.Val" "B"        
> colnames(ab.limma.result)[2:6]<-paste("ab",colnames(ab.limma.result)[2:6],sep=".")
> colnames(ab.limma.result)
[1] "ID"           "ab.logFC"     "ab.t"         "ab.P.Value"   "ab.adj.P.Val"
[6] "ab.B"        
> colnames(cd.limma.result)[2:6]<-paste("cd",colnames(cd.limma.result)[2:6],sep=".")
> colnames(cd.limma.result)
[1] "ID"           "cd.logFC"     "cd.t"         "cd.P.Value"   "cd.adj.P.Val"
[6] "cd.B"        
> limma.result<-merge(ab.limma.result,cd.limma.result,by.x="ID",by.y="ID")
> dim(limma.result)
[1] 1411189      11
> dim(ab.limma.result)
[1] 1411189       6
> dim(cd.limma.result)
[1] 1411189       6
> colnames(x.pli.eset)
 [1] ""                               "0306_YH148_EX_exMCF7_r1.CEL"   
 [3] "0406_YH156_EX_exMCF7_r2.CEL"    "0406_YH157_EX_exMCF7_r3.CEL"   
 [5] "0306_YH149_EX_exMCF10A_r1.CEL"  "0406_YH158_EX_exMCF10A_r2.CEL" 
 [7] "0406_YH159_EX_exMCF10A_r3.CEL"  "0806_YH178_EX_100MCF7_r1.CEL"  
 [9] "0806_YH179_EX_100MCF7_r2.CEL"   "0806_YH180_EX_100MCF7_r3.CEL"  
[11] "0806_YH181_EX_100MCF10A_r1.CEL" "0806_YH182_EX_100MCF10A_r2.CEL"
[13] "0806_YH183_EX_100MCF10A_r3.CEL"
> colnames(x.pli.eset)[1]<-"ID"
> colnames(x.pli.eset)
 [1] "ID"                             "0306_YH148_EX_exMCF7_r1.CEL"   
 [3] "0406_YH156_EX_exMCF7_r2.CEL"    "0406_YH157_EX_exMCF7_r3.CEL"   
 [5] "0306_YH149_EX_exMCF10A_r1.CEL"  "0406_YH158_EX_exMCF10A_r2.CEL" 
 [7] "0406_YH159_EX_exMCF10A_r3.CEL"  "0806_YH178_EX_100MCF7_r1.CEL"  
 [9] "0806_YH179_EX_100MCF7_r2.CEL"   "0806_YH180_EX_100MCF7_r3.CEL"  
[11] "0806_YH181_EX_100MCF10A_r1.CEL" "0806_YH182_EX_100MCF10A_r2.CEL"
[13] "0806_YH183_EX_100MCF10A_r3.CEL"
> limma.gep<-merge(limma.result,x.pli.eset,by.x="ID",by.y="ID")
> limma.sigs<-limma.sigs$V1
> limma.sigs.filtered<-exonic(limma.sigs)
> length(limma.sigs.filtered)
[1] 16569
> limma.sigs.gene<-probeset.to.gene(as.character(limma.sigs.filtered),as.vector=F)
> colnames(limma.sigs.gene)
[1] "IN1"                "stable_id"          "strand"            
[4] "biotype"            "status"             "description"       
[7] "db_display_name"    "symbol"             "symbol_description"
> limma.sigs.exon<-probeset.to.exon(as.character(limma.sigs.filtered),as.vector=F)
> colnames(limma.sigs.exon)
[1] "IN1"       "stable_id" "strand"    "phase"     "end_phase" "sequence" 
> limma.sigs.gene.s<-cbind(as.character(limma.sigs.gene$IN1),as.character(limma.sigs.gene$stable_id))
> summary(limma.sigs.gene.s)
       V1                      V2       
 2832769:   17   ENSG00000168702:   73  
 2832605:   10   ENSG00000137501:   57  
 2832617:   10   ENSG00000163629:   57  
 2832618:   10   ENSG00000138829:   56  
 2832619:   10   ENSG00000119285:   52  
 2832560:    4   ENSG00000146648:   52  
 (Other):14614   (Other)        :14328  
> limma.sigs.exon.s<-cbind(as.character(limma.sigs.exon$IN1),as.character(limma.sigs.exon$stable_id))
> colnames(limma.sigs.gene.s)=c("IN1","gene_id")
> colnames(limma.sigs.exon.s)=c("IN1","exon_id")
> limma.sigs.gene.exon<-merge(limma.sigs.gene.s,limma.sigs.exon.s,by.x="IN1",by.y="IN1")
> dim(limma.sigs.gene.exon)
[1] 28515     3
> dim(limma.sigs.gene.s)
[1] 14675     2
> dim(limma.sigs.exon.s)
[1] 27310     2
> limma.gep.map<-merge(limma.gep,limma.sigs.gene.exon,by.x="ID",by.y="IN1")
> colnames(limma.gep.map)
 [1] "ID"                             "ab.logFC"                      
 [3] "ab.t"                           "ab.P.Value"                    
 [5] "ab.adj.P.Val"                   "ab.B"                          
 [7] "cd.logFC"                       "cd.t"                          
 [9] "cd.P.Value"                     "cd.adj.P.Val"                  
[11] "cd.B"                           "0306_YH148_EX_exMCF7_r1.CEL"   
[13] "0406_YH156_EX_exMCF7_r2.CEL"    "0406_YH157_EX_exMCF7_r3.CEL"   
[15] "0306_YH149_EX_exMCF10A_r1.CEL"  "0406_YH158_EX_exMCF10A_r2.CEL" 
[17] "0406_YH159_EX_exMCF10A_r3.CEL"  "0806_YH178_EX_100MCF7_r1.CEL"  
[19] "0806_YH179_EX_100MCF7_r2.CEL"   "0806_YH180_EX_100MCF7_r3.CEL"  
[21] "0806_YH181_EX_100MCF10A_r1.CEL" "0806_YH182_EX_100MCF10A_r2.CEL"
[23] "0806_YH183_EX_100MCF10A_r3.CEL" "gene_id"                       
[25] "exon_id"  
> head(limma.gep.map)
       ID  ab.logFC       ab.t   ab.P.Value ab.adj.P.Val      ab.B  cd.logFC
1 2316685 -1.271696  -7.250734 3.571424e-06 1.688156e-04  4.457462 -2.328130
2 2316685 -1.271696  -7.250734 3.571424e-06 1.688156e-04  4.457462 -2.328130
3 2316686 -1.723314  -6.837043 6.959413e-06 2.887439e-04  3.771528 -1.732243
4 2316686 -1.723314  -6.837043 6.959413e-06 2.887439e-04  3.771528 -1.732243
5 2316728 -2.806349 -12.607441 3.534239e-09 6.581525e-07 11.508047 -2.877095
6 2316728 -2.806349 -12.607441 3.534239e-09 6.581525e-07 11.508047 -2.877095
        cd.t   cd.P.Value cd.adj.P.Val      cd.B 0306_YH148_EX_exMCF7_r1.CEL
1 -13.274125 1.777678e-09 4.459011e-07 12.198354            6.66720095808582
2 -13.274125 1.777678e-09 4.459011e-07 12.198354            6.66720095808582
3  -6.872466 6.567110e-06 3.117951e-04  3.865359            5.71487085699491
4  -6.872466 6.567110e-06 3.117951e-04  3.865359            5.71487085699491
5 -12.925266 2.537385e-09 5.883551e-07 11.845069             5.8335709236835
6 -12.925266 2.537385e-09 5.883551e-07 11.845069             5.8335709236835
  0406_YH156_EX_exMCF7_r2.CEL 0406_YH157_EX_exMCF7_r3.CEL
1            7.12286977374843            7.02936981804394
2            7.12286977374843            7.02936981804394
3            6.46763461194765            6.06050808336302
4            6.46763461194765            6.06050808336302
5            5.68249703349793            5.50837248196063
6            5.68249703349793            5.50837248196063
  0306_YH149_EX_exMCF10A_r1.CEL 0406_YH158_EX_exMCF10A_r2.CEL
1              8.23540227329305              8.17812317570771
2              8.23540227329305              8.17812317570771
3              7.83730980050119              7.77799079405764
4              7.83730980050119              7.77799079405764
5              9.01732493288772              8.13193299035441
6              9.01732493288772              8.13193299035441
  0406_YH159_EX_exMCF10A_r3.CEL 0806_YH178_EX_100MCF7_r1.CEL
1              8.22100347851285             5.70723526495675
2              8.22100347851285             5.70723526495675
3              7.79765502853488             5.66402291191698
4              7.79765502853488             5.66402291191698
5              8.29422973153101              5.1335892026865
6              8.29422973153101              5.1335892026865
  0806_YH179_EX_100MCF7_r2.CEL 0806_YH180_EX_100MCF7_r3.CEL
1             6.13085929461148             6.13202227159858
2             6.13085929461148             6.13202227159858
3             6.03051433988139             6.22950136530396
4             6.03051433988139             6.22950136530396
5             5.23439051131718             5.73423240849994
6             5.23439051131718             5.73423240849994
  0806_YH181_EX_100MCF10A_r1.CEL 0806_YH182_EX_100MCF10A_r2.CEL
1               8.58751002497943                8.3969944212826
2               8.58751002497943                8.3969944212826
3                8.0848947778275               7.02740001708546
4                8.0848947778275               7.02740001708546
5                8.4367690768553                8.3532107437173
6                8.4367690768553                8.3532107437173
  0806_YH183_EX_100MCF10A_r3.CEL         gene_id         exon_id
1               7.97000354395198 ENSG00000149527 ENSE00001338342
2               7.97000354395198 ENSG00000149527 ENSE00001937861
3               8.00847152535703 ENSG00000149527 ENSE00001906260
4               8.00847152535703 ENSG00000149527 ENSE00001338337
5               7.94351784045955 ENSG00000149527 ENSE00001826643
6               7.94351784045955 ENSG00000149527 ENSE00001432737

9 thoughts on “bioconductor系列教程之二简单的Affymetrix 外显子(exon)分析流程(上)

  1. xmapcore_homo_sapiens_60已下载在本机如何将其添加入mysql服务器。我的系统是win7,mysql也安装在本机上。谢谢。

  2. 您好,当安装”xmapcore”包的时候,提示说不适用于当前版本的R,不知道有没有解决办法?

  3. 欧老师,又来麻烦您了。这篇博客中的数据下载地址不能打开了,因为我的电脑内存不够,我在按照bioconductor系列教程之二简单的Affymetrix 外显子(exon)分析流程中(质量控制)步骤操作的时候,
    +- rawData/
    | +- /
    | | +- / <– must match exactly a chip type folder under annotationData/
    | | +- CEL files
    中的cel files可以用什么文件代替呢?

发表评论

电子邮件地址不会被公开。 必填项已用*标注