在2013年,MeSH ORA framework在Bioconductor发布了。它很恐怖的一次性发布了123多个包,其中有120个是数据包。MeSH可以做些什么呢?它可以把一个entrezID下所有的发布的文章列出来。你可以直接拿到摘要,甚至可以直接下载到文章。这就厉害了,你想想如果我拿出一堆差异表达的基因,以前只能做做什么GO富集分析啊,或者说是pathway富集分析啊。现在我们做个MeSH关键字富集分析如何?而MeSH ORA fromework就可以做到这一点。我们还可以想象,如果再复杂一点,我们通过做homologous,把不同种的同源蛋白可以非常简单地全部拿出来做关键词分析,也许可以扩展思路呢。以后他们会做些数据的图形化显示,比如tag cloud啊之类的。也许我们也可以做一些事情,比如meta-analyze一下文章的摘要。如果感兴趣,在回复中把自己的想法实现后的package链接给出来吧。
好了,本文就只是粗略地介绍一下如果做一堆基因的关键词富集分析。
# 安装包
library(BiocInstaller)
biocLite("meshr")
library(meshr)
## Loading required package: fdrtool
## Loading required package: Category
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, cbind, colnames, do.call, duplicated, eval,
## evalq, get, grep, grepl, intersect, is.unsorted, lapply,
## lengths, mapply, match, mget, order, paste, pmax, pmax.int,
## pmin, pmin.int, rank, rbind, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
##
## expand
## Loading required package: cummeRbund
## Loading required package: RSQLite
## Loading required package: DBI
## Loading required package: ggplot2
## Loading required package: reshape2
## Loading required package: fastcluster
##
## Attaching package: 'fastcluster'
## The following object is masked from 'package:stats':
##
## hclust
## Loading required package: rtracklayer
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: Gviz
## Loading required package: grid
##
## Attaching package: 'cummeRbund'
## The following object is masked from 'package:GenomicRanges':
##
## promoters
## The following object is masked from 'package:IRanges':
##
## promoters
## The following object is masked from 'package:Biobase':
##
## samples
## The following object is masked from 'package:BiocGenerics':
##
## conditions
## Loading required package: org.Hs.eg.db
##
## Loading required package: MeSH.db
## Loading required package: MeSHDbi
##
## Attaching package: 'MeSHDbi'
## The following object is masked from 'package:utils':
##
## packageName
## Loading required package: MeSH.AOR.db
## Loading required package: MeSH.PCR.db
## Loading required package: MeSH.Hsa.eg.db
## Loading required package: MeSH.Aca.eg.db
## Loading required package: MeSH.Bsu.168.eg.db
## Loading required package: MeSH.Syn.eg.db
data(geneid.cummeRbund) # 所有的基因,
data(sig.geneid.cummeRbund) # 差异表达的基因
# 我们看一下都有多少条
dim(geneid.cummeRbund)
## [1] 303 2
dim(sig.geneid.cummeRbund)
## [1] 104 2
head(geneid.cummeRbund)
## SYMBOL ENTREZID
## 1 OR4F5 79501
## 2 OR4F16 81399
## 3 SAMD11 148398
## 4 KLHL17 339451
## 5 PLEKHN1 84069
## 6 ISG15 9636
head(sig.geneid.cummeRbund)
## SYMBOL ENTREZID
## 1 KLHL17 339451
## 2 PLEKHN1 84069
## 3 AGRN 375790
## 4 TTLL10 254173
## 5 SCNN1D 6339
## 6 VWA1 64856
# 需要载入一些包来进行计算。
library("fdrtool")
library("MeSH.Hsa.eg.db") # 这个Hsa就是Hsapiens的前三个字母。
# 和topGO类似的,它也需要先构建一个MeSHHyperGParams object来指定富集时的参数。
meshParams <- new("MeSHHyperGParams", #
geneIds = sig.geneid.cummeRbund[, 2], #
universeGeneIds = geneid.cummeRbund[, 2], #
annotation = "MeSH.Hsa.eg.db", #
category = "C", # 类似于GO的BP, MF, CC之类的分类
database = "gendoo", #可以是gendoo, gene2pubmed或者RBBH
pvalueCutoff = 0.05, #
pAdjust = "none") # pAdjust设置成none只是因为这个例子很小,大多数的时候,我们可以把它设置成BH,QV或者FDR
meshR <- meshHyperGTest(meshParams) # 超几何分布算一下pvale, 然后通过p value来过滤结果
meshR # 看一下它是个什么对象,是不是和topGO很类似?
## MeSH enrichment analysis for category Diseases
## Annotation package used: MeSH.Hsa.eg.db
## The correspondance is retrived from: gendoo
## Number of MeSH terms identified: 9
head(summary(meshR))
## MESHID Pvalue OddsRatio ExpCount Count Size
## 15774 D009361 0.0002619206 3.942771 11.32673 21 33
## 15775 D009361 0.0002619206 3.942771 11.32673 21 33
## 15776 D009361 0.0002619206 3.942771 11.32673 21 33
## 15777 D009361 0.0002619206 3.942771 11.32673 21 33
## 15778 D009361 0.0002619206 3.942771 11.32673 21 33
## 15779 D009361 0.0002619206 3.942771 11.32673 21 33
## MESHTERM GENEID SOURCEID
## 15774 Neoplasm Invasiveness 2069 <NA>
## 15775 Neoplasm Invasiveness 3106 <NA>
## 15776 Neoplasm Invasiveness 145957 <NA>
## 15777 Neoplasm Invasiveness 5579 <NA>
## 15778 Neoplasm Invasiveness 843 <NA>
## 15779 Neoplasm Invasiveness 3294 <NA>
我们来把类换成G (Phenomena and Processes).这些类别有:
Abbreviations | Categories | No. of terms | Examples |
---|---|---|---|
A |
Anatomy |
|
Muscles, Skeleton |
B |
Organisms |
|
Gram-negative Bacteria |
C |
Diseases |
|
Leukemia, Burns |
D |
Chemicals and Drugs |
|
Fatty Acids, Ligases |
E |
Analytical Diagnostic and Therapeutic |
|
Dental Care |
Techniques |
|||
F |
Psychiatry and Psychology |
|
Behavior, Motivation |
G |
Phenomena and Processes |
|
Antibody Formation |
H |
Disciplines and Occupations |
|
Biology, Clinical Medicine |
I |
Anthropology, Education, Sociology |
|
Economics, Culture |
Social |
|||
J |
Technology and Food and Beverages |
|
Bread, Coffee, Tea |
K |
Humanities |
|
Music, Religion |
L |
Information Science |
|
Communications Media |
M |
Persons |
|
Adult Children, Drug Users |
N |
Health Care |
|
Oral Health, Women’s Health |
V |
Publication Type |
|
Book Illustrations, Letter |
Z |
Geographical Locations |
|
Japan, China, Taiwan |
category(meshParams) <- "G"
database(meshParams) <- "gene2pubmed"
meshR <- meshHyperGTest(meshParams)
meshR
## MeSH enrichment analysis for category Phenomena and Processes
## Annotation package used: MeSH.Hsa.eg.db
## The correspondance is retrived from: gene2pubmed
## Number of MeSH terms identified: 9
head(summary(meshR))
## MESHID Pvalue OddsRatio ExpCount Count Size
## 1533148 D060066 4.585389e-06 3.256735 60.75248 79 177
## 1533149 D060066 4.585389e-06 3.256735 60.75248 79 177
## 1533150 D060066 4.585389e-06 3.256735 60.75248 79 177
## 1533151 D060066 4.585389e-06 3.256735 60.75248 79 177
## 1533152 D060066 4.585389e-06 3.256735 60.75248 79 177
## 1533153 D060066 4.585389e-06 3.256735 60.75248 79 177
## MESHTERM GENEID SOURCEID
## 1533148 Protein Interaction Maps 5901 22939629
## 1533149 Protein Interaction Maps 8061 21988832
## 1533150 Protein Interaction Maps 6122 23443559
## 1533151 Protein Interaction Maps 9730 23443559
## 1533152 Protein Interaction Maps 389114 26186194
## 1533153 Protein Interaction Maps 6146 21853274
我们再来看一组例子
library(mesh)
library(fdrtool)
library(MeSH.Hsa.eg.db)
allege <- select(MeSH.Hsa.eg.db, columns="GENEID", keytype="SOURCED", keys="gene2pubmed")
siggene <- sig.geneid.cummeRbund[, 2]
meshParams <- new("MeSHHyperGParams",
geneIds = siggene,
universeGeneIds = allgene,
annotation = "MeSH.Hsa.eg.db",
category = "G",
database = "gene2pubmed",
pvalueCutoff = 0.05,
pAdjust = "QV")
output <- meshHyperGTest(meshParams)
save.pdf(output) # 输出所有的文献,慎用。