生物信息学生R入门教程–R编程

本教程版权为《糗世界》所有,任何组织或个人不得未经书面许可转载。

在阅读本节前请自测,如果全部都知道正确答案,则可跳过本节。

  • 常用函数是否全部掌握(见下表)?
  • 请编写一个函数并达到自己的设计要求。
  • 为什么R应该尽量避免for循环?
  • 什么是引用类(refference class)

[……]

Read more

GO/KEGG富集分析enrichment analysis

本教程使用GOstats来对给定的基因symbols做富集分析。

> library("org.Hs.eg.db")
> library("GSEABase")
> library("GOstats")
> genes <- c("AREG", "FKBP5", "CXCL13", "KLF9", "ZC3H12A", "P4HA1", "TLE1", "CREB3L2", "TXNIP", "PBX1", "GJA1", "ITGB8", "CCL3", "CCND2", "KCNJ15", "CFLAR", "CXCL10", "CYSLTR1", "IGFBP7", "RHOB", "MAP3K5", "CAV2", "CAPN2", "AKAP13", "RND3", "IL6ST", "RGS1", "IRF4", "G3BP1", "SEL1L", "VEGFA", "SMAD1", "CCND1", "CLEC3B", "NEB", "AMD1", "PDCD4", "SCD", "TM2D3", "BACH2", "LDLR", "BMPR1B", "RFXAP", "ASPH", "PTK2B", "SLC1A5", "ENO2", "TRPM8", "SATB1", "MIER1", "SRSF1", "ATF3", "CCL5", "MCM6", "GCH1", "CAV1", "SLC20A1")
> ##GO BP enrichment analysis
> goAnn <- get("org.Hs.egGO")
> universe <- Lkeys(goAnn)
> head(universe)
[1] "1"         "10"        "100"       "1000"      "10000"     "100008586"
> entrezIDs <- mget(genes, org.Hs.egSYMBOL2EG, ifnotfound=NA)
> head(entrezIDs)
$AREG
[1] "374"
 
$FKBP5
[1] "2289"
 
$CXCL13
[1] "10563"
 
$KLF9
[1] "687"
 
$ZC3H12A
[1] "80149"
 
$P4HA1
[1] "5033"
 
> entrezIDs <- as.character(entrezIDs)
> head(entrezIDs)
[1] "374"   "2289"  "10563" "687"   "80149" "5033" 
> params <- new("GOHyperGParams",
+                  geneIds=entrezIDs,
+                  universeGeneIds=universe,
+                  annotation="org.Hs.eg.db",
+                  ontology="BP",
+                  pvalueCutoff=0.01,
+                  conditional=FALSE,
+                  testDirection="over")
> over <- hyperGTest(params)
> library(Category)
> glist <- geneIdsByCategory(over)
> glist <- sapply(glist, function(.ids) {
+ 	.sym <- mget(.ids, envir=org.Hs.egSYMBOL, ifnotfound=NA)
+ 	.sym[is.na(.sym)] <- .ids[is.na(.sym)]
+ 	paste(.sym, collapse=";")
+ 	})
> head(glist)
                                                                                                                                                                                         GO:0002690 
                                                                                                                                                              "PTK2B;CXCL10;CCL3;CCL5;VEGFA;CXCL13" 
                                                                                                                                                                                         GO:0002688 
                                                                                                                                                              "PTK2B;CXCL10;CCL3;CCL5;VEGFA;CXCL13" 
                                                                                                                                                                                         GO:0048518 
"RHOB;ASPH;ATF3;CCND1;BMPR1B;CAV1;CAV2;CCND2;PTK2B;GJA1;IL6ST;CXCL10;IRF4;LDLR;SMAD1;MAP3K5;PBX1;RFXAP;CCL3;CCL5;SLC20A1;TLE1;CLEC3B;VEGFA;CFLAR;CXCL13;TXNIP;CYSLTR1;AKAP13;MIER1;CREB3L2;ZC3H12A" 
                                                                                                                                                                                         GO:1901623 
                                                                                                                                                                           "PTK2B;CCL3;CCL5;CXCL13" 
                                                                                                                                                                                         GO:0002687 
                                                                                                                                                              "PTK2B;CXCL10;CCL3;CCL5;VEGFA;CXCL13" 
                                                                                                                                                                                         GO:2000403 
                                                                                                                                                                           "PTK2B;CCL3;CCL5;CXCL13" 
> bp <- summary(over)
> bp$Symbols <- glist[as.character(bp$GOBPID)]
> head(bp)
      GOBPID       Pvalue OddsRatio    ExpCount Count Size                                        Term
1 GO:0002690 3.490023e-08  39.24490  0.19267044     6   52 positive regulation of leukocyte chemotaxis
2 GO:0002688 9.278865e-08  32.80297  0.22601725     6   61          regulation of leukocyte chemotaxis
3 GO:0048518 1.170957e-07   4.28014 13.56103476    32 3660   positive regulation of biological process
4 GO:1901623 1.176400e-07 128.80174  0.04816761     4   13         regulation of lymphocyte chemotaxis
5 GO:0002687 1.496978e-07  30.05918  0.24454325     6   66  positive regulation of leukocyte migration
6 GO:2000403 2.969857e-07  96.58170  0.05928321     4   16 positive regulation of lymphocyte migration
                                                                                                                                                                                            Symbols
1                                                                                                                                                               PTK2B;CXCL10;CCL3;CCL5;VEGFA;CXCL13
2                                                                                                                                                               PTK2B;CXCL10;CCL3;CCL5;VEGFA;CXCL13
3 RHOB;ASPH;ATF3;CCND1;BMPR1B;CAV1;CAV2;CCND2;PTK2B;GJA1;IL6ST;CXCL10;IRF4;LDLR;SMAD1;MAP3K5;PBX1;RFXAP;CCL3;CCL5;SLC20A1;TLE1;CLEC3B;VEGFA;CFLAR;CXCL13;TXNIP;CYSLTR1;AKAP13;MIER1;CREB3L2;ZC3H12A
4                                                                                                                                                                            PTK2B;CCL3;CCL5;CXCL13
5                                                                                                                                                               PTK2B;CXCL10;CCL3;CCL5;VEGFA;CXCL13
6                                                                                                                                                                            PTK2B;CCL3;CCL5;CXCL13
> dim(bp)
[1] 525   8
> ##KEGG enrichment analysis
> keggAnn <- get("org.Hs.egPATH")
> universe <- Lkeys(keggAnn)
> params <- new("KEGGHyperGParams", 
+                     geneIds=entrezIDs, 
+                     universeGeneIds=universe, 
+                     annotation="org.Hs.eg.db", 
+                     categoryName="KEGG", 
+                     pvalueCutoff=0.01,
+                     testDirection="over")
> over <- hyperGTest(params)
> kegg <- summary(over)
> library(Category)
> glist <- geneIdsByCategory(over)
> glist <- sapply(glist, function(.ids) {
+ 	.sym <- mget(.ids, envir=org.Hs.egSYMBOL, ifnotfound=NA)
+ 	.sym[is.na(.sym)] <- .ids[is.na(.sym)]
+ 	paste(.sym, collapse=";")
+ 	})
> kegg$Symbols <- glist[as.character(kegg$KEGGID)]
> kegg
  KEGGID       Pvalue OddsRatio ExpCount Count Size                                   Term                                    Symbols
1  04510 9.643801e-05  7.873256 1.124361     7  200                         Focal adhesion    CCND1;CAPN2;CAV1;CAV2;CCND2;ITGB8;VEGFA
2  04060 5.487123e-04  5.821855 1.489779     7  265 Cytokine-cytokine receptor interaction BMPR1B;IL6ST;CXCL10;CCL3;CCL5;VEGFA;CXCL13
3  04062 3.739699e-03  5.486219 1.062521     5  189            Chemokine signaling pathway              PTK2B;CXCL10;CCL3;CCL5;CXCL13
> library("pathview")
> gIds <- mget(genes, org.Hs.egSYMBOL2EG, ifnotfound=NA)
> gEns <- unlist(gIds)
> gene.data <- rep(1, length(gEns))
> names(gene.data) <- gEns
> for(i in 1:3){pv.out <- pathview(gene.data, pathway.id=as.character(kegg$KEGGID)[i], species="hsa", out.suffix="pathview", kegg.native=T)}

写成函数

enrich <- function(entrezIDs, orgDbName="org.Hs.eg.db", pvalueCutoff=.01){
	require(orgDbName, character.only=TRUE)
	require("GSEABase")
	require("GOstats")
	require("Category")
	goAnn <- get(gsub(".db", "GO", orgDbName))
	universe <- Lkeys(goAnn)
	onto <- c("BP", "MF", "CC")
	res <- lapply(onto, function(.onto){
		param <- new('GOHyperGParams',
					 geneIds= entrezIDs,
					 universeGeneIds=universe,
					 annotation=orgDbName,
					 ontology=.onto,
					 pvalueCutoff=pvalueCutoff,
					 conditional=FALSE,
					 testDirection="over")
		over <- hyperGTest(param)
		glist <- geneIdsByCategory(over)
		glist <- sapply(glist, function(.ids) {
			.sym <- mget(.ids, envir=get(gsub(".db", "SYMBOL", orgDbName)), ifnotfound=NA)
			.sym[is.na(.sym)] <- .ids[is.na(.sym)]
			paste(.sym, collapse=";")
		})
		summary <- summary(over)
		if(nrow(summary)>1) summary$Symbols <- glist[as.character(summary[, 1])]
		summary
	})
	names(res) <- onto
	keggAnn <- get(gsub(".db", "PATH", orgDbName))
	universe <- Lkeys(keggAnn)
	param <- new("KEGGHyperGParams",
				 geneIds=entrezIDs,
				 universeGeneIds=universe,
				 annotation=orgDbName,
				 categoryName="KEGG",
				 pvalueCutoff=pvalueCutoff,
				 testDirection="over")
	over <- hyperGTest(param)
	kegg <- summary(over)
	glist <- geneIdsByCategory(over)
	glist <- sapply(glist, function(.ids) {
		.sym <- mget(.ids, envir=get(gsub(".db", "SYMBOL", orgDbName)), ifnotfound=NA)
		.sym[is.na(.sym)] <- .ids[is.na(.sym)]
		paste(.sym, collapse=";")
	})
	kegg$Symbols <- glist[as.character(kegg$KEGGID)]
	res[["kegg"]] <- kegg
	res
}
 
go <- enrich(entrezIDs, "org.Hs.eg.db", .05)

如何使用bioconductor做liftover

当我们有不同版本的bed文件,比如说有的bed文件map到hg18, 有的bed文件map到hg19,或者对于小鼠来说有些文件map到mm9,有些文件map到mm10。我们如何实现对比对到不同基因组版本的数据进行比较呢?在比较之前,第一个步骤就是将不同的版本的比对文件统一至一个版本,这就需要使用到UCSC提供的liftOver工具。其可执行工具在:http://hgdownload.cse.ucs[……]

Read more

使用Bioconductor下载GEO(Gene Expression Omnibus)上的数据

本文以GSE46106数据为例,讲述如何从GEO上下载数据。分为两个方面,第一,直接从GEO上下载表达数据,第二,直接从GEO上下载CEL文件。

首先我们来看第一种。

> library(GEOquery)
> gset <- getGEO("GSE46106", GSEMatrix =TRUE)
Found 1 file(s)
GSE46106_series_matrix.txt.gz
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE46nnn/GSE46106/matrix/GSE46106_series_matrix.txt.gz'
ftp data connection made, file length 4110183 bytes
opened URL
==================================================
downloaded 3.9 Mb
 
File stored at: 
/var/folders/n4/11sc2xz13k56hl85z_h2rgq00000gn/T//RtmpBFQdpL/GPL570.soft
> length(gset)
[1] 1
> gset <- gset[[1]]
> head(pData(gset)[,1:5])
> # load NCBI platform annotation
> gpl <- annotation(gset)
> platf <- getGEO(gpl, AnnotGPL=TRUE)
File stored at: 
/var/folders/n4/11sc2xz13k56hl85z_h2rgq00000gn/T//RtmpBFQdpL/GPL570.annot.gz
There were 30 warnings (use warnings() to see them)
> ncbifd <- data.frame(attr(dataTable(platf), "table"))
> eset <- exprs(gset)
> head(eset[,1:5])
          GSM1123782 GSM1123783 GSM1123784 GSM1123785 GSM1123786
1007_s_at    10.1689    10.5247    10.8179    10.3539    10.4964
1053_at       9.6002     7.9436     9.8653     9.9733    10.1960
117_at        5.6808     5.0301     3.7654     2.7751     2.8179
121_at        4.2268     4.6148     4.6147     4.4977     4.6147
1255_g_at     2.1869     2.1869     2.1869     2.1869     2.1869
1294_at       2.1874     2.1874     2.1874     2.1874     2.1874
> head(ncbifd[,1:5])
         ID                                  Gene.title Gene.symbol Gene.ID UniGene.title
1 1007_s_at discoidin domain receptor tyrosine kinase 1        DDR1     780          <NA>
2   1053_at replication factor C (activator 1) 2, 40kDa        RFC2    5982          <NA>
3    117_at        heat shock 70kDa protein 6 (HSP70B')       HSPA6    3310          <NA>
4    121_at                                paired box 8        PAX8    7849          <NA>
5 1255_g_at     guanylate cyclase activator 1A (retina)      GUCA1A    2978          <NA>
6   1294_at ubiquitin-like modifier activating enzyme 7        UBA7    7318          <NA>

接着是第二种。

> library(GEOquery)
> getGEOSuppFiles("GSE46106")
[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE46nnn/GSE46106/suppl/"
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE46nnn/GSE46106/suppl//GSE46106_RAW.tar'
ftp data connection made, file length 399247360 bytes
opened URL
===============================================
downloaded 380.8 Mb
 
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE46nnn/GSE46106/suppl//filelist.txt'
ftp data connection made, file length 3308 bytes
opened URL
==================================================
downloaded 3308 bytes
> setwd("GSE46106/")
> dir()
[1] "filelist.txt"     "GSE46106_RAW.tar"
> untar("GSE46106_RAW.tar")
> files <- dir(pattern="gz$")
> sapply(files, gunzip)
GSM1123782_2147TG11_U133plus2.CEL.gz GSM1123783_2147TG15_U133plus2.CEL.gz  GSM1123784_2147TG1_U133plus2.CEL.gz 
                            33380229                             32529207                             33390958 
 GSM1123785_2147TG6_U133plus2.CEL.gz  GSM1123786_2277TG1_U133plus2.CEL.gz  GSM1123787_2277TG9_U133plus2.CEL.gz 
                            33210730                             33082826                             33531908 
GSM1123788_2665TG15_U133plus2.CEL.gz  GSM1123789_2665TG1_U133plus2.CEL.gz  GSM1123790_2665TG6_U133plus2.CEL.gz 
                            32232774                             33380782                             33432878 
 GSM1123791_3104TG1_U133plus2.CEL.gz  GSM1123792_3104TG3_U133plus2.CEL.gz  GSM1123793_3107TG1_U133plus2.CEL.gz 
                            32850644                             32588606                             32584342 
 GSM1123794_3107TG5_U133plus2.CEL.gz  GSM1123795_3143TG1_U133plus2.CEL.gz  GSM1123796_3143TG5_U133plus2.CEL.gz 
                            33358932                             32369391                             32806108 
 GSM1123797_3204TG1_U133plus2.CEL.gz  GSM1123798_3204TG5_U133plus2.CEL.gz  GSM1123799_3561TG1_U133plus2.CEL.gz 
                            32557782                             33016324                             32883029 
 GSM1123800_3561TG5_U133plus2.CEL.gz  GSM1123801_3611TG1_U133plus2.CEL.gz  GSM1123802_3611TG5_U133plus2.CEL.gz 
                            33454274                             32657648                             33126603 
 GSM1123803_3613TG1_U133plus2.CEL.gz  GSM1123804_3613TG5_U133plus2.CEL.gz  GSM1123805_3807TG1_U133plus2.CEL.gz 
                            32581902                             33193826                             32791917 
 GSM1123806_3807TG5_U133plus2.CEL.gz  GSM1123807_3824TG2_U133plus2.CEL.gz  GSM1123808_3887TG1_U133plus2.CEL.gz 
                            33266865                             33078422                             32479022 
 GSM1123809_3887TG5_U133plus2.CEL.gz  GSM1123810_3904TG1_U133plus2.CEL.gz  GSM1123811_3904TG5_U133plus2.CEL.gz 
                            33291287                             32645537                             33425796 
 GSM1123812_3936TG1_U133plus2.CEL.gz  GSM1123813_3936TG5_U133plus2.CEL.gz  GSM1123814_3963TG1_U133plus2.CEL.gz 
                            32939923                             33317196                             32725410 
 GSM1123815_3963TG5_U133plus2.CEL.gz  GSM1123816_4013TG1_U133plus2.CEL.gz  GSM1123817_4169TG2_U133plus2.CEL.gz 
                            33276082                             33151406                             33167461 
 GSM1123818_4175TG1_U133plus2.CEL.gz  GSM1123819_4272TG1_U133plus2.CEL.gz  GSM1123820_4400TG1_U133plus2.CEL.gz 
                            33139829                             33111635                             33206021 
 GSM1123821_4400TG5_U133plus2.CEL.gz  GSM1123822_4664TG1_U133plus2.CEL.gz  GSM1123823_4849TG1_U133plus2.CEL.gz 
                            33026806                             33364390                             33300210 
 GSM1123824_4888TG1_U133plus2.CEL.gz  GSM1123825_4913TG1_U133plus2.CEL.gz 
                            33299396                             33207528 
> filelist <- dir(pattern="CEL$")
> library(affy)
> library(annotate)
> data <- ReadAffy(filenames=filelist)
> affydb<-annPkgName(data@annotation,type="db")
> require(affydb, character.only=TRUE)
> eset<-rma(data,verbose=FALSE)
> eset.e <- exprs(eset)
> library(annaffy)
> symbols<-as.character(aafSymbol(as.character(rownames(eset)),affydb))
> genes<-as.character(aafUniGene(as.character(rownames(eset)),affydb))

视频链接:
youtube

土豆网[……]

Read more

如何在R中使用超几何分布计算p-value

问题,假设有两组ChIP-seq的数据A和B,我想知道它们之间是否相关。A组一共有50个数据,B组一共有60个数据,它们之间重叠的数据有30个,总计的binding位点如果是200个,那么如果它们相关的p-value如何计算呢?这个问题其实就是转变成抽到重叠数比30或者比30大的可能性是多大。所以我们需要计算抽到重叠数为29的可能性,然后用1减去它就可以了。
公式:p.value = phype[……]

Read more

如何让R自动保存历史记录(auto save R history before crash)

也许第一眼看到这个题目,您可能并不认为这是一个问题,因为大家都知道,R会自动在当前工作目录保存名为.Rhistory或者.Rapp.history的文件。这个文件中就是R自动保存的历史记录。但是这种自动保存的历史记录有两个缺点,第一,它只在正常退出的时候保存;第二,它有行数限制。

我提出的这个问题,其实是因为我经常遇到R崩溃的情况,或者R无限耗用内存的情况。在这些情况下,R无法正常退出,也就[……]

Read more

如何获取R中代码的运行效率

对于如何测试R代码的运行效率,首先可能想到的就是proc.time()以及system.time()了。

我们来看代码:

> ptm <- proc.time()
> for (i in 1:50) mad(stats::runif(500))
> proc.time() - ptm
   user  system elapsed 
  0.022   0.002   0.038 
> system.time(for(i in 1:100) mad(runif(1000)))
   user  system elapsed 
  0.025   0.002   0.028

这两个函数都只适合测试一整块代码的运行效率。

还有可以想到的工具应该就是rbenchmark了。

> random.array = function(rows, cols, dist=rnorm) 
+                   array(dist(rows*cols), c(rows, cols))
> random.replicate = function(rows, cols, dist=rnorm)
+                       replicate(cols, dist(rows))
> benchmark(replications=rep(100, 3),
+           random.array(100, 100),
+           random.array(100, 100),
+           columns=c('test', 'elapsed', 'replications'))
                    test elapsed replications
1 random.array(100, 100)   0.093          100
2 random.array(100, 100)   0.094          100
3 random.array(100, 100)   0.093          100
4 random.array(100, 100)   0.096          100
5 random.array(100, 100)   0.092          100
6 random.array(100, 100)   0.093          100

rb[……]

Read more

使用BiomaRt获得在线注释信息

完整的生物信息学分析步骤往往会包含注释工作。在Bioconductor中,最方便的办法是使用注释包。注释资源除了以包的形式进行封装外,还可以通过诸如BiomaRt等工具获取在线的注释数据。使用在线资源为我们提供了更加及时以及丰富的注释资源。那么,什么是BiomaRt呢?如何理解和使用BiomaRt呢?

为了更好的理解和掌握biomaRt,我们可以先通过在线资源来了解一下它的原型biomart[……]

Read more

使用RStudio结合devtools开发bioconductor扩展包

在之前的博文如何创建一个R扩展包, 以及R扩展包的写作规范对如何写R扩展包有了初步的介绍。但是当我们需要开发大型的扩展包的时候,完全依靠之前的介绍会让人觉得有些力不从心。并且在扩展包提交给bioconductor之后的2次或者3次开发也并不是那么容易。为此,本文进一步介绍如何使用RStudio结合devtools来开发及维护bioconductor扩展包。

事前依照之前的博文准备好hello[……]

Read more