第七章 下游分析
Chapter 7: downstream analysis

富集分析
Enrichment analysis

富集分析是一种常见的下游分析手段。简单的讲,富集分析就是一种统计分析的手段,用来筛选功能相类的一组基因是否富集中差异表达的基因中。当人们拿到了差异表达的基因时,很多时候因为差异表达的基因数量很多,面对这么多的基因,人们不知道如何找到合适的突破口进行下游的验证实验。于是从一堆差异表达的基因中找出有意义的基因进行RT-PCR验证以及基因敲除验证就需要使用到富集分析了。

在R/Bioconductor中最为常用的软件包就是GOstats[1]了。也有很多优秀的软件包,比如goProfiles, clusterProfiler等。但是笔者个人认为,后来涌现的这些软件的健壮性都不如GOstats。所以在此,还是使用GOstats做为例子,示范如何从基因名来进行富集分析。

对于富集分析,需要注意的问题是,如何选择正确的库(library, 或者说是universe gene id)。正确的库的大小的选择决定着分析的结果是否可靠。比如说,如果对Targeted rna-seq结果还使用全基因组做为universeGeneIDs参数的话,就是明显错误的,在这种情况下,就应该输入你所选择的基因做为universeGeneIDs进行计算。很多时候有人会问,在差异表达分析的过程中,我们常常会发现一些新的基因,这些基因是不是也应该考虑进来?答案是否定的。因为只有获得分组注释的基因才是我们应该考虑的对象,对于还没有注释的基因,是无需要考虑的。

还有一种特殊的富集分析手段叫Gene Set Enrichment Analysis GSEA. 它是一个java应用,非常方便,因为不是R的,所以我就不示例了。使用GSEA时要注意两条,1,如果是RNA-seq的话,要使用pre-ranked GSEA。2, GSEA并没有推出human以外的物种数据,所以只能通过分析同源基因的方式来使用GSEA。

下面就是示例代码。

enrich <- function(entrezIDs, orgDbName="org.Hs.eg.db", pvalueCutoff=.01){
    require(orgDbName, character.only=TRUE)
    require("GSEABase")
    require("GOstats")
    require("Category")
    goAnn <- get(gsub(".db", "GO", orgDbName))
    universe <- Lkeys(goAnn)
    onto <- c("BP", "MF", "CC")
    res <- lapply(onto, function(.onto){
        param <- new('GOHyperGParams',
                     geneIds= entrezIDs,
                     universeGeneIds=universe,
                     annotation=orgDbName,
                     ontology=.onto,
                     pvalueCutoff=pvalueCutoff,
                     conditional=FALSE,
                     testDirection="over")
        over <- hyperGTest(param)
        glist <- geneIdsByCategory(over)
        glist <- sapply(glist, function(.ids) {
            .sym <- mget(.ids, envir=get(gsub(".db", "SYMBOL", orgDbName)), ifnotfound=NA)
            .sym[is.na(.sym)] <- .ids[is.na(.sym)]
            paste(.sym, collapse=";")
        })
        summary <- summary(over)
        if(nrow(summary)>1) summary$Symbols <- glist[as.character(summary[, 1])]
        summary
    })
    names(res) <- onto
    keggAnn <- get(gsub(".db", "PATH", orgDbName))
    universe <- Lkeys(keggAnn)
    param <- new("KEGGHyperGParams",
                 geneIds=entrezIDs,
                 universeGeneIds=universe,
                 annotation=orgDbName,
                 categoryName="KEGG",
                 pvalueCutoff=pvalueCutoff,
                 testDirection="over")
    over <- hyperGTest(param)
    kegg <- summary(over)
    glist <- geneIdsByCategory(over)
    glist <- sapply(glist, function(.ids) {
        .sym <- mget(.ids, envir=get(gsub(".db", "SYMBOL", orgDbName)), ifnotfound=NA)
        .sym[is.na(.sym)] <- .ids[is.na(.sym)]
        paste(.sym, collapse=";")
    })
    kegg$Symbols <- glist[as.character(kegg$KEGGID)]
    res[["kegg"]] <- kegg
    res
}
library("org.Hs.eg.db")
genes <- c("AREG", "FKBP5", "CXCL13", "KLF9", "ZC3H12A", "P4HA1", "TLE1", 
           "CREB3L2", "TXNIP", "PBX1", "GJA1", "ITGB8", "CCL3", "CCND2", 
           "KCNJ15", "CFLAR", "CXCL10", "CYSLTR1", "IGFBP7", "RHOB", 
           "MAP3K5", "CAV2", "CAPN2", "AKAP13", "RND3", "IL6ST", "RGS1", 
           "IRF4", "G3BP1", "SEL1L", "VEGFA", "SMAD1", "CCND1", "CLEC3B", 
           "NEB", "AMD1", "PDCD4", "SCD", "TM2D3", "BACH2", "LDLR", "BMPR1B", 
           "RFXAP", "ASPH", "PTK2B", "SLC1A5", "ENO2", "TRPM8", "SATB1", 
           "MIER1", "SRSF1", "ATF3", "CCL5", "MCM6", "GCH1", "CAV1", "SLC20A1")
goAnn <- get("org.Hs.egGO")
entrezIDs <- mget(genes, org.Hs.egSYMBOL2EG, ifnotfound=NA)
go <- enrich(entrezIDs, "org.Hs.eg.db", .05)
## 
## KEGG.db contains mappings based on older data because the original
##   resource was removed from the the public domain before the most
##   recent update was produced. This package should now be
##   considered deprecated and future versions of Bioconductor may
##   not have it available.  Users who want more current data are
##   encouraged to look at the KEGGREST or reactome.db packages

对于富集分析,有很多包对上面的步骤进行了包装,使得人们更容易掌握。比较常用的软件包有clusterProfiler。下面就是对这个包的代码示例。

Gene Ontology

library(clusterProfiler)
library(org.Hs.eg.db)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
gene.df <- bitr(gene, fromType = "ENTREZID",
        toType = c("ENSEMBL", "SYMBOL"),
        OrgDb = org.Hs.eg.db)
ego <- enrichGO(gene          = gene,
                universe      = names(geneList),
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

KEGG

kk <- enrichKEGG(gene         = gene,
                 organism     = 'hsa',
                 pvalueCutoff = 0.05)

GSEA

ggo <- gseGO(geneList     = geneList,
              OrgDb        = org.Hs.eg.db,
              ont          = "CC",
              nPerm        = 1000,
              minGSSize    = 100,
              maxGSSize    = 500,
              pvalueCutoff = 0.05,
              verbose      = FALSE)

gkk <- gseKEGG(geneList     = geneList,
               organism     = 'hsa',
               nPerm        = 1000,
               minGSSize    = 120,
               pvalueCutoff = 0.05,
               verbose      = FALSE)

热图
heatmap

我们在分析了差异表达数据之后,经常要生成一种直观图--热图(heatmap)。热图是一种三维数据转二维的表达手段,它通过使用不同的颜色来表达本应该在第三维上显示的数据。热图的方便的于,它可以在一张二维图像上快速的通过颜色的差异来了解组内的一致性及差异组间的差别。

热图的软件包有很多,常用的有pheatmap, ggplot2以及ComplexHeatmap[2]。在这些软件包中,笔者最喜欢ComplexHeatmap,因为它非常健壮,并且功能强大,作者充满了幽默感。

下面就示例如何使用ComplexHeatmap来绘制热图。

## 首先, 我们先生成一个随机数据。
set.seed(1)
mat <- matrix(runif(36), nrow=6, 
              dimnames=list(paste("row", letters[1:6], sep="_"), 
                            paste("col", LETTERS[1:6], sep="_")))
library(ComplexHeatmap)
## ========================================
## ComplexHeatmap version 1.18.1
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://bioconductor.org/packages/ComplexHeatmap/
## 
## If you use it in published research, please cite:
## Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##   genomic data. Bioinformatics 2016.
## ========================================
Heatmap(mat)

ComplexHeatmap的作者给出了很多非常实用的实例:bioinformatics以及Bioconductor。本文的目的在于推广ComplexHeatmap的使用。

ComplexHeatmap设计了两个重要的类,一个是Heatmap,一个是HeatmapAnnotation。多个Heatmap类还可以组成HeatmapList,HeatmapList可以将多个heatmap肩并肩平行放置。

ComplexHeatmap的布局设计
图 1.9.1 - ComplexHeatmap的布局设计

在绘图上ComplexHeatmap使用了与ggplot2类似的加法运算符('+'),但是其背后使用的是grid包。因此作者特意保留了所有的grid中的viewport,所以可以对每个部分进行多层修饰。但是如果用户没有使用过grid的话,可能会比较难理解。

下面我们举两个例子,来讲解如何使用ComplexHeatmap来绘制复杂的热图。

Heatmap和annotation

我尽量使用ComplexHeatmap中原有的例子来讲解。

ha_column <- 
  HeatmapAnnotation(
    df = data.frame(## data.frame, 行数与mat的列数一致。
      type1 = sample(c("a", "b"), 
                     ncol(mat), 
                     replace = TRUE)),
    col = list(type1 = c("a" = "red", "b" = "blue")) 
    ## 颜色list, 名字与df中的列字一致,
    ##           元素名与df列中的出现的元素一致。
)
ha_row <- rowAnnotation( ## 对行进行注释
  df = data.frame(type2 = sample(c("A", "B"),
                                 nrow(mat), 
                                 replace = TRUE)),
  col = list(type2 = c("A" = "green", "B" = "cyan")),
  width = unit(1, "cm") ##指定宽度,必须是unit,参见grid::unit
)

## 生成两个Heatmap实例。这里我们使用了相同的数据。
## 在实际应用中,它应该是行数一致的不同数据。
ht1 <- Heatmap(mat, name = "ht1", #name将出现的图例中
               column_title = "Heatmap 1", #title将出现在标题中
               top_annotation = ha_column) #ha_column将出现在热图上方(也可以是下方)
ht2 <- Heatmap(mat, name = "ht2", 
               column_title = "Heatmap 2")
ht_list <- ht1 + ht2 + ha_row
draw(ht_list)

也可以在画图的过程中对图例进行调整。

draw(ht_list, heatmap_legend_side = "left", 
     show_annotation_legend = FALSE)

从上面的步骤中,我们可以看到,使用Heatmap可以实例化一个Heatmap,使用HeatmapAnnotation可以实例化一个HeatmapAnnotation。这些实例可以通过'+'号连接起来。画图使用draw或者print命令(注意,因为我们在interactive模式下是自动调用show命令的,所以系统会直接画图,如果不是在interactive模式下,一定要使用draw或者pirnt命令来出图)。

当然annotation可以是多种图型,其中包括points, boxplot, barplot, density_line, violin, histogram, 甚至heatmap。

ha_mix_top <- HeatmapAnnotation(
  points = anno_points(runif(ncol(mat))), ## points
  barplot = anno_barplot(runif(ncol(mat)), axis=TRUE), ## barplot
  histogram = anno_histogram(mat, 
                             gp = gpar(fill = 1:6)), ## histogram
  density_line = anno_density(mat, type = "line", 
                              gp = gpar(col = 1:6)), ## density
  violin = anno_density(mat, type = "violin", gp = gpar(fill = 6:1)), ## violin
  box = anno_boxplot(mat), ## boxplot
  heatmap = anno_density(mat, type = "heatmap") ## heatmap
)
Heatmap(mat, name = "mat", 
        top_annotation = ha_mix_top, 
        top_annotation_height = unit(8, "cm"))

对于rowAnnotation,以上各种类型的注释,可以使用row_前缀,比如

  • row_anno_points()
  • row_anno_barplot()
  • row_anno_boxplot()
  • row_anno_histogram()
  • row_anno_density()
ha1 <- rowAnnotation(b1 = row_anno_boxplot(mat, axis = TRUE),
    p1 = row_anno_points(runif(nrow(mat)), axis = TRUE),
    width = unit(4, "cm"))
ha2 <- rowAnnotation(b2 = row_anno_boxplot(mat, axis = TRUE),
    p2 = row_anno_points(rowMeans(mat), axis = TRUE), 
    width = unit(4, "cm"))
Heatmap(mat, name = "foo", top_annotation = ha_mix_top, 
        top_annotation_height = unit(8, "cm")) + ha1 + ha2

#在现有注释的基础上,可以使用`decorate_annotation`来对图像近一步修饰。

decorate_annotation("barplot", ## 第一个参数是要找的annotation的名字
                    {##第二个参数是要运行的命令。
  grid.text("barplot", unit(-10, "mm"), just="bottom", rot=90)
  grid.lines(c(0, 1), unit(c(.5, .5), "native"), ##这里的native的意思是指实际坐标系。 
             gp=gpar(lty=2, col="blue"))
})

decorate_annotation("p2", {
  grid.lines(unit(c(.5, .5), "native"), c(0, 1),
             gp=gpar(col="red"))
})

如果想只对个别的行或者列标示名称,其它的不标注,如何实现呢?

mat = matrix(rnorm(10000), nr = 1000)
rownames(mat) = sprintf("%.2f", rowMeans(mat))
subset = sample(1000, 20)
labels = rownames(mat)[subset]
Heatmap(mat, show_row_names = FALSE, show_row_dend = FALSE, show_column_dend = FALSE) + 
rowAnnotation(link = row_anno_link(at = subset, labels = labels),
  width = unit(1, "cm") + max_text_width(labels))

OncoPrint

cBioPortal给我们带来了lollipop plot以及OncoPrint来表示基因中的突变情况。这里的oncoPrint就是对OncoPrint的R包装。

mat <- read.table(textConnection(
",s1,s2,s3
g1,snv1;indel,snv1;snv2,indel
g2,,snv2;indel,snv1
g3,snv1,,indel;snv1;snv2"), 
row.names = 1, header = TRUE, 
sep = ",", stringsAsFactors = FALSE)
mat <- as.matrix(mat)
mat
##    s1           s2           s3               
## g1 "snv1;indel" "snv1;snv2"  "indel"          
## g2 ""           "snv2;indel" "snv1"           
## g3 "snv1"       ""           "indel;snv1;snv2"
library(ComplexHeatmap)
col <- c(snv1 = "#008000", snv2 = "#008000", indel = "blue")
oncoPrint(mat, get_type = function(x) strsplit(x, ";")[[1]],
    alter_fun = list(
      indel = function(x, y, w, h) 
        grid.rect(x, y, w*0.9, h*0.9, 
                  gp = gpar(fill = col["indel"], 
                            col = NA)) , #must be draw first
        snv1 = function(x, y, w, h) 
          grid.rect(x, y-.15*h, w*0.9, h*0.2, 
                    gp = gpar(fill = col["snv1"], 
                              col = NA)),
        snv2 = function(x, y, w, h) 
          grid.rect(x, y+.15*h, w*0.9, h*0.2, 
                    gp = gpar(fill = col["snv2"], 
                              col = NA))
    ), col = col)

上面的例子有两个重要的概念,一个是get_type,一个是alter_fun。两者都接受一个函数,其中get_type是如何将mat中的每个单元如何分解成突变类型,alter_fun是如何对每一种突变类型进行绘图。 get_type接受一个参数x,这个x将会是mat中的每一个值。比如这里的mat[g1, s1]就是"snv1;indel"。而alter_fun将会是一个list of function,list中的每个元素名称都将是可能出现在get_type得到的值中。list中的每一个函数有四个参数,x,y,w,h,它们分别指heatmap中对应mat中的每个单元格的x, y的坐标以及单元格的高度和宽度。

现在出一个问题:如果我们希望所有的snv都合并考虑,当只有一个snv的时候,画在cell的中间,如果有两个的时候,对称的画两条。也就是说,无论是snv1还是snv2,如果只出现其中一个,都画在正中间。这时应该怎么做呢?

我的想法是对每一个cell设置一个全局变量, 画两次。也许有更好的办法,希望同学们留言交流。

rm(list=ls(pattern="cell_.*_.*"))
max <- 0
snv <- function(x, y, w, h) {
  globalV <- paste("cell", x, y, sep="_")
  cnt <- 1
  if(exists(globalV)){
    cnt <- get(globalV, envir=globalenv())+1
  }
  assign(globalV, cnt, envir=globalenv())
  if(cnt>max) assign("max", cnt, envir=globalenv())
  grid.rect(x, y, w*0.9, h*0.2, 
            gp = gpar(fill = col["snv1"], 
                      col = NA))
}
oncoPrint(mat, get_type = function(x) strsplit(x, ";")[[1]],
    alter_fun = list(
      indel = function(x, y, w, h) 
        grid.rect(x, y, w*0.9, h*0.9, 
                  gp = gpar(fill = col["indel"], 
                            col = NA)) , #must be draw first
        snv1 = snv,
        snv2 = snv
    ), col = col)

H.factor <- 1 / ( 2 * max - 1)
rm(list=ls(pattern="cell.cnt_.*_.*"))
snv <- function(x, y, w, h) {
  globalV <- paste("cell", x, y, sep="_")
  total <- get(globalV, envir=globalenv())
  localV <- paste("cell.cnt", x, y, sep="_")
  if(exists(localV)){
    cnt <- get(localV, envir=globalenv())+1
  }else{
    cnt <- 1
  }
  assign(localV, cnt, envir=globalenv())
  pos.y <- seq(from=0, to=1, length.out=total+1)
  pos.y <- (pos.y[-length(pos.y)] + pos.y[-1]) / 2
  pos.y <- (pos.y[cnt] - .5 ) * h
  grid.rect(x, y + pos.y, w*0.9, h*H.factor, 
            gp = gpar(fill = col["snv1"], 
                      col = NA))
}
oncoPrint(mat, get_type = function(x) strsplit(x, ";")[[1]],
    alter_fun = list(
      indel = function(x, y, w, h) 
        grid.rect(x, y, w*0.9, h*0.9, 
                  gp = gpar(fill = col["indel"], 
                            col = NA)) , #must be draw first
        snv1 = snv,
        snv2 = snv
    ), col = col)

下面就是一个实际的例子,其实也就是ComplexHeatmap中的实例。

mat <- read.delim(file.path(system.file("extdata", "tcga_lung_adenocarcinoma_provisional_ras_raf_mek_jnk_signalling.txt",
                                       package = "ComplexHeatmap")), 
                 stringsAsFactors=FALSE, row.names=1)
mat[is.na(mat)] <- "  "
mat <-  mat[, -ncol(mat)]
mat = t(as.matrix(mat))
mat[1:3, 1:3]
##      TCGA-05-4384-01 TCGA-05-4390-01 TCGA-05-4425-01
## KRAS "  "            "MUT;"          "  "           
## HRAS "  "            "  "            "  "           
## BRAF "  "            "  "            "  "
unique(unlist(strsplit(mat, ";")))
## [1] "  "     "MUT"    "AMP"    "HOMDEL"

我们可以看到,这里有三种不同的突变类型,“MUT”,“AMP”,“HOMDEL”。我们要对它们分别写出alter_fun

alter_fun <- list(
    background = function(x, y, w, h) {
        grid.rect(x, y, w-unit(0.5, "mm"), h-unit(0.5, "mm"), gp = gpar(fill = "#CCCCCC", col = NA))
    },
    HOMDEL = function(x, y, w, h) {
        grid.rect(x, y, w-unit(0.5, "mm"), h-unit(0.5, "mm"), gp = gpar(fill = "blue", col = NA))
    },
    AMP = function(x, y, w, h) {
        grid.rect(x, y, w-unit(0.5, "mm"), h-unit(0.5, "mm"), gp = gpar(fill = "red", col = NA))
    },
    MUT = function(x, y, w, h) {
        grid.rect(x, y, w-unit(0.5, "mm"), h*0.33, gp = gpar(fill = "#008000", col = NA))
    }
)

同时对我们使用到的颜色进行定义,它们将用在上方和右方的统计中。

col = c("MUT" = "#008000", "AMP" = "red", "HOMDEL" = "blue")

使用remove_empty_columns来去除没有突变的样品。 使用column_order以及row_order来按需要设定样品顺序。

sample_order <- scan(system.file("extdata", "sample_order.txt",
                                package = "ComplexHeatmap"), 
                    what = "character")
ht <- oncoPrint(mat, get_type = function(x) strsplit(x, ";")[[1]],
                alter_fun = alter_fun, col = col, 
                row_order = NULL, column_order = sample_order,
                remove_empty_columns = TRUE,
                column_title = "OncoPrint for TCGA Lung Adenocarcinoma, genes in Ras Raf MEK JNK signalling",
                heatmap_legend_param = 
                  list(title = "Alternations", 
                       at = c("AMP", "HOMDEL", "MUT"), 
                       labels = c("Amplification", 
                                  "Deep deletion", 
                                  "Mutation"),
                       nrow = 1, title_position = "leftcenter"))
draw(ht, heatmap_legend_side = "bottom")

基因网络建立
Gene regulation network building by GeneNetworkBuilder

在实验中,我们常常会将RNAseq的结果和ChIPseq的结果结合起来,以利于提高下游的验证率,在这个过程中,可以使用GeneNetworkBuilder来查找高验证率的目标基因。GeneNetworkBuilder需要两个实验输入,一个是全基因 组的表达图谱,一个是目标蛋白的目标基因。GeneNetworkbuilder还需要该物种的调控网络数据,该数据是为了联接实验数据迷失的目标基因。

GeneNetworkBuilder的输入与输出
图 1.9.2 - GeneNetworkBuilder的输入与输出

对于建立这类的调控网络,目前最大的问题是得到的调控网络往往是一个难以理清的毛线团,有一种牵一发而动全身的感觉。所以,下游的过滤分析是极为重要的。现阶段可能的几个过滤手段大多都是基于富集分析的,然而往往结果并不是非常理想。

library(GeneNetworkBuilder)
library(simpIntLists)
i <- findInteractionList("human", "Official")
i <- lapply(i, function(.ele) cbind(from=as.character(.ele$name), to=as.character(.ele$interactors)))
i <- do.call(rbind, i)
set.seed(123)
## generate a random ChIP-seq binding table
rootgene <- sample(i[, 1], 1)
TFbindingTable <- i[i[, 1] == rootgene, ]
interactionmap <- i
# build network
sifNetwork<-buildNetwork(TFbindingTable=TFbindingTable, 
                        interactionmap=interactionmap, level=2)
ID=unique(as.character(sifNetwork))
## create a random expression data
expressionData <- data.frame(ID=ID,
                             logFC=sample(-3:3, length(ID), replace=TRUE),
                             P.Value=runif(n=length(ID), max=0.25))
## filter network
cifNetwork<-filterNetwork(rootgene=rootgene, sifNetwork=sifNetwork, 
                    exprsData=expressionData, mergeBy="ID",
                    miRNAlist=character(0), 
                    tolerance=1, cutoffPVal=0.01, cutoffLFC=1)
## polish network
gR<-polishNetwork(cifNetwork)
## browse network
# browseNetwork(gR) # interactive htmlwidget; 
## or plot by Rgraphviz
library(Rgraphviz)
plotNetwork<-function(gR, layouttype="dot", ...){
    if(!is(gR,"graphNEL")) stop("gR must be a graphNEL object")
    if(!(GeneNetworkBuilder:::inList(layouttype, c("dot", "neato", "twopi", "circo", "fdp")))){
        stop("layouttype must be dot, neato, twopi, circo or fdp")
    }
    g1<-Rgraphviz::layoutGraph(gR, layoutType=layouttype, ...)
    nodeRenderInfo(g1)$col <- nodeRenderInfo(gR)$col
    nodeRenderInfo(g1)$fill <- nodeRenderInfo(gR)$fill
    renderGraph(g1)
}
plotNetwork(gR)

查看信号
Visualization by trackViewer

当拿到ChIPseq的结果后,我们可以使用众多手段来查看reads的真实情况,比如说使用IGV, UCSC genome browser等。但是,当大家需要把这个track生成图片发表时,这些工具提供的图片输出有时候无法达到发表的要求,于是很多软件包就因此而生。最常使用到的软件包是Gviz。然而,我想大家也看出来了,本文夹私的现象非常严重,所以,我重点介绍一下trackViewer软件包。

trackViewer软件包不但可以输出高品质的track图,还可以生成高颜值的lollipop plot。甚至,还在推广一种名为Dandelion plot的表示SNP以及Indel数据的可视化图形。

library(trackViewer)
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
gr <- parse2GRanges("chr3:108,476,000-108,485,000")
gr # interesting genomic locations
## GRanges object with 1 range and 0 metadata columns:
##       seqnames              ranges strand
##          <Rle>           <IRanges>  <Rle>
##   [1]     chr3 108476000-108485000      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
library(GenomicFeatures)# load GenomicFeatures can create TxDb from UCSC
if(interactive()){
    mm8KG <- makeTxDbFromUCSC(genome="mm8", tablename="knownGene")
    saveDb(mm8KG, "mm8KG.sqlite")
}else{## mm8KG was saved as sqlite file
    mm8KG <- loadDb("data/mm8KG.sqlite")
}
library(org.Mm.eg.db) # load annotation database
##
## create the gene model tracks information
trs <- geneModelFromTxdb(mm8KG, org.Mm.eg.db, gr=gr)
## import data from bedGraph/bigWig/BED ... files, see ?importScore for details
CLIP <- importScore("data/CLIP.bedGraph", format="bedGraph", ranges=gr)
control <- importScore("data/control.bedGraph", format="bedGraph", ranges=gr)
knockdown <- importScore("data/knockdown.bedGraph", format="bedGraph", ranges=gr)
## create styles by preset theme
optSty <- optimizeStyle(trackList(trs, knockdown, control, CLIP), theme="col")
trackList <- optSty$tracks
viewerStyle <- optSty$style
## adjust the styles for this track
### rename the trackList for each track
names(trackList)[1:2] <- paste0("Sort1: ", names(trackList)[1:2])
names(trackList)[3] <- "RNA-seq TDP-43 KD"
names(trackList)[4] <- "RNA-seq control"
### change the lab positions for gene model track to bottomleft
setTrackStyleParam(trackList[[1]], "ylabpos", "bottomleft")
setTrackStyleParam(trackList[[2]], "ylabpos", "bottomleft")
### change the color of gene model track
setTrackStyleParam(trackList[[1]], "ylabgp", list(cex=1, col="red"))
setTrackStyleParam(trackList[[2]], "ylabgp", list(cex=1, col="green"))
### remove the xaxis
setTrackViewerStyleParam(viewerStyle, "xaxis", FALSE)
### add a scale bar in CLIP track
setTrackXscaleParam(trackList[[5]], "draw", TRUE)
## plot the tracks
vp <- viewTracks(trackList, gr=gr, viewerStyle=viewerStyle)
## Found more than one class "pos" in cache; using the first, from namespace 'motifStack'
## Also defined by 'trackViewer'
### add guide lines to show the range of CLIP-seq signal
addGuideLine(c(108481252, 108481887), vp=vp)
### add arrow mark to show the alternative splicing event
addArrowMark(list(x=c(108483570, 108483570), 
                  y=c(3, 4)), 
             label=c("Inclusive\nexon", ""), 
             col=c("blue", "cyan"), 
             vp=vp, quadrant=1)

因为motif搜索需要较多的CPU时间,所以大家都使用MEME Suite或者Homer来进行motif搜索。在Bioconductor中,只要是对motif图形化展示的包比较多。这里介绍motifStack包。

motifStack包不但可以画单一个DNA,RNA或者氨基酸logo,还可以同时画多个logo。

library(motifStack)
pcm <- read.table(file.path(find.package("motifStack"), 
                            "extdata", "bin_SOLEXA.pcm"))
pcm <- pcm[,3:ncol(pcm)]
rownames(pcm) <- c("A","C","G","T")
motif <- new("pcm", mat=as.matrix(pcm), name="bin_SOLEXA")
plot(motif)

## plot a RNA sequence logo
rna <- pcm
rownames(rna)[4] <- "U"
motif <- new("pcm", mat=as.matrix(rna), name="RNA_motif")
plot(motif)

## plot an amino acid sequence logo
protein<-read.table(file.path(find.package("motifStack"),"extdata","cap.txt"))
protein<-t(protein[,1:20])
motif<-pcm2pfm(protein)
motif<-new("pfm", mat=motif, name="CAP", 
            color=colorset(alphabet="AA",colorScheme="chemistry"))
plot(motif)

## plot an affinity logo
motif<-matrix(
  c(
    .846, .631, .593, .000, .000, .000, .434, .410, 1.00, .655, .284, .000, .000, .771, .640, .961,
    .625, .679, .773, 1.00, 1.00, .000, .573, .238, .397, 1.00, 1.00, .000, .298, 1.00, 1.00, .996,
    1.00, 1.00, 1.00, .228, .000, 1.00, 1.00, .597, .622, .630, .000, 1.00, 1.00, .871, .617, 1.00,
    .701, .513, .658, .000, .000, .247, .542, 1.00, .718, .686, .000, .000, .000, .595, .437, .970
  ), nrow=4, byrow = TRUE)
rownames(motif) <- c("A", "C", "G", "T")
motif<-new("psam", mat=motif, name="affinity logo")
plot(motif)

## plot sequence logo stack
motifs<-importMatrix(dir(file.path(find.package("motifStack"), "extdata"),"pcm$", full.names = TRUE))

## plot stacks
motifStack(motifs[1:4], layout="stack", ncex=1.0)

1Falcon, Seth and Gentleman, Robert, Using GOstats to test gene lists for GO term association, Oxford University Press, 2006.
2Gu, Zuguang et al., Complex heatmaps reveal patterns and correlations in multidimensional genomic data, Oxford University Press, 2016.

results matching ""

    No results matching ""