随着next-generation-sequence技术的不断完善,bioconductor的开发重心也由microarray转移至NGS的应用中来。越来越多的软件包针对NGS。而这些针对NGS的软件包,大多数都是基于IRanges或者GenomicRanges这两个包的。它俩与Biostrings一起,成为了NGS数据结构的基础。
IRanges定义了IRanges类,它的构建函数为:
IRanges(start=NULL, end=NULL, width=NULL, names=NULL)
从IRanges的构建函数很容易看出它描述了一个有起始点及终止点的数据,非常适合用于表示一段DNA序列在基因组上的坐标。但是它并不完善,因为基因组是由不同的染色体组成的,每一个染色体都有正负两条链。于是bioconductor又在IRanges的基础上构建了GRanges数据结构,其构建函数为:
GRanges(seqnames = Rle(), ranges = IRanges(), strand = Rle("*", length(seqnames)), ..., seqlengths = NULL, seqinfo = NULL)
其中seqnames表示了染色体的名称,比如chr1,chr2,…;strand表示正负链,可接受的值为“+”,“-”,以及“*”等。除此之外,它还有seqlengths用于保存不同染色体的长度,seqinfo来保存关于染色体的其它信息等等。
我们来看一个构建GRanges的例子:
> seqinfo <- Seqinfo(paste0("chr", 1:3), c(1000, 2000, 1500), NA, "mock1") > gr <- + GRanges(seqnames = + Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), + ranges = IRanges( + 1:10, width = 10:1, names = head(letters,10)), + strand = Rle( + strand(c("-", "+", "*", "+", "-")), + c(1, 2, 2, 3, 2)), + score = 1:10, + GC = seq(1, 0, length=10), + seqinfo=seqinfo) > gr GRanges with 10 ranges and 2 metadata columns: seqnames ranges strand | score GC <Rle> <IRanges> <Rle> | <integer> <numeric> a chr1 [ 1, 10] - | 1 1 b chr2 [ 2, 10] + | 2 0.888888888888889 c chr2 [ 3, 10] + | 3 0.777777777777778 d chr2 [ 4, 10] * | 4 0.666666666666667 e chr1 [ 5, 10] * | 5 0.555555555555556 f chr1 [ 6, 10] + | 6 0.444444444444444 g chr3 [ 7, 10] + | 7 0.333333333333333 h chr3 [ 8, 10] + | 8 0.222222222222222 i chr3 [ 9, 10] - | 9 0.111111111111111 j chr3 [10, 10] - | 10 0 --- seqlengths: chr1 chr2 chr3 1000 2000 1500 |
我们看到,GRanges在显示上有些类似data.frame,但它其实并非如此。事实上,GRanges更类似vector。我们来看一看我们如何来获取GRanges中的信息。如果我们想获取序列的起止位置,可以使用start和end,如
> start(gr) [1] 1 2 3 4 5 6 7 8 9 10 > end(gr) [1] 10 10 10 10 10 10 10 10 10 10 |
想获取每个序列的seqnames,可以使用seqnames,改变它的值可以使用 seqnames(x) <- value。
> seqlevels(gr) [1] "chr1" "chr2" "chr3" > seqlevels(gr) <- sub("chr", "Chrom", seqlevels(gr)) > gr GRanges with 10 ranges and 2 metadata columns: seqnames ranges strand | score GC <Rle> <IRanges> <Rle> | <integer> <numeric> a Chrom1 [ 1, 10] - | 1 1 b Chrom2 [ 2, 10] + | 2 0.888888888888889 c Chrom2 [ 3, 10] + | 3 0.777777777777778 d Chrom2 [ 4, 10] * | 4 0.666666666666667 e Chrom1 [ 5, 10] * | 5 0.555555555555556 f Chrom1 [ 6, 10] + | 6 0.444444444444444 g Chrom3 [ 7, 10] + | 7 0.333333333333333 h Chrom3 [ 8, 10] + | 8 0.222222222222222 i Chrom3 [ 9, 10] - | 9 0.111111111111111 j Chrom3 [10, 10] - | 10 0 --- seqlengths: Chrom1 Chrom2 Chrom3 1000 2000 1500 > seqlevels(gr) <- sub("Chrom", "chr", seqlevels(gr)) # 变回来 > gr GRanges with 10 ranges and 2 metadata columns: seqnames ranges strand | score GC <Rle> <IRanges> <Rle> | <integer> <numeric> a chr1 [ 1, 10] - | 1 1 b chr2 [ 2, 10] + | 2 0.888888888888889 c chr2 [ 3, 10] + | 3 0.777777777777778 d chr2 [ 4, 10] * | 4 0.666666666666667 e chr1 [ 5, 10] * | 5 0.555555555555556 f chr1 [ 6, 10] + | 6 0.444444444444444 g chr3 [ 7, 10] + | 7 0.333333333333333 h chr3 [ 8, 10] + | 8 0.222222222222222 i chr3 [ 9, 10] - | 9 0.111111111111111 j chr3 [10, 10] - | 10 0 --- seqlengths: chr1 chr2 chr3 1000 2000 1500 |
想获取每个序列的名称,使用names,赋值使用names(x) <- value
> names(gr) [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" > names(gr)[1:3] <- LETTERS[1:3] > gr GRanges with 10 ranges and 2 metadata columns: seqnames ranges strand | score GC <Rle> <IRanges> <Rle> | <integer> <numeric> A chr1 [ 1, 10] - | 1 1 B chr2 [ 2, 10] + | 2 0.888888888888889 C chr2 [ 3, 10] + | 3 0.777777777777778 d chr2 [ 4, 10] * | 4 0.666666666666667 e chr1 [ 5, 10] * | 5 0.555555555555556 f chr1 [ 6, 10] + | 6 0.444444444444444 g chr3 [ 7, 10] + | 7 0.333333333333333 h chr3 [ 8, 10] + | 8 0.222222222222222 i chr3 [ 9, 10] - | 9 0.111111111111111 j chr3 [10, 10] - | 10 0 --- seqlengths: chr1 chr2 chr3 1000 2000 1500 |
想获取每个序列的正负链,使用strand, 赋值使用strand(x) <- value。同样的,获取序列的宽度使用width即可。
> strand(gr) factor-Rle of length 10 with 5 runs Lengths: 1 2 2 3 2 Values : - + * + - Levels(3): + - * > width(gr) [1] 10 9 8 7 6 5 4 3 2 1 |
如果我们想获取序列的score值以及GC值又该如何呢?
对于score值,我们可以使用score(x)来获取,注意,这是非常特殊的一个例,并不意味着在竖线右侧的值都可以使用其列名来获取其值。对于GRanges的列值的获取,其实是通过data.frame的操作来完成的。首先需要使用到mcols(x, use.names=FALSE)来将GRanges竖线右侧的值(又称metadata,故而使用m[etadata]col[umn]s来取值),示例:
> score(gr) [1] 1 2 3 4 5 6 7 8 9 10 > mcols(gr) DataFrame with 10 rows and 2 columns score GC <integer> <numeric> 1 1 1.0000000 2 2 0.8888889 3 3 0.7777778 4 4 0.6666667 5 5 0.5555556 6 6 0.4444444 7 7 0.3333333 8 8 0.2222222 9 9 0.1111111 10 10 0.0000000 > mcols(gr)$GC [1] 1.0000000 0.8888889 0.7777778 0.6666667 0.5555556 0.4444444 0.3333333 0.2222222 0.1111111 [10] 0.0000000 > mcols(gr)$score [1] 1 2 3 4 5 6 7 8 9 10 > gr$GC [1] 1.0000000 0.8888889 0.7777778 0.6666667 0.5555556 0.4444444 0.3333333 0.2222222 0.1111111 [10] 0.0000000 > gr$score [1] 1 2 3 4 5 6 7 8 9 10 > gr$score[1:5] <- 5:1 > gr$score [1] 5 4 3 2 1 6 7 8 9 10 |
上例中,gr$score其实是mcols(gr)$score的缩写。当我们使用mcols(gr)时,其内容就是一个data.frame了。
从上面的例子我们似乎看不到GRanges类似vector的任何地方,又如何说GRanges更类似vector呢?我们试着访问gr的某一行就了解了。
> gr[1:3] GRanges with 3 ranges and 2 metadata columns: seqnames ranges strand | score GC <Rle> <IRanges> <Rle> | <integer> <numeric> A chr1 [1, 10] - | 5 1 B chr2 [2, 10] + | 4 0.888888888888889 C chr2 [3, 10] + | 3 0.777777777777778 --- seqlengths: chr1 chr2 chr3 1000 2000 1500 > gr[c("A","d","f")] GRanges with 3 ranges and 2 metadata columns: seqnames ranges strand | score GC <Rle> <IRanges> <Rle> | <integer> <numeric> A chr1 [1, 10] - | 5 1 d chr2 [4, 10] * | 2 0.666666666666667 f chr1 [6, 10] + | 6 0.444444444444444 --- seqlengths: chr1 chr2 chr3 1000 2000 1500 |
可以看出,访问某一行时,不再使用gr[i,]的形式,而是直接使用gr[i],这可以说是一种简化吧。但看起来就更类vector了。
对于其它操作,可以使用?GRanges来获取帮助。这里就不多言了。
接下来介绍,一些比较常规的运算,看列表:(源http://bioconductor.org/help/course-materials/2013/SeattleFeb2013/IntermediateSequenceAnalysis2013.pdf)
类别 | 函数 | 描述 |
---|---|---|
接口Accessors | start, end, width
names |
获取及赋值开始,终止,以及宽度。需要注意的是直接设置终止值可能会难以通过,可以试着更改宽度来变列终止值 获取及赋值名称 获取及赋值metadata 获取GRanges的长度(或者不准确地说是行数) 获取GRanges的ranges值 |
排序Ordering | <, <=, >, >=, ==, != sort, order, rank duplicated unique |
比较大小, 以起始位置而后宽度排序
排序 |
基本运算Arithmetic | r + x, r – x, r * x shift resize distance restrict |
Shrink or expand ranges r by number x 对位置进行平移 改变宽度 计算距离 截取某一范围内的值 |
集合操作Set operations | reduce intersect, union, setdiff pintersect, punion, psetdiff gaps, pgap disjoin |
将重叠的部分合并 查看交集,并集以及补集运算 对每一对x[i], y[i]进行交集,并集以及补集运算 获取空白的区域 |
重叠运算Overlaps | findOverlaps countOverlaps nearest precede, follow x %in% y |
查找overlaps 计数overlaps 查找最近的 查找之前或者之后的 x中在y中出现过的 |
覆盖运算Coverage | coverage | 计算每个位置的覆盖数 |
取值Extraction | r[i] subsetByOverlaps head, tail, rev, rep |
取某一(或者几)行的值 取有overlaps的部分 传统R运算 |
分离与组合Split, combine | split c |
按某factor分类 组合两个或者多个ranges |
> ranges(gr) IRanges of length 10 start end width names [1] 1 10 10 A [2] 2 10 9 B [3] 3 10 8 C [4] 4 10 7 d [5] 5 10 6 e [6] 6 10 5 f [7] 7 10 4 g [8] 8 10 3 h [9] 9 10 2 i [10] 10 10 1 j > shift(gr,5) GRanges with 10 ranges and 2 metadata columns: seqnames ranges strand | score GC <Rle> <IRanges> <Rle> | <integer> <numeric> A chr1 [ 6, 15] - | 5 1 B chr2 [ 7, 15] + | 4 0.888888888888889 C chr2 [ 8, 15] + | 3 0.777777777777778 d chr2 [ 9, 15] * | 2 0.666666666666667 e chr1 [10, 15] * | 1 0.555555555555556 f chr1 [11, 15] + | 6 0.444444444444444 g chr3 [12, 15] + | 7 0.333333333333333 h chr3 [13, 15] + | 8 0.222222222222222 i chr3 [14, 15] - | 9 0.111111111111111 j chr3 [15, 15] - | 10 0 --- seqlengths: chr1 chr2 chr3 1000 2000 1500 > resize(gr, 2) GRanges with 10 ranges and 2 metadata columns: seqnames ranges strand | score GC <Rle> <IRanges> <Rle> | <integer> <numeric> A chr1 [9, 10] - | 5 1 B chr2 [2, 3] + | 4 0.888888888888889 C chr2 [3, 4] + | 3 0.777777777777778 d chr2 [4, 5] * | 2 0.666666666666667 e chr1 [5, 6] * | 1 0.555555555555556 f chr1 [6, 7] + | 6 0.444444444444444 g chr3 [7, 8] + | 7 0.333333333333333 h chr3 [8, 9] + | 8 0.222222222222222 i chr3 [9, 10] - | 9 0.111111111111111 j chr3 [9, 10] - | 10 0 --- seqlengths: chr1 chr2 chr3 1000 2000 1500 > gr1 <- shift(gr,3) > distance(gr, gr1) [1] 0 0 0 0 0 0 0 0 1 2 > restrict(gr, 5, 7) GRanges with 8 ranges and 2 metadata columns: seqnames ranges strand | score GC <Rle> <IRanges> <Rle> | <integer> <numeric> A chr1 [5, 7] - | 5 1 B chr2 [5, 7] + | 4 0.888888888888889 C chr2 [5, 7] + | 3 0.777777777777778 d chr2 [5, 7] * | 2 0.666666666666667 e chr1 [5, 7] * | 1 0.555555555555556 f chr1 [6, 7] + | 6 0.444444444444444 g chr3 [7, 7] + | 7 0.333333333333333 h chr3 [8, 7] + | 8 0.222222222222222 --- seqlengths: chr1 chr2 chr3 1000 2000 1500 > reduce(gr) GRanges with 7 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr1 [6, 10] + [2] chr1 [1, 10] - [3] chr1 [5, 10] * [4] chr2 [2, 10] + [5] chr2 [4, 10] * [6] chr3 [7, 10] + [7] chr3 [9, 10] - --- seqlengths: chr1 chr2 chr3 1000 2000 1500 > intersect(gr, gr1) GRanges with 6 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr1 [ 9, 10] + [2] chr1 [ 4, 10] - [3] chr1 [ 8, 10] * [4] chr2 [ 5, 10] + [5] chr2 [ 7, 10] * [6] chr3 [10, 10] + --- seqlengths: chr1 chr2 chr3 1000 2000 1500 > punion(gr, gr1,fill.gap=TRUE) GRanges with 10 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> A chr1 [ 1, 13] - B chr2 [ 2, 13] + C chr2 [ 3, 13] + d chr2 [ 4, 13] * e chr1 [ 5, 13] * f chr1 [ 6, 13] + g chr3 [ 7, 13] + h chr3 [ 8, 13] + i chr3 [ 9, 13] - j chr3 [10, 13] - --- seqlengths: chr1 chr2 chr3 1000 2000 1500 > gaps(gr) GRanges with 15 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr1 [ 1, 5] + [2] chr1 [11, 1000] + [3] chr1 [11, 1000] - [4] chr1 [ 1, 4] * [5] chr1 [11, 1000] * ... ... ... ... [11] chr3 [ 1, 6] + [12] chr3 [11, 1500] + [13] chr3 [ 1, 8] - [14] chr3 [11, 1500] - [15] chr3 [ 1, 1500] * --- seqlengths: chr1 chr2 chr3 1000 2000 1500 > disjoin(gr) GRanges with 10 ranges and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr1 [ 6, 10] + [2] chr1 [ 1, 10] - [3] chr1 [ 5, 10] * [4] chr2 [ 2, 2] + [5] chr2 [ 3, 10] + [6] chr2 [ 4, 10] * [7] chr3 [ 7, 7] + [8] chr3 [ 8, 10] + [9] chr3 [ 9, 9] - [10] chr3 [10, 10] - --- seqlengths: chr1 chr2 chr3 1000 2000 1500 > findOverlaps(gr, gr1) Hits of length 18 queryLength: 10 subjectLength: 10 queryHits subjectHits <integer> <integer> 1 1 1 2 1 5 3 2 2 4 2 3 5 2 4 ... ... ... 14 5 6 15 6 5 16 6 6 17 7 7 18 8 7 > countOverlaps(gr, gr1) A B C d e f g h i j 2 3 3 3 3 2 1 1 0 0 > gr %in% gr1 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > coverage(gr) RleList of length 3 $chr1 integer-Rle of length 1000 with 4 runs Lengths: 4 1 5 990 Values : 1 2 3 0 $chr2 integer-Rle of length 2000 with 5 runs Lengths: 1 1 1 7 1990 Values : 0 1 2 3 0 $chr3 integer-Rle of length 1500 with 6 runs Lengths: 6 1 1 1 1 1490 Values : 0 1 2 3 4 0 > gr[1:3] GRanges with 3 ranges and 2 metadata columns: seqnames ranges strand | score GC <Rle> <IRanges> <Rle> | <integer> <numeric> A chr1 [1, 10] - | 5 1 B chr2 [2, 10] + | 4 0.888888888888889 C chr2 [3, 10] + | 3 0.777777777777778 --- seqlengths: chr1 chr2 chr3 1000 2000 1500 > gr[3] GRanges with 1 range and 2 metadata columns: seqnames ranges strand | score GC <Rle> <IRanges> <Rle> | <integer> <numeric> C chr2 [3, 10] + | 3 0.777777777777778 --- seqlengths: chr1 chr2 chr3 1000 2000 1500 |
非常实用and详细