bioconductor基础IRanges/GenomicRanges

随着next-generation-sequence技术的不断完善,bioconductor的开发重心也由microarray转移至NGS的应用中来。越来越多的软件包针对NGS。而这些针对NGS的软件包,大多数都是基于IRanges或者GenomicRanges这两个包的。它俩与Biostrings一起,成为了NGS数据结构的基础。

IRanges定义了IRanges类,它的构建函数为:

IRanges(start=NULL, end=NULL, width=NULL, names=NULL)

从IRanges的构建函数很容易看出它描述了一个有起始点及终止点的数据,非常适合用于表示一段DNA序列在基因组上的坐标。但是它并不完善,因为基因组是由不同的染色体组成的,每一个染色体都有正负两条链。于是bioconductor又在IRanges的基础上构建了GRanges数据结构,其构建函数为:

GRanges(seqnames = Rle(), ranges = IRanges(), strand = Rle("*", length(seqnames)), ..., seqlengths = NULL, seqinfo = NULL)

其中seqnames表示了染色体的名称,比如chr1,chr2,…;strand表示正负链,可接受的值为“+”,“-”,以及“*”等。除此之外,它还有seqlengths用于保存不同染色体的长度,seqinfo来保存关于染色体的其它信息等等。

我们来看一个构建GRanges的例子:

> seqinfo <- Seqinfo(paste0("chr", 1:3), c(1000, 2000, 1500), NA, "mock1")
> gr <-
+   GRanges(seqnames =
+           Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
+           ranges = IRanges(
+             1:10, width = 10:1, names = head(letters,10)),
+           strand = Rle(
+             strand(c("-", "+", "*", "+", "-")),
+             c(1, 2, 2, 3, 2)),
+           score = 1:10,
+           GC = seq(1, 0, length=10),
+           seqinfo=seqinfo)
> gr
GRanges with 10 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  a     chr1  [ 1, 10]      - |         1                 1
  b     chr2  [ 2, 10]      + |         2 0.888888888888889
  c     chr2  [ 3, 10]      + |         3 0.777777777777778
  d     chr2  [ 4, 10]      * |         4 0.666666666666667
  e     chr1  [ 5, 10]      * |         5 0.555555555555556
  f     chr1  [ 6, 10]      + |         6 0.444444444444444
  g     chr3  [ 7, 10]      + |         7 0.333333333333333
  h     chr3  [ 8, 10]      + |         8 0.222222222222222
  i     chr3  [ 9, 10]      - |         9 0.111111111111111
  j     chr3  [10, 10]      - |        10                 0
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500

我们看到,GRanges在显示上有些类似data.frame,但它其实并非如此。事实上,GRanges更类似vector。我们来看一看我们如何来获取GRanges中的信息。如果我们想获取序列的起止位置,可以使用start和end,如

> start(gr)
 [1]  1  2  3  4  5  6  7  8  9 10
> end(gr)
 [1] 10 10 10 10 10 10 10 10 10 10

想获取每个序列的seqnames,可以使用seqnames,改变它的值可以使用 seqnames(x) <- value。

> seqlevels(gr)
[1] "chr1" "chr2" "chr3"
> seqlevels(gr) <- sub("chr", "Chrom", seqlevels(gr))
> gr
GRanges with 10 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  a   Chrom1  [ 1, 10]      - |         1                 1
  b   Chrom2  [ 2, 10]      + |         2 0.888888888888889
  c   Chrom2  [ 3, 10]      + |         3 0.777777777777778
  d   Chrom2  [ 4, 10]      * |         4 0.666666666666667
  e   Chrom1  [ 5, 10]      * |         5 0.555555555555556
  f   Chrom1  [ 6, 10]      + |         6 0.444444444444444
  g   Chrom3  [ 7, 10]      + |         7 0.333333333333333
  h   Chrom3  [ 8, 10]      + |         8 0.222222222222222
  i   Chrom3  [ 9, 10]      - |         9 0.111111111111111
  j   Chrom3  [10, 10]      - |        10                 0
  ---
  seqlengths:
   Chrom1 Chrom2 Chrom3
     1000   2000   1500
> seqlevels(gr) <- sub("Chrom", "chr", seqlevels(gr)) # 变回来
> gr
GRanges with 10 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  a     chr1  [ 1, 10]      - |         1                 1
  b     chr2  [ 2, 10]      + |         2 0.888888888888889
  c     chr2  [ 3, 10]      + |         3 0.777777777777778
  d     chr2  [ 4, 10]      * |         4 0.666666666666667
  e     chr1  [ 5, 10]      * |         5 0.555555555555556
  f     chr1  [ 6, 10]      + |         6 0.444444444444444
  g     chr3  [ 7, 10]      + |         7 0.333333333333333
  h     chr3  [ 8, 10]      + |         8 0.222222222222222
  i     chr3  [ 9, 10]      - |         9 0.111111111111111
  j     chr3  [10, 10]      - |        10                 0
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500

想获取每个序列的名称,使用names,赋值使用names(x) <- value

> names(gr)
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
> names(gr)[1:3] <- LETTERS[1:3]
> gr
GRanges with 10 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  A     chr1  [ 1, 10]      - |         1                 1
  B     chr2  [ 2, 10]      + |         2 0.888888888888889
  C     chr2  [ 3, 10]      + |         3 0.777777777777778
  d     chr2  [ 4, 10]      * |         4 0.666666666666667
  e     chr1  [ 5, 10]      * |         5 0.555555555555556
  f     chr1  [ 6, 10]      + |         6 0.444444444444444
  g     chr3  [ 7, 10]      + |         7 0.333333333333333
  h     chr3  [ 8, 10]      + |         8 0.222222222222222
  i     chr3  [ 9, 10]      - |         9 0.111111111111111
  j     chr3  [10, 10]      - |        10                 0
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500

想获取每个序列的正负链,使用strand, 赋值使用strand(x) <- value。同样的,获取序列的宽度使用width即可。

> strand(gr)
factor-Rle of length 10 with 5 runs
  Lengths: 1 2 2 3 2
  Values : - + * + -
Levels(3): + - *
> width(gr)
 [1] 10  9  8  7  6  5  4  3  2  1

如果我们想获取序列的score值以及GC值又该如何呢?
对于score值,我们可以使用score(x)来获取,注意,这是非常特殊的一个例,并不意味着在竖线右侧的值都可以使用其列名来获取其值。对于GRanges的列值的获取,其实是通过data.frame的操作来完成的。首先需要使用到mcols(x, use.names=FALSE)来将GRanges竖线右侧的值(又称metadata,故而使用m[etadata]col[umn]s来取值),示例:

> score(gr)
 [1]  1  2  3  4  5  6  7  8  9 10
> mcols(gr)
DataFrame with 10 rows and 2 columns
       score        GC
   <integer> <numeric>
1          1 1.0000000
2          2 0.8888889
3          3 0.7777778
4          4 0.6666667
5          5 0.5555556
6          6 0.4444444
7          7 0.3333333
8          8 0.2222222
9          9 0.1111111
10        10 0.0000000
> mcols(gr)$GC
 [1] 1.0000000 0.8888889 0.7777778 0.6666667 0.5555556 0.4444444 0.3333333 0.2222222 0.1111111
[10] 0.0000000
> mcols(gr)$score
 [1]  1  2  3  4  5  6  7  8  9 10
> gr$GC
 [1] 1.0000000 0.8888889 0.7777778 0.6666667 0.5555556 0.4444444 0.3333333 0.2222222 0.1111111
[10] 0.0000000
> gr$score
 [1]  1  2  3  4  5  6  7  8  9 10
> gr$score[1:5] <- 5:1
> gr$score
 [1]  5  4  3  2  1  6  7  8  9 10

上例中,gr$score其实是mcols(gr)$score的缩写。当我们使用mcols(gr)时,其内容就是一个data.frame了。
从上面的例子我们似乎看不到GRanges类似vector的任何地方,又如何说GRanges更类似vector呢?我们试着访问gr的某一行就了解了。

> gr[1:3]
GRanges with 3 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  A     chr1   [1, 10]      - |         5                 1
  B     chr2   [2, 10]      + |         4 0.888888888888889
  C     chr2   [3, 10]      + |         3 0.777777777777778
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500
> gr[c("A","d","f")]
GRanges with 3 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  A     chr1   [1, 10]      - |         5                 1
  d     chr2   [4, 10]      * |         2 0.666666666666667
  f     chr1   [6, 10]      + |         6 0.444444444444444
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500

可以看出,访问某一行时,不再使用gr[i,]的形式,而是直接使用gr[i],这可以说是一种简化吧。但看起来就更类vector了。
对于其它操作,可以使用?GRanges来获取帮助。这里就不多言了。

接下来介绍,一些比较常规的运算,看列表:(源http://bioconductor.org/help/course-materials/2013/SeattleFeb2013/IntermediateSequenceAnalysis2013.pdf)

类别 函数 描述
接口Accessors start, end, width

names
mcols, metadata
length
range

获取及赋值开始,终止,以及宽度。需要注意的是直接设置终止值可能会难以通过,可以试着更改宽度来变列终止值
获取及赋值名称
获取及赋值metadata
获取GRanges的长度(或者不准确地说是行数)
获取GRanges的ranges值
排序Ordering <, <=, >, >=, ==, !=
sort, order, rank
duplicated
unique
比较大小, 以起始位置而后宽度排序

排序
找出重复出现的ranges
去除重复出现的ranges

基本运算Arithmetic r + x, r – x, r * x
shift
resize
distance
restrict
Shrink or expand ranges r by number x
对位置进行平移
改变宽度
计算距离
截取某一范围内的值
集合操作Set operations reduce
intersect, union, setdiff
pintersect, punion, psetdiff
gaps, pgap
disjoin
将重叠的部分合并
查看交集,并集以及补集运算

对每一对x[i], y[i]进行交集,并集以及补集运算

获取空白的区域
将重合的区域分割成小片段

重叠运算Overlaps findOverlaps
countOverlaps
nearest
precede, follow
x %in% y
查找overlaps
计数overlaps
查找最近的
查找之前或者之后的
x中在y中出现过的
覆盖运算Coverage coverage 计算每个位置的覆盖数
取值Extraction r[i]
subsetByOverlaps
head, tail, rev, rep
取某一(或者几)行的值
取有overlaps的部分
传统R运算

分离与组合Split, combine split
c
按某factor分类
组合两个或者多个ranges
> ranges(gr)
IRanges of length 10
     start end width names
[1]      1  10    10     A
[2]      2  10     9     B
[3]      3  10     8     C
[4]      4  10     7     d
[5]      5  10     6     e
[6]      6  10     5     f
[7]      7  10     4     g
[8]      8  10     3     h
[9]      9  10     2     i
[10]    10  10     1     j
> shift(gr,5)
GRanges with 10 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  A     chr1  [ 6, 15]      - |         5                 1
  B     chr2  [ 7, 15]      + |         4 0.888888888888889
  C     chr2  [ 8, 15]      + |         3 0.777777777777778
  d     chr2  [ 9, 15]      * |         2 0.666666666666667
  e     chr1  [10, 15]      * |         1 0.555555555555556
  f     chr1  [11, 15]      + |         6 0.444444444444444
  g     chr3  [12, 15]      + |         7 0.333333333333333
  h     chr3  [13, 15]      + |         8 0.222222222222222
  i     chr3  [14, 15]      - |         9 0.111111111111111
  j     chr3  [15, 15]      - |        10                 0
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500
> resize(gr, 2)
GRanges with 10 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  A     chr1   [9, 10]      - |         5                 1
  B     chr2   [2,  3]      + |         4 0.888888888888889
  C     chr2   [3,  4]      + |         3 0.777777777777778
  d     chr2   [4,  5]      * |         2 0.666666666666667
  e     chr1   [5,  6]      * |         1 0.555555555555556
  f     chr1   [6,  7]      + |         6 0.444444444444444
  g     chr3   [7,  8]      + |         7 0.333333333333333
  h     chr3   [8,  9]      + |         8 0.222222222222222
  i     chr3   [9, 10]      - |         9 0.111111111111111
  j     chr3   [9, 10]      - |        10                 0
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500
> gr1 <- shift(gr,3)
> distance(gr, gr1)
 [1] 0 0 0 0 0 0 0 0 1 2
> restrict(gr, 5, 7)
GRanges with 8 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  A     chr1    [5, 7]      - |         5                 1
  B     chr2    [5, 7]      + |         4 0.888888888888889
  C     chr2    [5, 7]      + |         3 0.777777777777778
  d     chr2    [5, 7]      * |         2 0.666666666666667
  e     chr1    [5, 7]      * |         1 0.555555555555556
  f     chr1    [6, 7]      + |         6 0.444444444444444
  g     chr3    [7, 7]      + |         7 0.333333333333333
  h     chr3    [8, 7]      + |         8 0.222222222222222
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500
> reduce(gr)
GRanges with 7 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   [6, 10]      +
  [2]     chr1   [1, 10]      -
  [3]     chr1   [5, 10]      *
  [4]     chr2   [2, 10]      +
  [5]     chr2   [4, 10]      *
  [6]     chr3   [7, 10]      +
  [7]     chr3   [9, 10]      -
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500
> intersect(gr, gr1)
GRanges with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1  [ 9, 10]      +
  [2]     chr1  [ 4, 10]      -
  [3]     chr1  [ 8, 10]      *
  [4]     chr2  [ 5, 10]      +
  [5]     chr2  [ 7, 10]      *
  [6]     chr3  [10, 10]      +
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500
> punion(gr, gr1,fill.gap=TRUE)
GRanges with 10 ranges and 0 metadata columns:
    seqnames    ranges strand
       <Rle> <IRanges>  <Rle>
  A     chr1  [ 1, 13]      -
  B     chr2  [ 2, 13]      +
  C     chr2  [ 3, 13]      +
  d     chr2  [ 4, 13]      *
  e     chr1  [ 5, 13]      *
  f     chr1  [ 6, 13]      +
  g     chr3  [ 7, 13]      +
  h     chr3  [ 8, 13]      +
  i     chr3  [ 9, 13]      -
  j     chr3  [10, 13]      -
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500
> gaps(gr)
GRanges with 15 ranges and 0 metadata columns:
       seqnames     ranges strand
          <Rle>  <IRanges>  <Rle>
   [1]     chr1 [ 1,    5]      +
   [2]     chr1 [11, 1000]      +
   [3]     chr1 [11, 1000]      -
   [4]     chr1 [ 1,    4]      *
   [5]     chr1 [11, 1000]      *
   ...      ...        ...    ...
  [11]     chr3 [ 1,    6]      +
  [12]     chr3 [11, 1500]      +
  [13]     chr3 [ 1,    8]      -
  [14]     chr3 [11, 1500]      -
  [15]     chr3 [ 1, 1500]      *
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500
> disjoin(gr)
GRanges with 10 ranges and 0 metadata columns:
       seqnames    ranges strand
          <Rle> <IRanges>  <Rle>
   [1]     chr1  [ 6, 10]      +
   [2]     chr1  [ 1, 10]      -
   [3]     chr1  [ 5, 10]      *
   [4]     chr2  [ 2,  2]      +
   [5]     chr2  [ 3, 10]      +
   [6]     chr2  [ 4, 10]      *
   [7]     chr3  [ 7,  7]      +
   [8]     chr3  [ 8, 10]      +
   [9]     chr3  [ 9,  9]      -
  [10]     chr3  [10, 10]      -
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500
> findOverlaps(gr, gr1)
Hits of length 18
queryLength: 10
subjectLength: 10
    queryHits subjectHits 
     <integer>   <integer> 
 1           1           1 
 2           1           5 
 3           2           2 
 4           2           3 
 5           2           4 
 ...       ...         ... 
 14          5           6 
 15          6           5 
 16          6           6 
 17          7           7 
 18          8           7 
> countOverlaps(gr, gr1)
A B C d e f g h i j 
2 3 3 3 3 2 1 1 0 0 
> gr %in% gr1
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> coverage(gr)
RleList of length 3
$chr1
integer-Rle of length 1000 with 4 runs
  Lengths:   4   1   5 990
  Values :   1   2   3   0
 
$chr2
integer-Rle of length 2000 with 5 runs
  Lengths:    1    1    1    7 1990
  Values :    0    1    2    3    0
 
$chr3
integer-Rle of length 1500 with 6 runs
  Lengths:    6    1    1    1    1 1490
  Values :    0    1    2    3    4    0
> gr[1:3]
GRanges with 3 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  A     chr1   [1, 10]      - |         5                 1
  B     chr2   [2, 10]      + |         4 0.888888888888889
  C     chr2   [3, 10]      + |         3 0.777777777777778
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500
> gr[3]
GRanges with 1 range and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  C     chr2   [3, 10]      + |         3 0.777777777777778
  ---
  seqlengths:
   chr1 chr2 chr3
   1000 2000 1500

Leave a Reply

  

  

  

%d 博主赞过: