An introduction to cBioPortal's CGDS-R library

The package covered here is part of the services provided by the cBio Cancer Genomics Portal. If you use the portal regularly, this section should help you get more out of it.

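What follows is essentially a walk-through of the package's help files, so if you are comfortable with the original documentation, reading it directly is the better option.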

Quick start

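The CGDS-R library (package name cgdsr) is an R interface to the Cancer Genomics Data Server (CGDS). Much like biomaRt, it lets you read data from a remote server directly from R. Queries to the database have to follow a particular format, and CGDS-R takes care of building them properly.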

## Install the CGDS-R library
install.packages("cgdsr")

Connect to the database. This step is essentially fixed, because at present Memorial Sloan-Kettering Cancer Center (MSKCC) appears to be the only data provider.

library(cgdsr)
mycgds <- CGDS("http://www.cbioportal.org/public-portal/")

The first time you use CGDS-R, it is a good idea to run test() to check that everything works.

test(mycgds)
## getCancerStudies...  OK
## getCaseLists (1/2) ...  OK
## getCaseLists (2/2) ...  OK
## getGeneticProfiles (1/2) ...  OK
## getGeneticProfiles (2/2) ...  OK
## getClinicalData (1/1) ...  OK
## getProfileData (1/6) ...  OK
## getProfileData (2/6) ...  OK
## getProfileData (3/6) ...  OK
## getProfileData (4/6) ...  OK
## getProfileData (5/6) ...  OK
## getProfileData (6/6) ...  OK

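To read data, we first need to know how to query it. Anyone familiar with SQL will recognise the pattern: find out which tables the database contains, look at the column headers of the table you want, then set filter conditions as needed. The steps with CGDS-R are roughly the same: list the cancer studies, list the case lists (sample sets) and genetic profiles (data types) within a study, then request data for specific genes.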

## List the available datasets (cancer studies)
studies <- getCancerStudies(mycgds)
studies[1:3, 1:2]
##   cancer_study_id                                             name
## 1   laml_tcga_pub         Acute Myeloid Leukemia (TCGA, NEJM 2013)
## 2       laml_tcga       Acute Myeloid Leukemia (TCGA, Provisional)
## 3      acyc_mskcc Adenoid Cystic Carcinoma (MSKCC, Nat Genet 2013)
## Suppose we are interested in TCGA lung squamous cell carcinoma (LUSC)
studies[grepl("Lung squamous cell carcinoma", studies[, 2], ignore.case=TRUE), 1:2]
##    cancer_study_id                                             name
## 54   lusc_tcga_pub Lung Squamous Cell Carcinoma (TCGA, Nature 2012)
## 55       lusc_tcga Lung Squamous Cell Carcinoma (TCGA, Provisional)
## cancer_study_id is the identifier of a dataset; we pick the Nature 2012 one.
lusc2012 <- "lusc_tcga_pub"
## List the tables (case lists) available in the LUSC dataset
tables <- getCaseLists(mycgds, lusc2012)
dim(tables)
## [1] 17  5
tables[, -5]
##                      case_list_id
## 1     lusc_tcga_pub_3way_complete
## 2               lusc_tcga_pub_all
## 3             lusc_tcga_pub_basal
## 4         lusc_tcga_pub_classical
## 5         lusc_tcga_pub_primitive
## 6         lusc_tcga_pub_secretory
## 7         lusc_tcga_pub_icluster1
## 8         lusc_tcga_pub_icluster2
## 9         lusc_tcga_pub_icluster3
## 10        lusc_tcga_pub_sequenced
## 11              lusc_tcga_pub_cna
## 12          lusc_tcga_pub_log2CNA
## 13 lusc_tcga_pub_methylation_hm27
## 14         lusc_tcga_pub_microrna
## 15             lusc_tcga_pub_mrna
## 16     lusc_tcga_pub_rna_seq_mrna
## 17           lusc_tcga_pub_cnaseq
##                                case_list_name
## 1                         All Complete Tumors
## 2                                  All Tumors
## 3                  Expression Subtype: Basal 
## 4              Expression Subtype: Classical 
## 5              Expression Subtype: Primitive 
## 6              Expression Subtype: Secretory 
## 7                         iCluster Subtype 1 
## 8                         iCluster Subtype 2 
## 9                          iCluster Subtype 3
## 10                           Sequenced Tumors
## 11                                 Tumors CNA
## 12                    Tumors log2 copy-number
## 13        Tumors with methylation data (HM27)
## 14   Tumors with microRNA data (microRNA-Seq)
## 15 Tumors with mRNA data (Agilent microarray)
## 16            Tumors with mRNA data (RNA Seq)
## 17        Tumors with sequencing and CNA data
##                                                      case_list_description
## 1  All tumor samples that have mRNA, CNA and sequencing data (178 samples)
## 2                                          All tumor samples (178 samples)
## 3                                    Basal expression subtype (43 samples)
## 4                                Classical expression subtype (65 samples)
## 5                                Primitive expression subtype (27 samples)
## 6                                Secretory expression subtype (43 samples)
## 7                                          iCluster Subtype 1 (48 samples)
## 8                                          iCluster Subtype 2 (62 samples)
## 9                                          iCluster Subtype 3 (68 samples)
## 10                                     All sequenced samples (178 samples)
## 11                                  All tumors with CNA data (178 samples)
## 12                     All tumors with log2 copy-number data (178 samples)
## 13                  All samples with methylation (HM27) data (104 samples)
## 14                            All samples with microRNA data (110 samples)
## 15                     All samples with mRNA expression data (121 samples)
## 16                     All samples with mRNA expression data (178 samples)
## 17       All tumor samples that have CNA and sequencing data (178 samples)
##    cancer_study_id
## 1               66
## 2               66
## 3               66
## 4               66
## 5               66
## 6               66
## 7               66
## 8               66
## 9               66
## 10              66
## 11              66
## 12              66
## 13              66
## 14              66
## 15              66
## 16              66
## 17              66
## We want expression values, mutations and copy-number changes for some genes in LUSC, so we choose table 1
table <- "lusc_tcga_pub_3way_complete"
## Then fetch the headers (genetic profiles)
header <- getGeneticProfiles(mycgds, lusc2012)
header[, -3]
##                           genetic_profile_id
## 1                       lusc_tcga_pub_gistic
## 2  lusc_tcga_pub_rna_seq_mrna_median_Zscores
## 3          lusc_tcga_pub_mrna_median_Zscores
## 4                 lusc_tcga_pub_rna_seq_mrna
## 5                      lusc_tcga_pub_log2CNA
## 6             lusc_tcga_pub_methylation_hm27
## 7                    lusc_tcga_pub_mutations
## 8                        lusc_tcga_pub_mirna
## 9         lusc_tcga_pub_mirna_median_Zscores
## 10  lusc_tcga_pub_mrna_merged_median_Zscores
## 11                        lusc_tcga_pub_mrna
##                            genetic_profile_name cancer_study_id
## 1  Putative copy-number alterations from GISTIC              66
## 2       mRNA Expression z-Scores (RNA Seq RPKM)              66
## 3         mRNA Expression z-Scores (microarray)              66
## 4                mRNA expression (RNA Seq RPKM)              66
## 5                       Log2 copy-number values              66
## 6                            Methylation (HM27)              66
## 7                                     Mutations              66
## 8                           microRNA expression              66
## 9                  microRNA expression Z-scores              66
## 10   mRNA/miRNA expression Z-scores (all genes)              66
## 11                 mRNA expression (microarray)              66
##    genetic_alteration_type show_profile_in_analysis_tab
## 1   COPY_NUMBER_ALTERATION                         true
## 2          MRNA_EXPRESSION                         true
## 3          MRNA_EXPRESSION                         true
## 4          MRNA_EXPRESSION                        false
## 5   COPY_NUMBER_ALTERATION                        false
## 6              METHYLATION                        false
## 7        MUTATION_EXTENDED                         true
## 8          MRNA_EXPRESSION                        false
## 9          MRNA_EXPRESSION                        false
## 10         MRNA_EXPRESSION                         true
## 11         MRNA_EXPRESSION                        false
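## You may wonder why the table name is not passed in when fetching the headers.
## This puzzled me too. Since getGeneticProfiles() takes only the study ID, the profiles appear to be
## defined per study, with the case list merely selecting which samples are returned in the next step.
## Now we can query the data. For example, we look at the expression values (and more) of two DNA damage repair genes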
header <- c("lusc_tcga_pub_rna_seq_mrna", 
          "lusc_tcga_pub_gistic", 
          "lusc_tcga_pub_mutations")
BRCA1 <- getProfileData(mycgds, "BRCA1", header, table)
dim(BRCA1)
## [1] 178   3
head(BRCA1)
##                 lusc_tcga_pub_rna_seq_mrna lusc_tcga_pub_gistic
## TCGA.18.3406.01                   0.722159             0.000000
## TCGA.18.3407.01                   2.157836             0.000000
## TCGA.18.3408.01                   2.154682             0.000000
## TCGA.18.3409.01           1.54504826804376                    0
## TCGA.18.3410.01                   3.905734             0.000000
## TCGA.18.3411.01                   2.869195             0.000000
##                 lusc_tcga_pub_mutations
## TCGA.18.3406.01                    <NA>
## TCGA.18.3407.01                    <NA>
## TCGA.18.3408.01                    <NA>
## TCGA.18.3409.01                   V525A
## TCGA.18.3410.01                    <NA>
## TCGA.18.3411.01                    <NA>
## The call above fetches several profiles for one gene. Next, fetch the same profile for several genes
data <- getProfileData(mycgds, c("BRCA1", "BRCA2"), "lusc_tcga_pub_gistic", table)
head(data)
##                 BRCA1 BRCA2
## TCGA.18.3406.01     0     0
## TCGA.18.3407.01     0     0
## TCGA.18.3408.01     0     0
## TCGA.18.3409.01     0     0
## TCGA.18.3410.01     0    -1
## TCGA.18.3411.01     0    -1
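
The GISTIC values above are discrete copy-number calls, which cBioPortal encodes as -2 (deep deletion), -1 (shallow deletion), 0 (diploid), 1 (gain) and 2 (amplification). A minimal sketch of how such a result could be tabulated per gene follows. It runs offline on a mock data frame copied from the six rows shown above; the names `gistic`, `cna_labels` and `cna_counts` are made up for this illustration.

```r
# Mock data shaped like the GISTIC result above (first six samples only).
gistic <- data.frame(
  BRCA1 = c(0, 0, 0, 0, 0, 0),
  BRCA2 = c(0, 0, 0, 0, -1, -1),
  row.names = c("TCGA.18.3406.01", "TCGA.18.3407.01", "TCGA.18.3408.01",
                "TCGA.18.3409.01", "TCGA.18.3410.01", "TCGA.18.3411.01")
)

# Labels for the five GISTIC levels, in the order -2, -1, 0, 1, 2.
cna_labels <- c("deep_deletion", "shallow_deletion", "diploid",
                "gain", "amplification")

# Count how many samples fall into each level, one column per gene.
# Fixing the factor levels keeps levels with zero samples in the table.
cna_counts <- sapply(gistic, function(x) {
  table(factor(x, levels = -2:2, labels = cna_labels))
})
cna_counts
```

With a live connection, the same tabulation could be applied to the `data` object returned by getProfileData() above, assuming its columns hold these five values.
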
## However, we cannot fetch several profiles for several genes in a single call.
getProfileData(mycgds, c("BRCA1", "BRCA2"), header, table)
## [1] Error..You.can.specify.multiple.genes.or.multiple.genetic.profiles..but.not.both.at.once.
## <0 rows> (or 0-length row.names)
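
One way around this restriction is to issue one query per profile and keep the results in a named list. The sketch below is only an illustration: `get_profiles_per_gene_set` and `mock_fetch` are hypothetical helpers, not part of cgdsr, and the mock stands in for the real query so that the example runs without a network connection.

```r
# Hypothetical helper (not part of cgdsr): call `fetch` once per profile
# and return a named list of data frames, one element per profile.
# `fetch` is any function(genes, profile) that returns a data frame.
get_profiles_per_gene_set <- function(fetch, genes, profiles) {
  out <- lapply(profiles, function(p) fetch(genes, p))
  names(out) <- profiles
  out
}

# Offline stand-in for the real query: returns a samples-by-genes
# data frame filled with zeros, whatever the profile.
mock_fetch <- function(genes, profile) {
  samples <- c("S1", "S2", "S3")
  as.data.frame(
    matrix(0, nrow = length(samples), ncol = length(genes),
           dimnames = list(samples, genes))
  )
}

res <- get_profiles_per_gene_set(
  mock_fetch,
  genes    = c("BRCA1", "BRCA2"),
  profiles = c("lusc_tcga_pub_rna_seq_mrna", "lusc_tcga_pub_gistic")
)
names(res)
dim(res[["lusc_tcga_pub_gistic"]])
```

With a live connection, `fetch` could be something like `function(g, p) getProfileData(mycgds, g, p, table)`, so that each call stays within the "multiple genes, single profile" form that the server accepts.
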
## To draw a survival curve, we need the clinical data
clinicaldata <- getClinicalData(mycgds, table)
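## Unfortunately, nothing came back for this case list. Try another one:
## the first case list of the second study returned by getCancerStudies()
study2 <- getCancerStudies(mycgds)[2, 1]
caselist2 <- getCaseLists(mycgds, study2)[1, 1]
clinicaldata <- getClinicalData(mycgds, caselist2)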
clinicaldata[1:5, 1:5]
##                 AGE CYTOGENETIC_ABNORMALITY_TYPE DAYS_TO_BIRTH
## TCGA.AB.2992.03  32                       Normal        -11839
## TCGA.AB.2980.03  50                    t (15;17)        -18506
## TCGA.AB.2806.03  46                     t (8;21)        -16892
## TCGA.AB.2910.03  61                       Normal        -22311
## TCGA.AB.2991.03  40          Trisomy 8|t (15;17)        -14885
##                 DAYS_TO_DEATH DAYS_TO_LAST_FOLLOWUP
## TCGA.AB.2992.03          1706                    NA
## TCGA.AB.2980.03            NA                   699
## TCGA.AB.2806.03           945                    NA
## TCGA.AB.2910.03             0                    NA
## TCGA.AB.2991.03            NA                  1826
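
As a rough sketch of the survival-curve step mentioned above, the clinical table can be turned into a time/status pair and passed to the survival package. The example runs offline on the five rows printed above. It assumes that DAYS_TO_DEATH and DAYS_TO_LAST_FOLLOWUP are numeric and that a missing DAYS_TO_DEATH means the patient was censored at last follow-up. `make_surv_table` is a hypothetical helper, not part of cgdsr.

```r
library(survival)  # recommended package shipped with standard R installations

# Hypothetical helper: derive follow-up time and event status.
# Assumption: NA in DAYS_TO_DEATH means censored at DAYS_TO_LAST_FOLLOWUP.
make_surv_table <- function(clin) {
  died <- !is.na(clin$DAYS_TO_DEATH)
  data.frame(
    time      = ifelse(died, clin$DAYS_TO_DEATH, clin$DAYS_TO_LAST_FOLLOWUP),
    status    = as.integer(died),  # 1 = death observed, 0 = censored
    row.names = rownames(clin)
  )
}

# Mock data: the five rows shown in the output above.
clin <- data.frame(
  DAYS_TO_DEATH         = c(1706, NA, 945, 0, NA),
  DAYS_TO_LAST_FOLLOWUP = c(NA, 699, NA, NA, 1826),
  row.names = c("TCGA.AB.2992.03", "TCGA.AB.2980.03", "TCGA.AB.2806.03",
                "TCGA.AB.2910.03", "TCGA.AB.2991.03")
)

surv_tab <- make_surv_table(clin)

# Kaplan-Meier estimate over all samples (no grouping variable).
fit <- survfit(Surv(time, status) ~ 1, data = surv_tab)
surv_tab
```

Calling plot(fit) would then draw the curve. On real data, it is worth checking the column types and the number of missing values first, since the assumptions above may not hold for every study.
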

To learn more, read the package's help documentation.
