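The package introduced today is part of the services offered by the cBio Cancer Genomics Portal. If you use the portal regularly, the material in this section should make working with it noticeably smoother.
What follows closely tracks the package's help files, so if you want the full detail, reading the original documentation is the better option.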
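Quick start
The Cancer Genomic Data Server library (CGDS-R) is very similar to biomaRt: it lets you read data from a remote server directly from R. Because requests to the database must follow a particular format, CGDS-R takes care of building well-formed queries for us.
## Install the CGDS-R library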
install.packages("cgdsr")
Connect to the database. This step is essentially fixed, because at the moment Memorial Sloan-Kettering Cancer Center (MSKCC) appears to be the only data source.
library(cgdsr)
mycgds <- CGDS("http://www.cbioportal.org/public-portal/")
The first time you use CGDS-R, it is a good idea to run test() and check that everything works.
test(mycgds)
## getCancerStudies... OK
## getCaseLists (1/2) ... OK
## getCaseLists (2/2) ... OK
## getGeneticProfiles (1/2) ... OK
## getGeneticProfiles (2/2) ... OK
## getClinicalData (1/1) ... OK
## getProfileData (1/6) ... OK
## getProfileData (2/6) ... OK
## getProfileData (3/6) ... OK
## getProfileData (4/6) ... OK
## getProfileData (5/6) ... OK
## getProfileData (6/6) ... OK
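To read data, we first need to know how to query it. Anyone familiar with SQL will recognize the pattern: find out which tables the database contains, look at the column headers of the table we want, and then set filter conditions to suit our needs. The steps with CGDS-R are roughly the same.
## List the available datasets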
studies <- getCancerStudies(mycgds)
studies[1:3, 1:2]
##   cancer_study_id                                             name
## 1   laml_tcga_pub         Acute Myeloid Leukemia (TCGA, NEJM 2013)
## 2       laml_tcga       Acute Myeloid Leukemia (TCGA, Provisional)
## 3      acyc_mskcc Adenoid Cystic Carcinoma (MSKCC, Nat Genet 2013)
## Suppose we are interested in the TCGA Lung squamous cell carcinoma (LUSC) data
studies[grepl("Lung squamous cell carcinoma", studies[, 2], ignore.case=TRUE), 1:2]
##    cancer_study_id                                             name
## 54   lusc_tcga_pub Lung Squamous Cell Carcinoma (TCGA, Nature 2012)
## 55       lusc_tcga Lung Squamous Cell Carcinoma (TCGA, Provisional)
## The cancer_study_id is effectively the name of the dataset. We pick one dataset, the Nature 2012 one.
lusc2012 <- "lusc_tcga_pub"
## List the tables available in the LUSC dataset
tables <- getCaseLists(mycgds, lusc2012)
dim(tables)
## [1] 17 5
tables[, -5]
## case_list_id
## 1 lusc_tcga_pub_3way_complete
## 2 lusc_tcga_pub_all
## 3 lusc_tcga_pub_basal
## 4 lusc_tcga_pub_classical
## 5 lusc_tcga_pub_primitive
## 6 lusc_tcga_pub_secretory
## 7 lusc_tcga_pub_icluster1
## 8 lusc_tcga_pub_icluster2
## 9 lusc_tcga_pub_icluster3
## 10 lusc_tcga_pub_sequenced
## 11 lusc_tcga_pub_cna
## 12 lusc_tcga_pub_log2CNA
## 13 lusc_tcga_pub_methylation_hm27
## 14 lusc_tcga_pub_microrna
## 15 lusc_tcga_pub_mrna
## 16 lusc_tcga_pub_rna_seq_mrna
## 17 lusc_tcga_pub_cnaseq
## case_list_name
## 1 All Complete Tumors
## 2 All Tumors
## 3 Expression Subtype: Basal
## 4 Expression Subtype: Classical
## 5 Expression Subtype: Primitive
## 6 Expression Subtype: Secretory
## 7 iCluster Subtype 1
## 8 iCluster Subtype 2
## 9 iCluster Subtype 3
## 10 Sequenced Tumors
## 11 Tumors CNA
## 12 Tumors log2 copy-number
## 13 Tumors with methylation data (HM27)
## 14 Tumors with microRNA data (microRNA-Seq)
## 15 Tumors with mRNA data (Agilent microarray)
## 16 Tumors with mRNA data (RNA Seq)
## 17 Tumors with sequencing and CNA data
## case_list_description
## 1 All tumor samples that have mRNA, CNA and sequencing data (178 samples)
## 2 All tumor samples (178 samples)
## 3 Basal expression subtype (43 samples)
## 4 Classical expression subtype (65 samples)
## 5 Primitive expression subtype (27 samples)
## 6 Secretory expression subtype (43 samples)
## 7 iCluster Subtype 1 (48 samples)
## 8 iCluster Subtype 2 (62 samples)
## 9 iCluster Subtype 3 (68 samples)
## 10 All sequenced samples (178 samples)
## 11 All tumors with CNA data (178 samples)
## 12 All tumors with log2 copy-number data (178 samples)
## 13 All samples with methylation (HM27) data (104 samples)
## 14 All samples with microRNA data (110 samples)
## 15 All samples with mRNA expression data (121 samples)
## 16 All samples with mRNA expression data (178 samples)
## 17 All tumor samples that have CNA and sequencing data (178 samples)
## cancer_study_id
## 1 66
## 2 66
## 3 66
## 4 66
## 5 66
## 6 66
## 7 66
## 8 66
## 9 66
## 10 66
## 11 66
## 12 66
## 13 66
## 14 66
## 15 66
## 16 66
## 17 66
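Each case_list_description in the listing ends with a sample count such as "(178 samples)", which is handy when deciding which table to query. The following is a minimal sketch of pulling that number out with base R. It runs offline on a few rows copied by hand from the output above, and it assumes the trailing "(N samples)" pattern holds for every description; `case_lists`, `n_samples` and `largest` are illustrative names, not part of cgdsr.

```r
## A few case_list_id / description pairs copied from the listing above.
case_lists <- data.frame(
  case_list_id = c("lusc_tcga_pub_3way_complete",
                   "lusc_tcga_pub_methylation_hm27",
                   "lusc_tcga_pub_microrna",
                   "lusc_tcga_pub_mrna"),
  case_list_description = c(
    "All tumor samples that have mRNA, CNA and sequencing data (178 samples)",
    "All samples with methylation (HM27) data (104 samples)",
    "All samples with microRNA data (110 samples)",
    "All samples with mRNA expression data (121 samples)"),
  stringsAsFactors = FALSE
)

## Assumption: every description ends with "(N samples)".
## Capture the digits inside that final parenthesis.
case_lists$n_samples <- as.integer(
  sub(".*\\(([0-9]+) samples\\)$", "\\1", case_lists$case_list_description)
)

## Sort from largest to smallest and remember the largest list.
case_lists[order(-case_lists$n_samples), c("case_list_id", "n_samples")]
largest <- case_lists$case_list_id[which.max(case_lists$n_samples)]
```

With a live connection, the same two lines could presumably be applied to the `tables` data frame returned by getCaseLists() instead of the hand-copied rows.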
## We want expression values, mutations and copy-number changes for some genes in LUSC, so we pick table 1
table <- "lusc_tcga_pub_3way_complete"
## Then get the headers
header <- getGeneticProfiles(mycgds, lusc2012)
header[, -3]
## genetic_profile_id
## 1 lusc_tcga_pub_gistic
## 2 lusc_tcga_pub_rna_seq_mrna_median_Zscores
## 3 lusc_tcga_pub_mrna_median_Zscores
## 4 lusc_tcga_pub_rna_seq_mrna
## 5 lusc_tcga_pub_log2CNA
## 6 lusc_tcga_pub_methylation_hm27
## 7 lusc_tcga_pub_mutations
## 8 lusc_tcga_pub_mirna
## 9 lusc_tcga_pub_mirna_median_Zscores
## 10 lusc_tcga_pub_mrna_merged_median_Zscores
## 11 lusc_tcga_pub_mrna
## genetic_profile_name cancer_study_id
## 1 Putative copy-number alterations from GISTIC 66
## 2 mRNA Expression z-Scores (RNA Seq RPKM) 66
## 3 mRNA Expression z-Scores (microarray) 66
## 4 mRNA expression (RNA Seq RPKM) 66
## 5 Log2 copy-number values 66
## 6 Methylation (HM27) 66
## 7 Mutations 66
## 8 microRNA expression 66
## 9 microRNA expression Z-scores 66
## 10 mRNA/miRNA expression Z-scores (all genes) 66
## 11 mRNA expression (microarray) 66
## genetic_alteration_type show_profile_in_analysis_tab
## 1 COPY_NUMBER_ALTERATION true
## 2 MRNA_EXPRESSION true
## 3 MRNA_EXPRESSION true
## 4 MRNA_EXPRESSION false
## 5 COPY_NUMBER_ALTERATION false
## 6 METHYLATION false
## 7 MUTATION_EXTENDED true
## 8 MRNA_EXPRESSION false
## 9 MRNA_EXPRESSION false
## 10 MRNA_EXPRESSION true
## 11 MRNA_EXPRESSION false
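## You may wonder why the table name was not passed in when fetching the headers.
## I am puzzled by this too. My guess is that unsuitable headers are filtered out automatically when the data is fetched in the next step.
## Now we can query the data. For example, suppose we want the expression values and other profiles of two DNA damage repair genes
header <- c("lusc_tcga_pub_rna_seq_mrna",
            "lusc_tcga_pub_gistic",
            "lusc_tcga_pub_mutations")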
BRCA1 <- getProfileData(mycgds, "BRCA1", header, table)
dim(BRCA1)
## [1] 178 3
head(BRCA1)
##                 lusc_tcga_pub_rna_seq_mrna lusc_tcga_pub_gistic
## TCGA.18.3406.01                   0.722159             0.000000
## TCGA.18.3407.01                   2.157836             0.000000
## TCGA.18.3408.01                   2.154682             0.000000
## TCGA.18.3409.01           1.54504826804376                    0
## TCGA.18.3410.01                   3.905734             0.000000
## TCGA.18.3411.01                   2.869195             0.000000
##                 lusc_tcga_pub_mutations
## TCGA.18.3406.01                    <NA>
## TCGA.18.3407.01                    <NA>
## TCGA.18.3408.01                    <NA>
## TCGA.18.3409.01                   V525A
## TCGA.18.3410.01                    <NA>
## TCGA.18.3411.01                    <NA>
## The call above fetches several profiles for one gene. Next, fetch the same profile for several genes
data <- getProfileData(mycgds, c("BRCA1", "BRCA2"), "lusc_tcga_pub_gistic", table)
head(data)
##                 BRCA1 BRCA2
## TCGA.18.3406.01     0     0
## TCGA.18.3407.01     0     0
## TCGA.18.3408.01     0     0
## TCGA.18.3409.01     0     0
## TCGA.18.3410.01     0    -1
## TCGA.18.3411.01     0    -1
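The GISTIC profile returns integer copy-number calls. The text above does not spell out the coding; the portal's documentation describes it as -2 homozygous deletion, -1 hemizygous deletion, 0 neutral, 1 gain and 2 high-level amplification. Taking that mapping as an assumption, here is a small sketch that turns the calls into readable labels and counts them. It runs offline on the six rows printed above, retyped by hand; `gistic`, `label_cna` and `brca2_counts` are illustrative names.

```r
## GISTIC calls for the six samples printed above, retyped by hand.
gistic <- data.frame(
  BRCA1 = c(0, 0, 0, 0, 0, 0),
  BRCA2 = c(0, 0, 0, 0, -1, -1),
  row.names = c("TCGA.18.3406.01", "TCGA.18.3407.01", "TCGA.18.3408.01",
                "TCGA.18.3409.01", "TCGA.18.3410.01", "TCGA.18.3411.01")
)

## Assumed coding of the calls (from the portal's documentation, not this post).
cna_labels <- c("homozygous deletion", "hemizygous deletion", "neutral",
                "gain", "high-level amplification")

## Map the integer codes -2..2 onto labelled factor levels.
label_cna <- function(x) factor(x, levels = -2:2, labels = cna_labels)

## summary() on a factor gives the number of samples per level.
brca2_counts <- summary(label_cna(gistic$BRCA2))
brca2_counts
```

Keeping all five levels in the factor means categories with no samples still show up with a count of zero, which makes results comparable across genes.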
## However, we cannot fetch multiple profiles for multiple genes in a single call.
getProfileData(mycgds, c("BRCA1", "BRCA2"), header, table)
## [1] Error..You.can.specify.multiple.genes.or.multiple.genetic.profiles..but.not.both.at.once.
## <0 rows> (or 0-length row.names)
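One way around this restriction is to issue one request per gene and collect the results in a named list. The sketch below shows the pattern only. `fetch_per_gene` and `fake_fetch` are hypothetical helpers, not cgdsr functions, and `fake_fetch` returns made-up placeholder values so that the block runs without a network connection.

```r
## Call `fetch_one` once per gene and return a list named by gene symbol.
fetch_per_gene <- function(genes, fetch_one) {
  results <- lapply(genes, fetch_one)
  names(results) <- genes
  results
}

## Offline stand-in for the server call.
## The values are placeholders for illustration, NOT real TCGA data.
fake_fetch <- function(gene) {
  data.frame(
    lusc_tcga_pub_rna_seq_mrna = c(1.0, 2.0),
    lusc_tcga_pub_gistic       = c(0, -1),
    lusc_tcga_pub_mutations    = c(NA_character_, NA_character_),
    row.names = c("SAMPLE.1", "SAMPLE.2")
  )
}

profiles <- fetch_per_gene(c("BRCA1", "BRCA2"), fake_fetch)
names(profiles)
str(profiles$BRCA2)
```

With a live connection, the stand-in could be replaced by something like `function(gene) getProfileData(mycgds, gene, header, table)`, which is the single-gene, multi-profile form that worked earlier in this post.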
## To draw a survival curve, we need the clinical data
clinicaldata <- getClinicalData(mycgds, table)
## Unfortunately, nothing came back. Let's try a different case list
clinicaldata <-
  getClinicalData(mycgds,
                  getCaseLists(mycgds,
                               getCancerStudies(mycgds)[2, 1])[1, 1])
clinicaldata[1:5, 1:5]
##                 AGE CYTOGENETIC_ABNORMALITY_TYPE DAYS_TO_BIRTH
## TCGA.AB.2992.03  32                       Normal        -11839
## TCGA.AB.2980.03  50                    t (15;17)        -18506
## TCGA.AB.2806.03  46                     t (8;21)        -16892
## TCGA.AB.2910.03  61                       Normal        -22311
## TCGA.AB.2991.03  40          Trisomy 8|t (15;17)        -14885
##                 DAYS_TO_DEATH DAYS_TO_LAST_FOLLOWUP
## TCGA.AB.2992.03          1706                    NA
## TCGA.AB.2980.03            NA                   699
## TCGA.AB.2806.03           945                    NA
## TCGA.AB.2910.03             0                    NA
## TCGA.AB.2991.03            NA                  1826
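A survival curve needs a follow-up time and an event indicator for each patient, and the two DAYS_TO_* columns above can be combined into both. The sketch below is one possible way to do it, under the assumption that a recorded DAYS_TO_DEATH means the event occurred and that everyone else is censored at DAYS_TO_LAST_FOLLOWUP. It runs offline on the five rows shown above, retyped by hand; `clin`, `time` and `status` are illustrative names.

```r
## The five rows printed above, retyped by hand so the sketch runs offline.
clin <- data.frame(
  DAYS_TO_DEATH         = c(1706, NA, 945, 0, NA),
  DAYS_TO_LAST_FOLLOWUP = c(NA, 699, NA, NA, 1826),
  row.names = c("TCGA.AB.2992.03", "TCGA.AB.2980.03", "TCGA.AB.2806.03",
                "TCGA.AB.2910.03", "TCGA.AB.2991.03")
)

## Assumption: a recorded death is an event (status 1); otherwise the
## patient is censored (status 0) at the last follow-up.
clin$status <- as.integer(!is.na(clin$DAYS_TO_DEATH))
clin$time   <- ifelse(clin$status == 1,
                      clin$DAYS_TO_DEATH,
                      clin$DAYS_TO_LAST_FOLLOWUP)

clin[, c("time", "status")]
```

From here, the survival package could fit a Kaplan-Meier estimate with `survival::survfit(survival::Surv(time, status) ~ 1, data = clin)`. Five patients are far too few for a meaningful curve, so this only illustrates the data preparation.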
To learn more, read the package's help documentation.