bioconductor系列教程之三–Affymetrix 外显子(exon)分析综合

确定alternative splicing events

使用MiDAS
计算MiDAS需要使用到APT.为些我们需要准备一个cels.txt文件。在R中

cel_files<-colnames(d.gene)
group_id<-c(rep("ctrl",5),rep("expr",5))
cels<-cbind(cel_files,group_id)
write.table(cels,"cels.txt",quote=F,sep="\t",row.names=F)

然后运行APT

apt-midas –cel-files cels.txt –g ./OUT_GENE/rma-sketch_core_gene.summary_filtered.txt –e ./OUT_EXON/rmasketch_core_exon.summary_filtered.txt –m HuEx-1_0-st-v2.r2.core.mps –o ./OUT_MIDAS –nol

最后我们将其结果读入R,并以q.value<0.05为条件确定alternative splicing events。

midas <- read.table("midas.pvalues.txt", sep="\t", header=T)
midas <- midas[midas$pvalue<0.05,]
o <- order(midas$pvalue)
midas.ordered <- midas[o,]

使用SI/LIMMA

ni<-d.exon.fil2-d.gene.fil[annot.map[row.names(d.exon.fil2)],]#计算NI值
ni<-ni[!is.na(ni[,1]),]#去除被filter掉的基因中的外显子
g<-factor(c(rep("ctrl",length(ctrl)),rep("expr",length(expr))))#设置design
design<-model.matrix(~-1+g)
colnames(design)<-gsub("^g","",colnames(design))#设置contrast.matrix
constr<-"expr-ctrl"
contrast.matrix<-makeContrasts(contrasts=constr,levels=design)
fit<-lmFit(ni,design)#do fit
fit1<-contrasts.fit(fit,contrast.matrix)
fit2<-eBayes(fit1)
limma.res<-topTable(fit2,coef=constr,n=nrow(fit2))
limma.res<-limma.res[limma.res$adj.P.Val<0.01,]
write.csv(limma.res,"si.limma.csv",row.names=F)

使用FIRMA

firma<-FirmaModel(plmTr)
fit(firma,verbose=verbose)
fs<-getFirmaScores(firma)
fsRes<-extractDataFrame(fs, units=NULL, addNames=TRUE)
fsLog2<-log2(fsRes[,6:ncol(fsRes)])
fsLog2<-cbind(fsRes[,1:5],fsLog2)
write.csv(fsLog2,"FIRMA.score.core.csv",row.names=F)

使用MADS以及ASA

#使用MADS确定alternative splicing events
param = jetta.do.mads(CELs,CLF,PGF,ProbesetAnnotationFile=PSA,SummarizationMethod="median_polish",SummarizationTarget='tc',MADSTTestType="unpaired",MADSSampleClass=c(-1,-1,-1,1,1,1))
mads = jetta.read.mads.output(param)
mads.sel = jetta.asa.filtering(mads,tc.expr.cutoff=100,tc.fc.cutoff=1.5,dabg.cutoff=0.05,mads.cutoff=0.001,alternative.only=F)
mads.data = mads.sel$asa[,]
#使用ASA确定alternative splicing events
param = jetta.do.asa(CELs,CLF,PGF,ProbesetAnnotationFile=PSA,SummarizationMethod="median_polish",SummarizationTarget='tc',ASATestType="unpaired",ASASampleClass=c(-1,-1,-1,1,1,1))
asa = jetta.read.asa.output(param)
asa.sel = jetta.asa.filtering(asa,tc.expr.cutoff=100,tc.fc.cutoff=1.5,dabg.cutoff=0.05,midas.cutoff=0.01,alternative.only=F)
asa.data = asa.sel$asa[,]

使用ARH
按照作者的说法,当ARH值本身代表着alternative splicing event发生的可能性,当它大于0.03时就意味着很有可能是一个splicing事件。

#ARH计算方法来自于原作者
ARH = function(x = "numeric", y = "numeric", f = "character", na.rm = FALSE) {
# INPUT
#	x, y: exon expression; numeric vectors, x,y >= 0
#	f : gene IDs corresponding to x,y; groups the exon expressions to genes, character
#	x, y, f all of the same length
#	na.rm: how do you want to treat NA in the expressions? R standard is here implemented, but we use 'na.rm = TRUE'.
# OUTPUT
#	ARH values, numeric vector, length(ARH) == length(unique(f)), names(ARH) == names(unique(f))

	# negative exon expressions not interpretable
	if(any(x < 0, na.rm = TRUE) | any(y < 0, na.rm = TRUE)) stop("What are negative expressions?")
	
	# avoid division by zero and arbitrarily big numbers
	y[ y < 0.0001 ] = 0.0001

	# compute splicing probabilities, each exon compared to gene
	splicingDeviations = log2(x / y)
	splicingDeviationsMedian = split(splicingDeviations, f)
	splicingDeviationsMedian = sapply(X = splicingDeviationsMedian, FUN = median, na.rm = na.rm)
	splicingDeviationsMedian = splicingDeviationsMedian[ match(f, names(splicingDeviationsMedian)) ]
	splicingDeviations = 2^abs( splicingDeviations - splicingDeviationsMedian )
	rm(splicingDeviationsMedian)
	splicingProbabilitiesSum = split(splicingDeviations, f)
	splicingProbabilitiesSum = sapply(splicingProbabilitiesSum, sum, na.rm = na.rm)
	splicingProbabilitiesSum = splicingProbabilitiesSum[ match(f, names(splicingProbabilitiesSum)) ]
	splicingProbabilities = splicingDeviations / splicingProbabilitiesSum
	rm(splicingDeviations, splicingProbabilitiesSum)
	
	# compute ARH, each gene	
	entropy = split(splicingProbabilities, f)
	entropy = entropy[ match(unique(f), names(entropy)) ]
	entropy = sapply(X = entropy, FUN = function(X) return( -sum(X * log2(X), na.rm = na.rm) ))
	
	iqrQuotient = x / y
	iqrQuotient = split(iqrQuotient, f)
	iqrQuotient = iqrQuotient[ match(unique(f), names(iqrQuotient)) ]
	iqrQuotient = sapply(X = iqrQuotient, FUN = quantile, probs = c(0.25, 0.75), na.rm = TRUE)
	iqrQuotient = iqrQuotient[ "75%" ,  ] / iqrQuotient[ "25%" ,  ]

	geneLength = table(f)
	geneLength = geneLength[ match(unique(f), names(geneLength)) ]

	arh = as.numeric(iqrQuotient * (log2(geneLength) - entropy))
	names(arh) = unique(f)
	
	# filter for at least two finite expression values
	# (1) no splicing in genes with only one exon
	# (2) infinite values lead to extreme splicing indication, hard to interprete
	good = split(is.finite(x) & is.finite(y) & is.finite(splicingProbabilities), f = f)
	good = sapply(good, function(X) return(sum(X) >= 2))
	good = good[ match(unique(f), names(good)) ]
	arh[ !good ] = NA
	
	return(arh)
}

probe_map<-annot.map[annot.map %in% row.names(d.gene.fil)]
keep<-names(probe_map)#去除没有对应基因的外显子
d.exon.fil3<-d.exon.fil2[row.names(d.exon.fil2) %in% keep,]
probe_int<-2^(d.exon.fil3)
probe_map<-cbind(names(probe_map),probe_map)
colnames(probe_map)<-c("exon_id","gene_id")
Entropy_treat = split(x = probe_int[  , chip_treat ], f = probe_map[  , "exon_id" ])
Entropy_treat = Entropy_treat[ match(x = unique(probe_map[  , "exon_id" ]), table = names(Entropy_treat)) ]
Entropy_treat = sapply(Entropy_treat, median, na.rm = TRUE)
Entropy_ctrl = split(x = probe_int[  , chip_ctrl ], f = probe_map[  , "exon_id" ])
Entropy_ctrl = Entropy_ctrl[ match(x = unique(probe_map[  , "exon_id" ]), table = names(Entropy_ctrl)) ]
Entropy_ctrl = sapply(Entropy_ctrl, median, na.rm = TRUE)
ARH = ARH(x = Entropy_treat, y = Entropy_ctrl, f = probe_map[ !duplicated(probe_map[  , "exon_id" ]) , "gene_id" ], na.rm = TRUE)
ARH.res<-cbind(probe_int,arh=ARH[match(annot.map[rownames(probe_int)],names(ARH))])
ARH.res<-ARH.res[ARH.res[,"arh"]>0.03,]
ARH.res<-ARH.res[order(-ARH.res[,"arh"]),]

10 thoughts on “bioconductor系列教程之三–Affymetrix 外显子(exon)分析综合

  1. 您好!之前您写的primeview.db帮助非常大,非常感谢!

    于此同时,我发现 在运用nsFilter时 似乎不能载入primeview的环境,不知如何解决,特向您请教。谢谢!!

  2. 请问一个粗浅的问题:Affymetrix 外显子(exon)芯片与传统的3‘芯片相比,有没有太多的优势?为什么现今仍有很多研究者使用传统的3‘芯片?为什么传统的3‘芯片没有被淘汰?

    1. 这个问题很难回答啊。我感觉原因有以下几点:
      1,设计不同。3‘IVT芯片(基因芯片)的对象是基因3’末端(覆盖大约3‘端600bp),每个基因只有1至2组探针组(probeset)。所以它的样品一般是以polyA为引物产生的cDNA。而exon array(外显子芯片)覆盖整个基因的所有外显子,其样品是mRNA,而后是以不同引物(而不是单一的polyA)逆转录产生的cDNA。所以基因芯片的研究对象只能是基因表达水平,而外显子芯片则可能是transcripts水平。要注意的是,基因芯片面对同一基因不同结尾时,会区别对待。
      2,实验步骤不同。正如前面提到的,它们主要的区别在于逆转录的引物。
      3,探针组成不同。基因芯片针的探针组由11PM和11MM探针组成,而外显子芯片则是4PM。平均每个基因,基因芯片有16个探针,而外显子芯片会有147个探针。虽然探针的总数相对于基因芯片来说提高了,但是其精确程度却很依赖相互验证,这在实验水平来说很难保证。这就意味着对其的分析会更难。
      4,成本不同。这是最主要人们在研究基因表达水平时倾向使用前者的原因。

  3. 您好!请问一个问题,我在使用oligo时,输入“affyRaw<-read.celfiles(celFiles)”,为什么总是提示:Warning message:package ‘pd.hugene.1.0.st.v1’ was built under R version 3.2.0 ?

  4. 您好,老师。首先,谢谢您的博文,内容非常好!
    请问在“确定alternative splicing events”这一步中“使用SI/LIMMA”的代码里首行里的 annot.map 之前未定义过,在我运行时会提示错误,这应该怎么改正呢?

    1. 现在对于SI/LIMMA中的laternative splicing events的处理,不再使用SI了,请自己查看limma::diffSplice函数,对你会有帮助的。

发表评论

电子邮件地址不会被公开。 必填项已用*标注