确定alternative splicing events
使用MiDAS
计算MiDAS需要使用到APT.为些我们需要准备一个cels.txt文件。在R中
cel_files<-colnames(d.gene) group_id<-c(rep("ctrl",5),rep("expr",5)) cels<-cbind(cel_files,group_id) write.table(cels,"cels.txt",quote=F,sep="\t",row.names=F)
然后运行APT
apt-midas –cel-files cels.txt –g ./OUT_GENE/rma-sketch_core_gene.summary_filtered.txt –e ./OUT_EXON/rmasketch_core_exon.summary_filtered.txt –m HuEx-1_0-st-v2.r2.core.mps –o ./OUT_MIDAS –nol
最后我们将其结果读入R,并以q.value<0.05为条件确定alternative splicing events。
midas <- read.table("midas.pvalues.txt", sep="\t", header=T) midas <- midas[midas$pvalue<0.05,] o <- order(midas$pvalue) midas.ordered <- midas[o,]
使用SI/LIMMA
ni<-d.exon.fil2-d.gene.fil[annot.map[row.names(d.exon.fil2)],]#计算NI值 ni<-ni[!is.na(ni[,1]),]#去除被filter掉的基因中的外显子 g<-factor(c(rep("ctrl",length(ctrl)),rep("expr",length(expr))))#设置design design<-model.matrix(~-1+g) colnames(design)<-gsub("^g","",colnames(design))#设置contrast.matrix constr<-"expr-ctrl" contrast.matrix<-makeContrasts(contrasts=constr,levels=design) fit<-lmFit(ni,design)#do fit fit1<-contrasts.fit(fit,contrast.matrix) fit2<-eBayes(fit1) limma.res<-topTable(fit2,coef=constr,n=nrow(fit2)) limma.res<-limma.res[limma.res$adj.P.Val<0.01,] write.csv(limma.res,"si.limma.csv",row.names=F)
使用FIRMA
firma<-FirmaModel(plmTr) fit(firma,verbose=verbose) fs<-getFirmaScores(firma) fsRes<-extractDataFrame(fs, units=NULL, addNames=TRUE) fsLog2<-log2(fsRes[,6:ncol(fsRes)]) fsLog2<-cbind(fsRes[,1:5],fsLog2) write.csv(fsLog2,"FIRMA.score.core.csv",row.names=F)
使用MADS以及ASA
#使用MADS确定alternative splicing events param = jetta.do.mads(CELs,CLF,PGF,ProbesetAnnotationFile=PSA,SummarizationMethod="median_polish",SummarizationTarget='tc',MADSTTestType="unpaired",MADSSampleClass=c(-1,-1,-1,1,1,1)) mads = jetta.read.mads.output(param) mads.sel = jetta.asa.filtering(mads,tc.expr.cutoff=100,tc.fc.cutoff=1.5,dabg.cutoff=0.05,mads.cutoff=0.001,alternative.only=F) mads.data = mads.sel$asa[,] #使用ASA确定alternative splicing events param = jetta.do.asa(CELs,CLF,PGF,ProbesetAnnotationFile=PSA,SummarizationMethod="median_polish",SummarizationTarget='tc',ASATestType="unpaired",ASASampleClass=c(-1,-1,-1,1,1,1)) asa = jetta.read.asa.output(param) asa.sel = jetta.asa.filtering(asa,tc.expr.cutoff=100,tc.fc.cutoff=1.5,dabg.cutoff=0.05,midas.cutoff=0.01,alternative.only=F) asa.data = asa.sel$asa[,]
使用ARH
按照作者的说法,当ARH值本身代表着alternative splicing event发生的可能性,当它大于0.03时就意味着很有可能是一个splicing事件。
#ARH计算方法来自于原作者 ARH = function(x = "numeric", y = "numeric", f = "character", na.rm = FALSE) { # INPUT # x, y: exon expression; numeric vectors, x,y >= 0 # f : gene IDs corresponding to x,y; groups the exon expressions to genes, character # x, y, f all of the same length # na.rm: how do you want to treat NA in the expressions? R standard is here implemented, but we use 'na.rm = TRUE'. # OUTPUT # ARH values, numeric vector, length(ARH) == length(unique(f)), names(ARH) == names(unique(f)) # negative exon expressions not interpretable if(any(x < 0, na.rm = TRUE) | any(y < 0, na.rm = TRUE)) stop("What are negative expressions?") # avoid division by zero and arbitrarily big numbers y[ y < 0.0001 ] = 0.0001 # compute splicing probabilities, each exon compared to gene splicingDeviations = log2(x / y) splicingDeviationsMedian = split(splicingDeviations, f) splicingDeviationsMedian = sapply(X = splicingDeviationsMedian, FUN = median, na.rm = na.rm) splicingDeviationsMedian = splicingDeviationsMedian[ match(f, names(splicingDeviationsMedian)) ] splicingDeviations = 2^abs( splicingDeviations - splicingDeviationsMedian ) rm(splicingDeviationsMedian) splicingProbabilitiesSum = split(splicingDeviations, f) splicingProbabilitiesSum = sapply(splicingProbabilitiesSum, sum, na.rm = na.rm) splicingProbabilitiesSum = splicingProbabilitiesSum[ match(f, names(splicingProbabilitiesSum)) ] splicingProbabilities = splicingDeviations / splicingProbabilitiesSum rm(splicingDeviations, splicingProbabilitiesSum) # compute ARH, each gene entropy = split(splicingProbabilities, f) entropy = entropy[ match(unique(f), names(entropy)) ] entropy = sapply(X = entropy, FUN = function(X) return( -sum(X * log2(X), na.rm = na.rm) )) iqrQuotient = x / y iqrQuotient = split(iqrQuotient, f) iqrQuotient = iqrQuotient[ match(unique(f), names(iqrQuotient)) ] iqrQuotient = sapply(X = iqrQuotient, FUN = quantile, probs = c(0.25, 0.75), na.rm = TRUE) iqrQuotient = iqrQuotient[ "75%" , ] / iqrQuotient[ "25%" , ] geneLength = table(f) geneLength = geneLength[ match(unique(f), names(geneLength)) ] arh = as.numeric(iqrQuotient * (log2(geneLength) - entropy)) names(arh) = unique(f) # filter for at least two finite expression values # (1) no splicing in genes with only one exon # (2) infinite values lead to extreme splicing indication, hard to interprete good = split(is.finite(x) & is.finite(y) & is.finite(splicingProbabilities), f = f) good = sapply(good, function(X) return(sum(X) >= 2)) good = good[ match(unique(f), names(good)) ] arh[ !good ] = NA return(arh) } probe_map<-annot.map[annot.map %in% row.names(d.gene.fil)] keep<-names(probe_map)#去除没有对应基因的外显子 d.exon.fil3<-d.exon.fil2[row.names(d.exon.fil2) %in% keep,] probe_int<-2^(d.exon.fil3) probe_map<-cbind(names(probe_map),probe_map) colnames(probe_map)<-c("exon_id","gene_id") Entropy_treat = split(x = probe_int[ , chip_treat ], f = probe_map[ , "exon_id" ]) Entropy_treat = Entropy_treat[ match(x = unique(probe_map[ , "exon_id" ]), table = names(Entropy_treat)) ] Entropy_treat = sapply(Entropy_treat, median, na.rm = TRUE) Entropy_ctrl = split(x = probe_int[ , chip_ctrl ], f = probe_map[ , "exon_id" ]) Entropy_ctrl = Entropy_ctrl[ match(x = unique(probe_map[ , "exon_id" ]), table = names(Entropy_ctrl)) ] Entropy_ctrl = sapply(Entropy_ctrl, median, na.rm = TRUE) ARH = ARH(x = Entropy_treat, y = Entropy_ctrl, f = probe_map[ !duplicated(probe_map[ , "exon_id" ]) , "gene_id" ], na.rm = TRUE) ARH.res<-cbind(probe_int,arh=ARH[match(annot.map[rownames(probe_int)],names(ARH))]) ARH.res<-ARH.res[ARH.res[,"arh"]>0.03,] ARH.res<-ARH.res[order(-ARH.res[,"arh"]),]
您好!之前您写的primeview.db帮助非常大,非常感谢!
于此同时,我发现 在运用nsFilter时 似乎不能载入primeview的环境,不知如何解决,特向您请教。谢谢!!
请问一个粗浅的问题:Affymetrix 外显子(exon)芯片与传统的3‘芯片相比,有没有太多的优势?为什么现今仍有很多研究者使用传统的3‘芯片?为什么传统的3‘芯片没有被淘汰?
这个问题很难回答啊。我感觉原因有以下几点:
1,设计不同。3‘IVT芯片(基因芯片)的对象是基因3’末端(覆盖大约3‘端600bp),每个基因只有1至2组探针组(probeset)。所以它的样品一般是以polyA为引物产生的cDNA。而exon array(外显子芯片)覆盖整个基因的所有外显子,其样品是mRNA,而后是以不同引物(而不是单一的polyA)逆转录产生的cDNA。所以基因芯片的研究对象只能是基因表达水平,而外显子芯片则可能是transcripts水平。要注意的是,基因芯片面对同一基因不同结尾时,会区别对待。
2,实验步骤不同。正如前面提到的,它们主要的区别在于逆转录的引物。
3,探针组成不同。基因芯片针的探针组由11PM和11MM探针组成,而外显子芯片则是4PM。平均每个基因,基因芯片有16个探针,而外显子芯片会有147个探针。虽然探针的总数相对于基因芯片来说提高了,但是其精确程度却很依赖相互验证,这在实验水平来说很难保证。这就意味着对其的分析会更难。
4,成本不同。这是最主要人们在研究基因表达水平时倾向使用前者的原因。
收益非浅,衷心感谢热心回复
您好!请问一个问题,我在使用oligo时,输入“affyRaw<-read.celfiles(celFiles)”,为什么总是提示:Warning message:package ‘pd.hugene.1.0.st.v1’ was built under R version 3.2.0 ?
这是版本冲突提示。有些时候可能会导致错误的结果。
您好,老师。首先,谢谢您的博文,内容非常好!
请问在“确定alternative splicing events”这一步中“使用SI/LIMMA”的代码里首行里的 annot.map 之前未定义过,在我运行时会提示错误,这应该怎么改正呢?
现在对于SI/LIMMA中的laternative splicing events的处理,不再使用SI了,请自己查看limma::diffSplice函数,对你会有帮助的。
想请问老师,那个自动化处理流程的密码是什么啊?
已发送,请查收。
请问这一步使用APT获取dabg值如何操作啊?谢谢。
这个我帮不到你了。我已经都忘记怎么操作的了。