既然已经可以从利用API成功调取数据了,我们应该开始自己的学习过程了。据以前的体会,学习一门语言,必须带着问题去学才能有效果。于是在学习前,为自己留下一个作业,作业题目是,
假设现在有两比较样本的microarray芯片的资料需要分析,资料以列表的形式表示,列表示例如下:
1_Signal | 1_Detection | 2_Signal | 2_Detection | Descriptions | |
1771550_at | 66.8 | P | 67.2 | A | S. cerevisiae YAL064W-B GEN=SEO1 DB_XREF=GI:6319253 SEG=NC_001133:+12047,12427 DEF=Hypothetical ORF NOTE=Yal064w-bp; go_component: cellular_component unknown [goid GO:0008372] [evidence ND]; go_function: molecular_function unknown [goid GO:0005554] [evidence ND]; go_process: biological_process unknown [goid GO:0000004] [evidence ND] |
1772356_at | 23.8 | A | 13.6 | A | S. cerevisiae YAL064C-A GEN=SEO1 DB_XREF=GI:7839146 SEG=NC_001133:-13364,13744 DEF=Hypothetical ORF NOTE=Yal064c-ap; go_component: cellular_component unknown [goid GO:0008372] [evidence ND]; go_function: molecular_function unknown [goid GO:0005554] [evidence ND]; go_process: biological_process unknown [goid GO:0000004] [evidence ND] |
1772528_at | 7.3 | A | 3.2 | A | S. cerevisiae YAL064W GEN=SEO1 DB_XREF=GI:6319254 SEG=NC_001133:+21526,21852 DEF=Hypothetical ORF NOTE=Yal064wp; go_component: cellular_component unknown [goid GO:0008372] [evidence ND]; go_function: molecular_function unknown [goid GO:0005554] [evidence ND]; go_process: biological_process unknown [goid GO:0000004] [evidence ND] |
1776321_at | 120.9 | P | 40.3 | A | S. cerevisiae YAL063C-A GEN=SEO1 DB_XREF=GI:33438755 SEG=NC_001133:-22397,22687 DEF=Identified by expression profiling and mass spectrometry NOTE=Yal063c-ap; go_component: cellular_component unknown [goid GO:0008372] [evidence ND]; go_function: molecular_function unknown [goid GO:0005554] [evidence ND]; go_process: biological_process unknown [goid GO:0000004] [evidence ND] |
1779679_s_at | 197.5 | P | 204.6 | P | S. cerevisiae YAL063C GEN=FLO9 DB_XREF=GI:6319255 SEG=NC_001133:-24001,27969 DEF=Lectin-like protein with similarity to Flo1p, thought to be expressed and involved in flocculation NOTE=Flo9p; go_component: cell wall (sensu Fungi) [goid GO:0009277] [evidence ISS] [pmid 7502576]; go_function: mannose binding [goid GO:0005537] [evidence ISS] [pmid 7502576]; go_process: flocculation (sensu Saccharomyces) [goid GO:0000501] [evidence IEP,ISS] [pmid 7502576] |
列表中第一列为microarray中的序列编号,第二,四列为信号强度,第三,五列信号可识别度,P为良好,A为差,第六列为信息说明。需要比较两次实验中信号良好的两信号差别,差异显著的,使用KEGG的代谢算途径库图象化表示,并提供统计表,清晰描绘两次实验的差异。
先把作业分析一下,大约是要先逐行读取文件,然后依据表格分格符将数据分割成六列,当第三与第五列均为P时比较第二,四列信号强度差距,差别在一个数量级以上时,从最后一列中读取 DB_XREF后的数据库号,通过它在KEGG中查找相应的代谢信息,做形象化处理。
今天要完成的就是数据的读取工作。其余的工作以后几课再做。
设文件名为$filename=”c\TPS1NS.xls”;
程序代码如下:
#!/usr/bin/perl #使用SOAP::Lite use SOAP::Lite; #为SOAP准备wsdl参数 $wsdl = 'http://soap.genome.jp/KEGG.wsdl'; #定义指针提供KEGG API接口 $soap_ser = SOAP::Lite->service($wsdl); #定义文件名(包括完整路径) $filename = "C:\\TPS1NS.xls"; #打开文件,如果打不开中止程序。 open(MICROARRAY, $filename)||die ("Could not open file"); #统计有多少基因上调,有多少基因下调 $up_reg = 0; $down_reg = 0; #逐行读取文件 while($line = <MICROARRAY>){ #文件列表分格符为制表符 $pattern = "\x09"; #将读取的一行数据按定义好的分格符分割成字串存在words为名的数组里。 @words = split(/$pattern/, $line); #选择信号显著的数据进行比较 if ($words[2]=~/P/ && $words[4]=~ /P/){ #计算数值的数量级差别 $ratio=log($words[1])-log($words[3]); #当数量级差别大于等于1时,对数据进行处理 if(abs($ratio)>=1){ #统计有多少基因上调或者下调表达 if($ratio>=1){ $up_reg += 1; } if($ratio<=-1){ $down_reg +=1; } #查找字符串以DB_XREF=开头,以空格结束的字串, #中间的部分,前面的字母将存在$1变量中,后面的数字将存在$2变量中。 if($words[5]=~/DB_XREF=(\D+)(\d+)\s/){ #因为变量$1,$2的作用域有限,所以将它们先存起来。 $db_pre=$1; $db_id=$2; #查找microarray当中注释数据库的前缀,不同的生物使用的数据库不同。本例中使用的是NCBI GI数据。 if($db_pre=~/GI/){ #将外部数据库ID转换成KEGG数据库ID $kegg_id = $soap_ser->bconv("ncbi-gi:$db_id"); #输出结果 print $kegg_id; } } } } } #关闭文件 close(MICROARRAY); #输出统计结果 print $up_reg." genes up regulated. And ".$down_reg." genes down regulated."; print "\nThis is the end!"; |
运行这个程序,看到输出结果为:
ncbi-gi:6319269 sce:YAL047C equivalent
ncbi-gi:6319323 sce:YAR009C equivalent
ncbi-gi:6319369 sce:YBL100W-B equivalent
ncbi-gi:6319605 sce:YBR129C equivalent
ncbi-gi:7839147 sce:YCR024C-A equivalent
ncbi-gi:6319961 sce:YDL240W equivalent
ncbi-gi:6320437 sce:YDR231C equivalent
ncbi-gi:7839158 sce:YDR261W-B equivalent
ncbi-gi:6320531 sce:YDR324C equivalent
ncbi-gi:6320656 sce:YDR448W equivalent
ncbi-gi:6320676 sce:YDR468C equivalent
ncbi-gi:6320819 sce:YEL017C-A equivalent
ncbi-gi:6320844 sce:YER007C-A equivalent
ncbi-gi:6320986 sce:YER138W-A equivalent
ncbi-gi:14318454 sce:YFL065C equivalent
ncbi-gi:14318511 sce:YFL010W-A equivalent
ncbi-gi:6321196 sce:YGL241W equivalent
ncbi-gi:41629691 sce:YGL201C equivalent
ncbi-gi:6321656 sce:YGR217W equivalent
ncbi-gi:6321739 sce:YHL048W equivalent
ncbi-gi:6321833 sce:YHR043C equivalent
ncbi-gi:33438808 sce:YHR175W-A equivalent
ncbi-gi:6322648 sce:YKL201C equivalent
ncbi-gi:6322758 sce:YKL092C equivalent
ncbi-gi:6323371 sce:YLR340W equivalent
ncbi-gi:33438853 sce:YMR030W-A equivalent
ncbi-gi:6323957 sce:YMR299C equivalent
ncbi-gi:6324623 sce:YOR049C equivalent
ncbi-gi:6324767 sce:YOR193W equivalent
ncbi-gi:6324948 sce:YOR372C equivalent
31 genes up regulated. And 0 genes down regulated.
This is the end!
ncbi-gi:6319323 sce:YAR009C equivalent
ncbi-gi:6319369 sce:YBL100W-B equivalent
ncbi-gi:6319605 sce:YBR129C equivalent
ncbi-gi:7839147 sce:YCR024C-A equivalent
ncbi-gi:6319961 sce:YDL240W equivalent
ncbi-gi:6320437 sce:YDR231C equivalent
ncbi-gi:7839158 sce:YDR261W-B equivalent
ncbi-gi:6320531 sce:YDR324C equivalent
ncbi-gi:6320656 sce:YDR448W equivalent
ncbi-gi:6320676 sce:YDR468C equivalent
ncbi-gi:6320819 sce:YEL017C-A equivalent
ncbi-gi:6320844 sce:YER007C-A equivalent
ncbi-gi:6320986 sce:YER138W-A equivalent
ncbi-gi:14318454 sce:YFL065C equivalent
ncbi-gi:14318511 sce:YFL010W-A equivalent
ncbi-gi:6321196 sce:YGL241W equivalent
ncbi-gi:41629691 sce:YGL201C equivalent
ncbi-gi:6321656 sce:YGR217W equivalent
ncbi-gi:6321739 sce:YHL048W equivalent
ncbi-gi:6321833 sce:YHR043C equivalent
ncbi-gi:33438808 sce:YHR175W-A equivalent
ncbi-gi:6322648 sce:YKL201C equivalent
ncbi-gi:6322758 sce:YKL092C equivalent
ncbi-gi:6323371 sce:YLR340W equivalent
ncbi-gi:33438853 sce:YMR030W-A equivalent
ncbi-gi:6323957 sce:YMR299C equivalent
ncbi-gi:6324623 sce:YOR049C equivalent
ncbi-gi:6324767 sce:YOR193W equivalent
ncbi-gi:6324948 sce:YOR372C equivalent
31 genes up regulated. And 0 genes down regulated.
This is the end!
基本上,今天的任务完成了。