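In a previous post I covered how to run sequence alignment with Illumina CASAVA. But suppose you want to analyze an arbitrary fastq, fasta, export, or qseq file directly, the way you would run any ordinary command, without the fastq files converted from *.bcl and the auxiliary files that come with them. What then? If you simply follow the earlier post, CASAVA will complain that it cannot find DemultiplexedBustardSummary.xml and refuse to continue. This is where ELAND_standalone.pl comes in.
It is simple to run. The command is: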
Path/to/CASAVA1.8/bin/ELAND_standalone.pl -if read1.fastq -if read2.fastq -ref /Path/to/reference/Genomes/E_coli_ELAND
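As you can see, ELAND_standalone.pl needs only two parameters to run:
- --input-file <input file>, -if <input file> The input file, for example a fastq file. As in the command above, give the option once per read file.
- --ref-sequences <path to genome dir>, -ref <path to genome dir> The directory containing the reference genome files
Besides these two, many other parameters control how ELAND_standalone.pl runs.
- --bam Output BAM
- --base-quality <value>, -bq <value> Base quality value (used for input that carries no quality scores, such as FASTA). Default: 30
- --copy-references, -cr Copy the reference files to the output directory. Use this when your reference directory is not writable.
- --force Overwrite files of the same name in the output directory.
- --input-type <input format>, -it <input format> Input file format: FASTQ, FASTA, export, or qseq
- --log <path to log>, -l <path to log> Log file name. Default: ELAND_standalone.log
- --output-directory <output dir>, -od <output dir> Output directory
- --output-prefix <prefix>, -op <prefix> Prefix for output file names. Default: reanalysis
- --kagu-options <"options">, -ko <"option"> Options passed to the paired-read analysis (kagu). Quote them when passing more than one.
- --remove-temps, -rt On success, delete everything except the report files, the BAM files, and the log file
- --seed-length <value>, -sl <value> Seed length. It should be shorter than the read and must not exceed 32 (the eland_ms help text gives the valid range as 8 to 32). A common setting is 11.
- --use-bases <value>, -ub <value> Which bases of each read to use. Default: Y*n. See the previous post for details
- --help, -h Print this help text.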
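The seed-length rule above is easy to get wrong on the command line, so here is a minimal sketch of a check you could run before calling ELAND_standalone.pl. The function name check_seed_length is hypothetical and not part of CASAVA. The 8 to 32 range comes from the eland_ms help text, and "shorter than the read" follows the rule of thumb stated in the list.

```shell
# Hypothetical helper (not part of CASAVA): validate a value intended for -sl.
# Usage: check_seed_length <seed_length> <read_length>
# Returns 0 when the seed length looks acceptable, 1 otherwise.
check_seed_length() {
    seed=$1
    read_len=$2
    # Reject anything that is not a plain non-negative integer.
    case $seed in
        ''|*[!0-9]*) return 1 ;;
    esac
    case $read_len in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Valid range according to the eland_ms help text: [8-32].
    [ "$seed" -ge 8 ] && [ "$seed" -le 32 ] || return 1
    # Rule of thumb from the option list: seed shorter than the read.
    [ "$seed" -lt "$read_len" ]
}

if check_seed_length 11 34; then
    echo "seed length 11 is acceptable for 34-base reads"
fi
```

The check only looks at the numbers. It does not know what ELAND itself would accept for a particular data set.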
Full example:
Path/to/CASAVA_v1.8.0/bin/ELAND_standalone.pl -if Path/to/Unaligned/GM12878RNAseq1.fastq -if Path/to/Unaligned/GM12878RNAseq2.fastq -it FASTQ -od Path/to/casava_output/ -op GM12878 -sl 11 -bq 34 -cr -ref /Path/to/Genomes/hg19_fa/
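With this many options it helps to assemble the command in a script and look at it before running anything. The following is only a sketch. The function build_eland_cmd and the ELAND_BIN variable are names I made up for illustration, and the script prints the command rather than executing it. Only the option names (-if, -it, -od, -op, -ref) come from the list above. It assumes paths without spaces.

```shell
# Sketch of a dry-run wrapper. ELAND_BIN and build_eland_cmd are hypothetical
# names; adjust the path to wherever CASAVA is installed on your system.
ELAND_BIN=${ELAND_BIN:-Path/to/CASAVA_v1.8.0/bin/ELAND_standalone.pl}

build_eland_cmd() {
    # Usage: build_eland_cmd <ref_dir> <out_dir> <prefix> <fastq> [<fastq> ...]
    if [ "$#" -lt 4 ]; then
        echo "usage: build_eland_cmd <ref_dir> <out_dir> <prefix> <fastq>..." >&2
        return 1
    fi
    ref_dir=$1
    out_dir=$2
    prefix=$3
    shift 3
    cmd=$ELAND_BIN
    # One -if per input file, in the order given.
    for fq in "$@"; do
        cmd="$cmd -if $fq"
    done
    cmd="$cmd -it FASTQ -od $out_dir -op $prefix -ref $ref_dir"
    printf '%s\n' "$cmd"
}

# Print the command for review instead of running it.
build_eland_cmd /Path/to/Genomes/hg19_fa/ Path/to/casava_output/ GM12878 \
    Path/to/Unaligned/GM12878RNAseq1.fastq Path/to/Unaligned/GM12878RNAseq2.fastq
```

Once the printed line looks right, you can paste it into the terminal or redirect it into a job script.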
If you already have converted reference files, you can call eland_ms directly:
/share/apps/CASAVA_v1.8.0/bin/eland_ms --oligo-length=11 --data-format fastq --lane 1 --read 1 --qseq-mask YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYn --cluster-sets 001 --sample GM12878 --barcode NoIndex --base-calls-dir /scratch2/ouj/casava_output --genome-directory /scratch2/ouj/casava_output/references --output-file /scratch2/ouj/casava_output/GM12878_NoIndex_L001_R1_001_eland_extended.txt --multi 2>&1
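The long --qseq-mask value in this command looks like the Y*n pattern written out one character per cycle: Y for every base to use and a final n to discard the last base. Typing that by hand invites miscounting, so here is a small sketch that generates such a mask for a given read length. The function name expand_use_bases is hypothetical, and I am assuming the mask should have exactly one character per cycle.

```shell
# Hypothetical helper: write the Y*n pattern out explicitly for a read of
# <cycles> bases, i.e. (cycles - 1) times 'Y' followed by one 'n'.
expand_use_bases() {
    cycles=$1
    case $cycles in
        ''|*[!0-9]*|0)
            echo "cycles must be a positive integer" >&2
            return 1
            ;;
    esac
    mask=
    i=1
    # Append one 'Y' for every cycle except the last.
    while [ "$i" -lt "$cycles" ]; do
        mask="${mask}Y"
        i=$((i + 1))
    done
    # The last cycle is discarded.
    printf '%sn\n' "$mask"
}

expand_use_bases 5    # prints YYYYn
```

You could then pass the result as --qseq-mask "$(expand_use_bases 34)", replacing 34 with your own read length.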
The help text is as follows:
Usage: eland_ms_0 oligoFile genomeDirectory outputFile[.vmf] [options]
   or: eland_ms_0 --qseq-source genomeDirectory outputFile[.vmf] [options] tile1 [tile2 [.. tileN]]

oligoFile - file or directory of files
    file type deduced from first character of each file:
    '>' - fasta format
    '#' - single molecule array format
    [AGCTNagctn] - raw sequence format
genomeDirectory - directory of genome files preprocessed to 2-bits-per-base format using squashGenome
outputFile - name of output file
    if name ends in '.vmf', use verbose match format, else use format required by assembly module
tile{1..N} - list of tiles to process (only used when reading qseq files)

Command line options:
  -h [ --help ]  produce help message and exit
  --multi [=arg(=10)]  [=N0[,N1,N2]] Output multiple hits per read. At most N0,N1,N2 exact, 1-mismatch, 2-mismatch hits per read.
  --repeat-file arg  if given, points to a file containing the list of repeats to exclude (must be ASCII and in alphabetical order)
  --ungapped  output ungapped alignments instead of gapped
  --singleseed  do not use multiple seeds per read
  --debug  write the multi files
  --sensitive  increase sensitivity
  --lane arg  lane number (only used when reading qseq or bcl files)
  --read arg  read number (only used when reading qseq or bcl files)
  --tiles arg  list of tiles (only used when reading qseq or bcl files)
  --sample arg (=Sample)  sample name (for use with --data-format=fastq
  --barcode arg (=empty)  barcode (for use with --data-format=fastq
  --cluster-sets arg  list of decimal cluster set numbers (for use with --data-format=fastq
  --instrument-name arg (=unknown-instrument)  instrument name to use for the identification of the sequences (bcl input only)
  --run-number arg (=0)  run-number to use for the identification of the sequences (bcl input only)
  --data-format arg (=bcl)  format of the input data (bcl, qseq, fastq, fasta)
  --oligo-file arg  file containing the data (only for fastq and fasta format)
  --base-calls-dir arg (=.)  path to Base Calls directory
  --filter-directory arg  directory containing the filter files, if different from the base calls directory (only for bcl input)
  --positions-directory arg  directory containing the positions files, if different from the parent of the base calls directory (only for bcl input)
  --positions-format arg (=locs)  format of the position files, either 'locs', 'clocs' or 'txt' (only for bcl input)
  --output-file arg  full path to the output file
  --tmp-file-prefix arg  path (including the file name) to form the temporary file paths. If unspecified, eland will create unique files in system temporary folder.
  --genome-directory arg  directory containing the squashed reference files
  --cycles arg  list of cycles to align (only for bcl input)
  --qseq-mask arg  conversion mask - 'Y' (or 'y'), 'N' (or 'n') for 'use' or 'discard' respectively (only used when reading qseq files)
  --oligo-length arg  Seed length. Valid range is [8-32]
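The help text says that eland_ms deduces the type of an oligo file from its first character. The same idea can be sketched in a few lines of shell if you want to check a file yourself before choosing a value for -it or --data-format. The function guess_read_format is a hypothetical name. The '@' case for FASTQ is my own addition (FASTQ records begin with '@'). It is not listed in the help text.

```shell
# Hypothetical helper: guess the read file format from its first character,
# mirroring the rule described in the eland_ms help text.
guess_read_format() {
    # Read only the first byte of the file.
    first=$(head -c 1 "$1")
    case $first in
        '>') echo fasta ;;
        '#') echo single-molecule-array ;;
        '@') echo fastq ;;   # not in the help text; FASTQ records start with '@'
        [AGCTNagctn]) echo raw ;;
        *) echo unknown ;;
    esac
}

# Example call (read1.fastq is a placeholder file name):
# guess_read_format read1.fastq
```

This is only a first-character heuristic, so a malformed file can still pass. It is meant as a quick sanity check and does not replace setting the format explicitly.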