Analyzing arbitrary FASTQ files with Illumina CASAVA

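In an earlier post I described how to run sequence alignment with Illumina CASAVA. Now suppose the problem changes: you want to analyze any fastq, fasta, export, or qseq file directly, as you would with an ordinary command, without needing FASTQ files converted from *.bcl plus the auxiliary files that come with them. What should you do? If you simply follow the earlier post, CASAVA will complain that it cannot find DemultiplexedBustardSummary.xml and will refuse to continue. This is where the ELAND_standalone.pl program comes in.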

Running it is simple. The command is:

Path/to/CASAVA1.8/bin/ELAND_standalone.pl -if read1.fastq -if read2.fastq -ref /Path/to/reference/Genomes/E_coli_ELAND

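As you can see, ELAND_standalone.pl needs only two parameters to run:

  1. --input-file <input file>, -if <input file>  The input file (FASTQ in this example)
  2. --ref-sequences <path to genome dir>, -ref <path to genome dir>  The directory containing the reference genome files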
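To make the two required parameters concrete, here is a minimal sketch of a helper that assembles such a command line and prints it for inspection instead of running it. The function name `build_eland_cmd` and the `ELAND_BIN` variable are hypothetical names introduced for illustration; only the `-if` and `-ref` options come from the text above. It assumes paths without spaces.

```shell
# Hypothetical install location; point this at your own CASAVA directory.
ELAND_BIN="${ELAND_BIN:-Path/to/CASAVA1.8/bin/ELAND_standalone.pl}"

# Print (do not run) a minimal ELAND_standalone.pl command line.
# Usage: build_eland_cmd <reference dir> <read file> [<read file> ...]
# The body runs in a subshell so its variables do not leak to the caller.
build_eland_cmd() (
    if [ "$#" -lt 2 ]; then
        echo "usage: build_eland_cmd <reference dir> <read file>..." >&2
        exit 1
    fi
    ref="$1"
    shift
    cmd="$ELAND_BIN"
    # One -if option per input file, in the order given.
    for f in "$@"; do
        cmd="$cmd -if $f"
    done
    printf '%s -ref %s\n' "$cmd" "$ref"
)
```

Calling `build_eland_cmd /Path/to/reference/Genomes/E_coli_ELAND read1.fastq read2.fastq` prints a command of the same shape as the one shown earlier, which you can check before pasting it into your terminal.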

Beyond these two, many more parameters control exactly how ELAND_standalone.pl runs.

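  • --bam  Output BAM
  • --base-quality <value>, -bq <value>  Base quality value; the default is 30
  • --copy-references, -cr  Copy the reference files to the output directory. Use this option when your reference directory is not writable
  • --force  Overwrite files of the same name in the output directory.
  • --input-type <input format>, -it <input format>  Input file format; one of FASTQ, FASTA, export, or qseq
  • --log <path to log>, -l <path to log>  Log file name; the default is ELAND_standalone.log
  • --output-directory <output dir>, -od <output dir>  Output directory
  • --output-prefix <prefix>, -op <prefix>  Prefix for output file names; the default is reanalysis
  • --kagu-options <"options">, -ko <"options">  Options passed to the paired-read analysis. Quote them when passing more than one
  • --remove-temps, -rt  If the run succeeds, delete everything except the report files, BAM files, and log files
  • --seed-length <value>, -sl <value>  Seed length. It should generally be shorter than the read length and no more than 32 (the eland_ms help text gives the valid range as 8 to 32). It is usually set to 11.
  • --use-bases <value>, -ub <value>  Mask describing which bases of each read to use. The default is Y*n. See the earlier post for details
  • --help, -h  Print this help text.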
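Since `-it` accepts one of four format names, a small helper can pick the value from a file name. This is only a sketch: the function name `guess_input_type` is hypothetical, and the file name patterns (`.fastq`/`.fq`, `.fasta`/`.fa`, `_export.txt`, `_qseq.txt`) are assumptions based on common naming habits, not rules that CASAVA enforces.

```shell
# Guess a value for -it/--input-type from a file name.
# The name patterns below are assumptions, not rules enforced by CASAVA.
guess_input_type() {
    case "$1" in
        *.fastq|*.fq)  echo FASTQ ;;
        *.fasta|*.fa)  echo FASTA ;;
        *_export.txt)  echo export ;;
        *_qseq.txt)    echo qseq ;;
        *)
            echo "cannot guess input type for: $1" >&2
            return 1
            ;;
    esac
}
```

If your files follow a different naming scheme, pass `-it` explicitly rather than relying on a guess like this.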

A complete example:

Path/to/CASAVA_v1.8.0/bin/ELAND_standalone.pl \
        -if Path/to/Unaligned/GM12878RNAseq1.fastq \
        -if Path/to/Unaligned/GM12878RNAseq2.fastq \
        -it FASTQ -od Path/to/casava_output/ -op GM12878 \
        -sl 11 -bq 34 -cr -ref /Path/to/Genomes/hg19_fa/
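Before starting a paired-read run like the one above, it can be worth checking that both FASTQ files hold the same number of reads. The sketch below is not part of CASAVA; the function names are hypothetical, and it assumes uncompressed FASTQ with exactly four lines per record and a trailing newline at the end of the file.

```shell
# Count reads in an uncompressed FASTQ file, assuming four lines per record.
fastq_read_count() {
    lines=$(wc -l < "$1") || return 1
    echo $(( lines / 4 ))
}

# Succeed only if both files hold the same number of reads.
same_read_count() {
    n1=$(fastq_read_count "$1") || return 1
    n2=$(fastq_read_count "$2") || return 1
    [ "$n1" -eq "$n2" ]
}
```

For example, `same_read_count read1.fastq read2.fastq || echo "read counts differ"` flags a mismatched pair before any alignment time is spent.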

If you already have converted reference files, you can call the eland_ms command directly.

/share/apps/CASAVA_v1.8.0/bin/eland_ms --oligo-length=11 --data-format fastq --lane 1 --read 1 --qseq-mask YYYY\
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYn --cluster-sets 001 --sample GM12878 --barcode NoIndex --base-calls-dir /scratch2/ouj/casava_output --genome-directory /scratch2/ouj/casava_output/references \
--output-file /scratch2/ouj/casava_output/GM12878_NoIndex_L001_R1_001_eland_extended.txt --multi 2>&1
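The mask passed to `--qseq-mask` in the command above is a run of `Y` characters followed by a single `n`. Typing it by hand is error-prone, so here is a small sketch that generates a mask of that shape for a given read length. The function name `make_read_mask` is hypothetical, and the rule "use every base except the last" is an assumption read off the example, not a requirement of eland_ms.

```shell
# Build a mask of (length - 1) 'Y' characters followed by one 'n'.
# Usage: make_read_mask <read length>
# The body runs in a subshell so its variables do not leak to the caller.
make_read_mask() (
    len="$1"
    # Reject empty or non-numeric input.
    case "$len" in
        ''|*[!0-9]*)
            echo "read length must be a positive integer" >&2
            exit 1
            ;;
    esac
    if [ "$len" -lt 1 ]; then
        echo "read length must be a positive integer" >&2
        exit 1
    fi
    mask=""
    i=1
    while [ "$i" -lt "$len" ]; do
        mask="${mask}Y"
        i=$(( i + 1 ))
    done
    printf '%sn\n' "$mask"
)
```

For instance, `make_read_mask 5` prints `YYYYn`; the output can be substituted into the command with `--qseq-mask "$(make_read_mask 5)"`.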

The help text is as follows:

Usage: eland_ms_0 oligoFile genomeDirectory outputFile[.vmf] [options]
   or: eland_ms_0 --qseq-source genomeDirectory outputFile[.vmf] [options] tile1 [tile2 [.. tileN]]
 
oligoFile - file or directory of files
  file type deduced from first character of each file:
  '>' - fasta format
  '#' - single molecule array format
  [AGCTNagctn] - raw sequence format
 
genomeDirectory - directory of genome files
  preprocessed to 2-bits-per-base format using squashGenome
 
outputFile - name of output file
  if name ends in '.vmf', use verbose match format,
  else use format required by assembly module
 
tile{1..N} - list of tiles to process
  (only used when reading qseq files)
 
Command line options:
  -h [ --help ]                         produce help message and exit
  --multi [=arg(=10)]                   [=N0[,N1,N2]]
                                        Output multiple hits per read. At most 
                                        N0,N1,N2 exact, 1-mismatch, 2-mismatch 
                                        hits per read.
  --repeat-file arg                     if given, points to a file containing 
                                        the list of repeats to exclude (must be
                                        ASCII and in alphabetical order)
  --ungapped                            output ungapped alignments instead of 
                                        gapped
  --singleseed                          do not use multiple seeds per read
  --debug                               write the multi files
  --sensitive                           increase sensitivity
  --lane arg                            lane number (only used when reading 
                                        qseq or bcl files)
  --read arg                            read number (only used when reading 
                                        qseq or bcl files)
  --tiles arg                           list of tiles (only used when reading 
                                        qseq or bcl files)
  --sample arg (=Sample)                sample name (for use with 
                                        --data-format=fastq
  --barcode arg (=empty)                barcode (for use with 
                                        --data-format=fastq
  --cluster-sets arg                    list of decimal cluster set numbers 
                                        (for use with --data-format=fastq
  --instrument-name arg (=unknown-instrument)
                                        instrument name to use for the 
                                        identification of the sequences (bcl 
                                        input only)
  --run-number arg (=0)                 run-number to use for the 
                                        identification of the sequences (bcl 
                                        input only)
  --data-format arg (=bcl)              format of the input data (bcl, qseq, 
                                        fastq, fasta)
  --oligo-file arg                      file containing the data (only for 
                                        fastq and fasta format)
  --base-calls-dir arg (=.)             path to Base Calls directory
  --filter-directory arg                directory containing the filter files, 
                                        if different from the base calls 
                                        directory (only for bcl input)
  --positions-directory arg             directory containing the positions 
                                        files, if different from the parent of 
                                        the base calls directory (only for bcl 
                                        input)
  --positions-format arg (=locs)        format of the position files, either 
                                        'locs', 'clocs' or 'txt' (only for bcl 
                                        input)
  --output-file arg                     full path to the output file
  --tmp-file-prefix arg                 path (including the file name) to form 
                                        the temporary file paths. If 
                                        unspecified, eland will create unique 
                                        files in system temporary folder.
  --genome-directory arg                directory containing the squashed 
                                        reference files
  --cycles arg                          list of cycles to align (only for bcl 
                                        input)
  --qseq-mask arg                       conversion mask - 'Y' (or 'y'), 'N' (or
                                        'n') for 'use' or 'discard' 
                                        respectively (only used when reading 
                                        qseq files)
  --oligo-length arg                    Seed length. Valid range is [8-32]
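
The help text states that `--oligo-length` must lie in the range 8 to 32. A wrapper script could check this before launching a long run; the following is a minimal sketch, with `valid_oligo_length` being a hypothetical helper name rather than anything shipped with CASAVA.

```shell
# Succeed if the argument is an integer in the range 8..32,
# the range quoted in the eland_ms help text for --oligo-length.
valid_oligo_length() {
    # Reject empty or non-numeric input first.
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 8 ] && [ "$1" -le 32 ]
}
```

A typical use would be `valid_oligo_length "$seed" || { echo "seed length out of range" >&2; exit 1; }` near the top of the wrapper.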
