Star-index-align_2.5.3.a

Please work through the tutorial and add your comments at the bottom of this page, or send comments by email to kchougul@cshl.edu. Thank you.

Rationale and background:

 

STAR: ultrafast universal RNA-seq aligner

Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras

doi: 10.1093/bioinformatics/bts635

 

Spliced Transcripts Alignment to a Reference (STAR) is a highly cited splice-aware RNA-seq aligner whose main advantage over other aligners is its alignment speed. Its algorithm performs a sequential maximum mappable seed search in uncompressed suffix arrays, followed by a seed clustering and stitching procedure. STAR can be run in two-pass mode to improve splice junction accuracy, i.e. the splice junction loci found in the first pass are supplied to the second mapping pass. STAR also works well with long reads, with accuracy comparable to BLAT, which is commonly used to map long reads. The STAR mapping workflow has two steps: generating the genome index files, and then mapping the reads against the genome. This app performs both steps: it builds the index and aligns the reads against the reference genome.
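The two-step workflow described above can be sketched as a pair of STAR command lines. This is only an illustration, not the app's actual wrapper script: the function names, the "index" directory, the "sample1." prefix, and the thread count are assumptions made for the example, and the commands are only built and printed, not executed.

```python
import shlex


def build_index_cmd(genome_dir, fasta, gtf, threads=4):
    """Step 1: generate the genome index files."""
    return [
        "STAR", "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
    ]


def build_align_cmd(genome_dir, reads, prefix, threads=4, two_pass=False):
    """Step 2: map the reads against the indexed genome."""
    cmd = [
        "STAR", "--runMode", "alignReads",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *reads,
        "--outFileNamePrefix", prefix,
    ]
    # Gzip-compressed FASTQ files need a decompression command.
    if any(name.endswith(".gz") for name in reads):
        cmd += ["--readFilesCommand", "zcat"]
    # Two-pass mode feeds junctions found in the first pass into the second.
    if two_pass:
        cmd += ["--twopassMode", "Basic"]
    return cmd


# Hypothetical paths, modelled on the names used in this tutorial.
reads = ["reads/sample1.1.fastq.gz", "reads/sample1.2.fastq.gz"]
index_cmd = build_index_cmd("index", "reference.fasta", "reference.gtf")
align_cmd = build_align_cmd("index", reads, "sample1.", two_pass=True)

print(shlex.join(index_cmd))
print(shlex.join(align_cmd))
```

Building the commands as argument lists keeps the index step and the alignment step clearly separated and avoids shell quoting problems if they are later passed to subprocess.run.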


Version: 2.5.3.a


Pre-Requisites

  1. A CyVerse account. (Register for a CyVerse account at user.cyverse.org)
  2. Mandatory arguments
    1. Genome reference sequence file name (in FASTA format)
    2. Genome reference annotation file name (in GTF format)
    3. FASTQ files (PE or SE reads)
    4. File type (paired: PE, or single: SE)
  3. Optional arguments
    1. Output BAM sorting: SortedByCoordinate (sorts the BAM file by coordinate, which is useful for downstream analysis)
    2. Output quantification method: output SAM/BAM alignments to transcriptome (type of quantification requested)
    3. Compatibility with Cufflinks and StringTie: set this to 0 for compatibility
    4. Maximum number of multiple alignments allowed for a read: 20
    5. Minimum overhang for unannotated junctions: 8
    6. Minimum overhang for annotated junctions: 1
    7. Maximum number of mismatches per pair: 999
    8. Minimum intron length: 20
    9. Maximum intron length: 1000000
    10. Maximum genomic distance between mates: 1000000
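For readers who know the STAR command line, the optional arguments above appear to correspond to the STAR parameters in the sketch below. The mapping is an assumption based on the STAR manual, not something taken from the app's source, and the names DEFAULT_OPTIONS and options_to_args are hypothetical.

```python
# Assumed mapping of the app's optional arguments to STAR parameters,
# with the default values listed above.
DEFAULT_OPTIONS = {
    "--outSAMtype": ["BAM", "SortedByCoordinate"],  # output BAM sorting
    "--quantMode": ["TranscriptomeSAM"],            # alignments to transcriptome
    "--outSAMattrIHstart": [0],                     # Cufflinks/StringTie compatibility
    "--outFilterMultimapNmax": [20],                # max multiple alignments per read
    "--alignSJoverhangMin": [8],                    # min overhang, unannotated junctions
    "--alignSJDBoverhangMin": [1],                  # min overhang, annotated junctions
    "--outFilterMismatchNmax": [999],               # max mismatches per pair
    "--alignIntronMin": [20],                       # min intron length
    "--alignIntronMax": [1000000],                  # max intron length
    "--alignMatesGapMax": [1000000],                # max genomic distance between mates
}


def options_to_args(overrides=None):
    """Flatten the defaults, updated with any overrides, into an argument list."""
    merged = {**DEFAULT_OPTIONS, **(overrides or {})}
    args = []
    for flag, values in merged.items():
        args.append(flag)
        args.extend(str(value) for value in values)
    return args


# Example: shorten the maximum intron length for a compact genome.
args = options_to_args({"--alignIntronMax": [500000]})
print(" ".join(args))
```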

Test/sample data

The following test data for Star-index-align_2.5.3.a are provided in /iplant/home/shared/iplantcollaborative/example_data/Star/STAR-2.5.2:

  1. Reference genome file: reference.fasta
  2. Reference GTF file: reference.gtf
  3. Directory of FASTQ files (fastq, fq, gz, or bz2):
    1. reads/sample1.1.fastq.gz, reads/sample1.2.fastq.gz

Run Star-index-align_2.5.3.a on the FASTQ files using the reference files.
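When a directory holds paired-end reads for several samples, the mate files have to be grouped per sample before alignment. The sketch below shows one way to do this, assuming the naming convention of the test data (sample name, then mate number 1 or 2, then the extension). Both the convention and the helper group_mates are assumptions for illustration; the app may pair files differently, and single-end files without a mate number are not handled here.

```python
import re
from collections import defaultdict

# Assumed file name layout: <sample>.<mate>.<fastq|fq>[.gz|.bz2]
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+?)\.(?P<mate>[12])\.(?:fastq|fq)(?:\.(?:gz|bz2))?$"
)


def group_mates(filenames):
    """Group FASTQ paths by sample name, ordered mate 1 then mate 2."""
    samples = defaultdict(dict)
    for path in filenames:
        base = path.rsplit("/", 1)[-1]
        match = FASTQ_PATTERN.match(base)
        if match is None:
            continue  # skip files that do not follow the assumed convention
        samples[match.group("sample")][int(match.group("mate"))] = path
    return {
        sample: [mates[number] for number in sorted(mates)]
        for sample, mates in samples.items()
    }


grouped = group_mates([
    "reads/sample1.2.fastq.gz",
    "reads/sample1.1.fastq.gz",
    "reads/notes.txt",
])
print(grouped)
```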

Results 

A successful run of Star-index-align_2.5.3.a produces several files and directories:

  • index: STAR genome indices

  • bam_output: directory containing only the BAM files for all samples
    • sample1.Aligned.sortedByCoord.out.bam
  • output: output for each individual sample
  • STAR_output: default output files from STAR, which include:

    • Log.out: main log file with detailed information about the run. This file is most useful for troubleshooting and debugging.

    • sample1.Log.progress.out: reports job progress statistics, such as the number of processed reads and the percentage of mapped reads.

    • sample1.Log.final.out: summary mapping statistics after the mapping job is complete; very useful for quality control.

    • sample1.Aligned.out.bam: alignments in BAM format (the binary form of the standard SAM format).

    • sample1.SJ.out.tab: high-confidence collapsed splice junctions in tab-delimited format.

    • sample1.Unmapped.out.mate1: unmapped and partially mapped reads (i.e. only one mate of a paired-end read mapped), mate 1.
    • sample1.Unmapped.out.mate2: unmapped and partially mapped reads (i.e. only one mate of a paired-end read mapped), mate 2.
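For quality control across many samples, the summary statistics in Log.final.out can be read programmatically. The sketch below assumes the usual layout of that file, in which each statistic is written as a name and a value separated by a "|" character. The helper parse_final_log is hypothetical, and the numbers in SAMPLE_LOG are invented placeholders, not results from the test data.

```python
def parse_final_log(text):
    """Return a dict of statistic name -> value (as strings)."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


# Invented placeholder values in the assumed "name |<tab>value" layout.
SAMPLE_LOG = """\
                          Number of input reads |\t1000
                                   UNIQUE READS:
                   Uniquely mapped reads number |\t900
                        Uniquely mapped reads % |\t90.00%
"""

stats = parse_final_log(SAMPLE_LOG)
unique_pct = float(stats["Uniquely mapped reads %"].rstrip("%"))
print(stats["Number of input reads"], unique_pct)
```

To use it on a real run, read the file first, for example with open("sample1.Log.final.out").read(), and pass the text to parse_final_log.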


More information on the tool can be found at https://github.com/alexdobin/STAR