Star-index-align_2.5.3.a
Rationale and background:
STAR: ultrafast universal RNA-seq aligner
Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras
doi: 10.1093/bioinformatics/bts635
Spliced Transcripts Alignment to a Reference (STAR) is a highly cited splice-aware aligner. It outperforms other aligners in alignment speed. Its algorithm uses a sequential maximum mappable seed search in uncompressed suffix arrays, followed by a seed clustering and stitching procedure. STAR can be run in two-pass mode to improve splice junction accuracy, i.e. the splice loci found in the first pass are supplied to the second mapping pass. STAR also works well with long reads and has accuracy comparable to BLAT, which is used to map long reads. The STAR mapping workflow involves two steps: generating the genome index files and then mapping the reads against the genome. This app performs both the indexing and the alignment of reads against the reference genome.
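The two-step workflow described above can be sketched as a pair of STAR command lines. The Python below only assembles the argument lists and does not run STAR. The flags are standard STAR options, but the helper names, the thread count, and the `index` directory name are assumptions made for illustration, and the app's own wrapper may be organised differently.

```python
from typing import List, Sequence


def genome_generate_cmd(fasta: str, gtf: str, index_dir: str,
                        threads: int = 4) -> List[str]:
    """Step 1: build the genome index from a FASTA reference and a GTF annotation."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", index_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        "--runThreadN", str(threads),
    ]


def align_cmd(index_dir: str, reads: Sequence[str], prefix: str,
              threads: int = 4) -> List[str]:
    """Step 2: map one sample (one file for SE, two for PE) against the index."""
    if len(reads) not in (1, 2):
        raise ValueError("expected 1 (SE) or 2 (PE) FASTQ files")
    cmd = [
        "STAR",
        "--genomeDir", index_dir,
        "--readFilesIn", *reads,
        "--outFileNamePrefix", prefix,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--runThreadN", str(threads),
    ]
    # gzip-compressed input is decompressed on the fly with zcat;
    # other compression formats are not handled in this sketch.
    if all(r.endswith(".gz") for r in reads):
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


# Print the two commands for a paired-end sample (file names are illustrative).
print(" ".join(genome_generate_cmd("reference.fasta", "reference.gtf", "index")))
print(" ".join(align_cmd(
    "index",
    ["reads/sample1.1.fastq.gz", "reads/sample1.2.fastq.gz"],
    "sample1.",
)))
```

To actually run a step, the returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where STAR is installed.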
Version: 2.5.3.a
Pre-Requisites
- A CyVerse account. (Register for a CyVerse account at user.cyverse.org)
- Mandatory arguments
- Genome reference sequence file name (in FASTA format)
- Genome reference annotation file name (in GTF format)
- FASTQ files (PE or SE reads)
- File type (paired: PE, or single: SE)
- Optional arguments
- Output BAM sorting: SortedByCoordinate (sorts the BAM file by coordinate, which is useful for downstream analysis)
- Output quantification method: output SAM/BAM alignments to transcriptome (the type of quantification requested)
- Compatibility with Cufflinks and StringTie: set this to 0 for compatibility
- Max number of multiple alignments allowed for a read: 20
- Minimum overhang for unannotated junctions: 8
- Minimum overhang for annotated junctions: 1
- Maximum number of mismatches per pair: 999
- Minimum intron length: 20
- Maximum intron length: 1000000
- Maximum genomic distance between mates: 1000000
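The numeric defaults listed above correspond closely to options documented in the STAR manual. The sketch below shows one way such settings could be turned into command-line arguments. The flag names are real STAR options, but the pairing of each app argument with a flag is an assumption based on the argument descriptions, and `optional_args` is a hypothetical helper, not part of the app. The Cufflinks/StringTie compatibility switch is left out because the source does not say which STAR option it controls.

```python
from typing import Dict, List, Optional

# Assumed correspondence between the app's optional arguments and STAR flags.
DEFAULTS: Dict[str, int] = {
    "--outFilterMultimapNmax": 20,    # max multiple alignments allowed for a read
    "--alignSJoverhangMin": 8,        # min overhang, unannotated junctions
    "--alignSJDBoverhangMin": 1,      # min overhang, annotated junctions
    "--outFilterMismatchNmax": 999,   # max mismatches per pair
    "--alignIntronMin": 20,           # min intron length
    "--alignIntronMax": 1000000,      # max intron length
    "--alignMatesGapMax": 1000000,    # max genomic distance between mates
}


def optional_args(overrides: Optional[Dict[str, int]] = None) -> List[str]:
    """Merge user overrides into the defaults and flatten to an argument list."""
    settings = dict(DEFAULTS)
    for flag, value in (overrides or {}).items():
        if flag not in DEFAULTS:
            raise KeyError(f"unknown option: {flag}")
        settings[flag] = value
    # Basic sanity check on the intron length bounds.
    if settings["--alignIntronMin"] > settings["--alignIntronMax"]:
        raise ValueError("minimum intron length exceeds maximum intron length")
    args: List[str] = []
    for flag, value in settings.items():
        args += [flag, str(value)]
    return args


# Example: shorten the maximum intron length, keep every other default.
print(" ".join(optional_args({"--alignIntronMax": 500000})))
```

The resulting list can be appended to an alignment command; rejecting unknown flags keeps a typo from silently falling back to a default.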
The following test data are provided for testing Star-index-align_2.5.3.a at /iplant/home/shared/iplantcollaborative/example_data/Star/STAR-2.5.2:
- Reference genome file: reference.fasta
- Reference GTF file: reference.gtf
- Directory of FASTQ files (fastq, fq, gz, or bz2):
- reads/sample1.1.fastq.gz reads/sample1.2.fastq.gz
Run Star-index-align_2.5.3.a on FASTQ files using reference files.
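When a directory holds paired-end files for several samples, the two mates of each sample have to be matched before alignment. The sketch below groups files by sample name, assuming the `<sample>.1.fastq.gz` / `<sample>.2.fastq.gz` naming seen in the test data. Both the naming pattern and the `group_mates` helper are assumptions for illustration, and how the app itself pairs files is not stated here. Files that do not follow the pattern, including single-end files without a mate number, are skipped.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed naming convention: <sample>.<mate>.<fastq|fq>[.gz|.bz2]
MATE_PATTERN = re.compile(
    r"^(?P<sample>.+)\.(?P<mate>[12])\.(?:fastq|fq)(?:\.(?:gz|bz2))?$"
)


def group_mates(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Return {sample: [mate1_path, mate2_path]} for files matching the pattern."""
    groups: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in filenames:
        base = name.rsplit("/", 1)[-1]  # drop any directory part
        match = MATE_PATTERN.match(base)
        if match is None:
            continue  # does not follow the assumed convention
        groups[match.group("sample")][match.group("mate")] = name
    # Order mates as 1 then 2, and samples alphabetically.
    return {
        sample: [mates[key] for key in sorted(mates)]
        for sample, mates in sorted(groups.items())
    }


pairs = group_mates(["reads/sample1.2.fastq.gz", "reads/sample1.1.fastq.gz"])
for sample, files in pairs.items():
    print(sample, files)
```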
Results
A successful run of Star-index-align_2.5.3.a produces several files and directories:
- index: STAR genome indices
- bam_output: directory containing only the BAM files for all samples
- sample1.Aligned.sortedByCoord.out.bam
- output: output for each individual sample
- STAR_output: default output files from STAR, which include:
- Log.out: main log file with detailed information about the run. This file is most useful for troubleshooting and debugging.
- sample1.Log.progress.out: reports job progress statistics, such as the number of processed reads and the percentage of mapped reads.
- sample1.Log.final.out: summary mapping statistics after the mapping job is complete; very useful for quality control.
- sample1.Aligned.out.bam: alignments in BAM format (the binary form of standard SAM).
- sample1.SJ.out.tab: high-confidence collapsed splice junctions in tab-delimited format.
- sample1.Unmapped.out.mate1: mate 1 of unmapped and partially mapped reads (i.e. paired-end reads where only one mate mapped), written to a separate file
- sample1.Unmapped.out.mate2: mate 2 of unmapped and partially mapped reads (i.e. paired-end reads where only one mate mapped), written to a separate file
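For quality control, the summary statistics in Log.final.out can be read programmatically. The sketch below assumes the usual layout of that file, in which each statistic is written as a label, a `|` separator, and a value, while section headings carry no separator. The function names are hypothetical, and the `Uniquely mapped reads %` label is the one STAR normally writes; check it against your own file before relying on it.

```python
from typing import Dict


def parse_log_final(text: str) -> Dict[str, str]:
    """Collect 'label | value' lines from the contents of a Log.final.out file."""
    stats: Dict[str, str] = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # blank line or section heading
        label, value = line.split("|", 1)
        stats[label.strip()] = value.strip()
    return stats


def percent(stats: Dict[str, str], label: str) -> float:
    """Convert a percentage field such as '90.00%' to a float."""
    return float(stats[label].rstrip("%"))


def passes_qc(stats: Dict[str, str], min_unique_pct: float = 70.0) -> bool:
    """Flag a sample whose uniquely mapped fraction falls below a chosen cutoff.

    The 70% default is an arbitrary illustration, not a recommended threshold.
    """
    return percent(stats, "Uniquely mapped reads %") >= min_unique_pct
```

A typical use would be `passes_qc(parse_log_final(open("sample1.Log.final.out").read()))`, with the cutoff chosen to suit the experiment.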
More information on the tool can be found here - https://github.com/alexdobin/STAR