Fastq-screen-0.11.1

Please work through the documentation and add your comments at the bottom of this page, or email comments to support@cyverse.org. Thank you.


FastQ Screen is a simple application that lets you search a large sequence dataset against a panel of different genomes to determine where the sequences in your data originate. It was built as a QC check for sequencing pipelines but may also be useful for characterizing metagenomic samples. When running a sequencing pipeline, it is useful to know that your sequencing runs contain the types of sequence they are supposed to. Your search libraries might contain the genomes of all the organisms you work on, along with PhiX, vectors, or other contaminants commonly seen in sequencing experiments.

Although the program wasn't built with any particular technology in mind, it is probably only suitable for processing short reads, because it uses Bowtie, Bowtie2, or BWA as the search application.

The program generates both text and graphical output showing what proportion of your library mapped, either uniquely or to more than one location, against each of your specified reference genomes. You should therefore be able to identify a clean sequencing experiment, in which the overwhelming majority of reads are probably derived from a single genomic origin.

Test Data

All files are located in the Community Data directory of the iPlant Discovery Environment at the following path: Community Data > iplantcollaborative > example_data > fastq_screen > Inputs (/iplant/home/shared/iplantcollaborative/example_data/fastq_screen/Inputs/)

 

Input File(s)

Reads - fqs_test_dataset.fastq.gz

Reference Genomes - E_coli.fa, Mus_musculus.GRCm38.dna.chromosome.1.fa, lambda_virus.fa

Parameters

  • Aligner - BOWTIE2
  • Subset - 100000
  • The rest as defaults

Output - Output

Aligner: Specify the aligner to use for the mapping. Valid arguments are 'bowtie', 'bowtie2' or 'bwa'.

illumina1_3 : Assume that the quality values are encoded in Illumina v1.3 format. Defaults to Sanger format if this flag is not specified.

outdir: Specify a directory in which to save output files. If no directory is specified then output files are saved into the same directory as the input file.

subset <int> : Don't use the whole sequence file; instead, create a temporary dataset of the specified number of reads. The dataset created will be approximately (within a factor of 2) this size. If the real dataset is smaller than twice the specified size, the whole dataset is used. Subsets are taken evenly from throughout the original dataset. By default, FastQ Screen runs with this parameter set to 100000. To process an entire dataset, set --subset to 0.
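
The subset behaviour described above can be illustrated with a short Python sketch. This is not FastQ Screen's own code: the function name pick_subset_indices and the integer-step approach are assumptions chosen to reproduce the documented behaviour (evenly spaced reads, whole dataset when it is smaller than twice the requested size, and 0 meaning "use everything").

```python
def pick_subset_indices(total_reads: int, subset: int) -> range:
    """Return the 0-based read indices an evenly spaced subset would keep.

    Hypothetical helper for illustration only; the real tool may differ.
    """
    # A subset of 0 means "process the entire dataset".
    if subset <= 0:
        return range(total_reads)
    # Datasets smaller than twice the requested size are used whole.
    if total_reads < 2 * subset:
        return range(total_reads)
    # Keep every step-th read. Because the step is an integer, the result
    # holds at least `subset` reads and fewer than twice that number.
    step = total_reads // subset
    return range(0, total_reads, step)


if __name__ == "__main__":
    for total in (150_000, 250_000, 1_000_000):
        kept = pick_subset_indices(total, 100_000)
        print(f"{total} reads in file, {len(kept)} reads kept")
```

The integer step is why the resulting dataset is only approximately the requested size: a file of 250000 reads with a subset of 100000 would keep every second read under this sketch, giving 125000 reads.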

Output File(s)

  • Expect the following as outputs (in addition to the logs generated for all analyses)
    • Output folder that contains three files
      • fqs_test_dataset_screen.html
      • fqs_test_dataset_screen.png
      • fqs_test_dataset_screen.txt
  • fqs_test_dataset_screen.png

  • fqs_test_dataset_screen.txt

#Fastq_screen version: 0.11.1	#Aligner: bowtie2	#Reads in subset: 100000
Genome	#Reads_processed	#Unmapped	%Unmapped	#One_hit_one_genome	%One_hit_one_genome	#Multiple_hits_one_genome	%Multiple_hits_one_genome	#One_hit_multiple_genomes	%One_hit_multiple_genomes	Multiple_hits_multiple_genomes	%Multiple_hits_multiple_genomes
Mus_musculus.GRCm38.dna.chromosome.1	1000	739	73.90	73	7.30	188	18.80	0	0.00	0	0.00
E_coli	1000	1000	100.00	0	0.00	0	0.00	0	0.00	0	0.00
lambda_virus	1000	1000	100.00	0	0.00	0	0.00	0	0.00	0	0.00
%Hit_no_genomes: 73.90
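
If you want to post-process the text report, a minimal Python sketch along the following lines may help. It assumes the tab-separated layout shown above (a "#" comment line, a header row starting with "Genome", one row per genome, and a final "%Hit_no_genomes" line). The names parse_screen_report, mapped_percent and SAMPLE are hypothetical and are not part of FastQ Screen.

```python
def parse_screen_report(text):
    """Parse a FastQ Screen text report into ({genome: {column: value}}, hit_no_genomes)."""
    genomes = {}
    hit_no_genomes = None
    columns = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the leading "#Fastq_screen version ..." line.
        if not line or line.startswith("#"):
            continue
        if line.startswith("%Hit_no_genomes:"):
            hit_no_genomes = float(line.split(":", 1)[1])
            continue
        fields = line.split("\t")
        if fields[0] == "Genome":
            columns = fields[1:]  # header row gives the column names
            continue
        if columns is None:
            continue  # ignore anything that appears before the header row
        genomes[fields[0]] = dict(zip(columns, (float(v) for v in fields[1:])))
    return genomes, hit_no_genomes


def mapped_percent(row):
    """Percentage of processed reads that mapped to this genome at least once."""
    return 100.0 - row["%Unmapped"]


# The example report from this page, rebuilt with explicit tab separators.
HEADER = [
    "Genome", "#Reads_processed", "#Unmapped", "%Unmapped",
    "#One_hit_one_genome", "%One_hit_one_genome",
    "#Multiple_hits_one_genome", "%Multiple_hits_one_genome",
    "#One_hit_multiple_genomes", "%One_hit_multiple_genomes",
    "Multiple_hits_multiple_genomes", "%Multiple_hits_multiple_genomes",
]
ROWS = [
    ["Mus_musculus.GRCm38.dna.chromosome.1", "1000", "739", "73.90", "73",
     "7.30", "188", "18.80", "0", "0.00", "0", "0.00"],
    ["E_coli", "1000", "1000", "100.00", "0", "0.00", "0", "0.00",
     "0", "0.00", "0", "0.00"],
    ["lambda_virus", "1000", "1000", "100.00", "0", "0.00", "0", "0.00",
     "0", "0.00", "0", "0.00"],
]
SAMPLE = "\n".join(
    ["#Fastq_screen version: 0.11.1\t#Aligner: bowtie2\t#Reads in subset: 100000"]
    + ["\t".join(HEADER)]
    + ["\t".join(row) for row in ROWS]
    + ["%Hit_no_genomes: 73.90"]
)

if __name__ == "__main__":
    genomes, no_hit = parse_screen_report(SAMPLE)
    for name, row in genomes.items():
        print(f"{name}: {mapped_percent(row):.2f}% of reads mapped")
    print(f"Reads hitting no genome: {no_hit:.2f}%")
```

To use it on a real run, read the file instead of SAMPLE, for example parse_screen_report(open("fqs_test_dataset_screen.txt").read()).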

Tool Source for App