Picard-MarkDuplicates-2.7.1 in the Discovery Environment

Alert:

 

The iPlant App Store is currently being restructured, and apps are being moved to an HPC environment. During this transition, users may occasionally be unable to locate or use apps that are listed in our tutorials. In many cases, these apps can be found by using the search bar at the top of the Apps window in the DE. To improve the chance of a successful search, search for only the portion of the name that refers to the app's function or origin rather than the full app name and version number (e.g. 'SOAPdenovo' instead of 'SOAPdenovo-Trans 1.01'). In critical cases, please report your concern to the iPlant Ask forum or to support@iplantcollaborative.org. Thank you for your patience.

The DE Quick Start tutorial provides an introduction to basic DE functionality and navigation.

Please work through the tutorial and add your comments at the bottom of this page, or send them by email to upendra@cyverse.org. Thank you.

Rationale and background:

Picard-MarkDuplicates-2.7.1 identifies duplicate reads. This app locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation, e.g. during library construction using PCR. See also EstimateLibraryComplexity for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster that is incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.

The Picard-MarkDuplicates-2.7.1 app works by comparing sequences in the 5' positions of both reads and read pairs in a SAM/BAM file. After duplicate reads are collected, the tool differentiates the primary and duplicate reads using an algorithm that ranks reads by the sums of their base-quality scores (default method).

The app's main output is a new SAM or BAM file in which duplicates have been identified in the SAM flags field of each read. Duplicates are marked with the hexadecimal value 0x0400, which corresponds to a decimal value of 1024. If you are not familiar with this type of annotation, please see the following blog post for additional information.
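The duplicate flag described above is a single bit in the SAM FLAG field, so it can be tested with a bitwise AND. The following is a minimal Python sketch of that check; the function names are hypothetical helpers written for this illustration and are not part of Picard or the DE.

```python
# The SAM "duplicate" bit: hexadecimal 0x0400, decimal 1024.
DUPLICATE_FLAG = 0x0400


def is_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return (flag & DUPLICATE_FLAG) != 0


def mark_duplicate(flag: int) -> int:
    """Return the FLAG value with the duplicate bit switched on."""
    return flag | DUPLICATE_FLAG


def flag_from_sam_line(line: str) -> int:
    """Extract FLAG, the second tab-separated column of a SAM alignment line."""
    return int(line.split("\t")[1])
```

For instance, a read with FLAG 99 (paired, properly paired, mate on the reverse strand, first in pair) would carry FLAG 1123 once it is marked as a duplicate, since 99 + 1024 = 1123.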

Picard-MarkDuplicates-2.7.1 also produces a metrics file indicating the numbers of duplicates for both single- and paired-end reads.

The program can take either coordinate-sorted or query-sorted inputs, but the behavior differs slightly. When the input is coordinate-sorted, unmapped mates of mapped records and supplementary/secondary alignments are not marked as duplicates. When the input is query-sorted (actually query-grouped), unmapped mates and secondary/supplementary reads are not excluded from the duplication test and can be marked as duplicate reads.


Pre-Requisites

  1. A CyVerse account. (Register for a CyVerse account here - user.cyverse.org)
  2. Mandatory arguments 
    1. Input file: One or more input SAM or BAM files to analyze. Must be coordinate sorted. 
    2. Output file: The output file to write marked records.
    3. Metrics file: File to write duplication metrics to.
  3. Parameters
    1. Validation stringency: Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT. This option can be set to 'null' to clear the default value. Possible values: {STRICT, LENIENT, SILENT}
    2. Remove Duplicates: If true, duplicates are not written to the output file; otherwise they are written with the appropriate flags set. Default value: false.

Test with sample data

Test data for Picard-MarkDuplicates-2.7.1 are provided here: /iplant/home/shared/iplantcollaborative/example_data/picard/MarkDuplicates. Run Picard-MarkDuplicates-2.7.1 with the following inputs and settings:

  1. Input file: sample.sorted.bam
  2. Output file: dedupped_out.bam
  3. Metrics file: marked_dup_metrics.txt
  4. Validation stringency: STRICT
  5. Remove Duplicates: True
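For readers who want to see how these settings correspond to a Picard command line outside the DE, the sketch below assembles an equivalent MarkDuplicates invocation in Python. It assumes a local Java installation and a Picard jar at the hypothetical path `picard.jar`; the helper function is written for this illustration only, and the DE app may assemble its command differently.

```python
import shlex

VALID_STRINGENCIES = {"STRICT", "LENIENT", "SILENT"}


def build_markduplicates_command(input_file, output_file, metrics_file,
                                 validation_stringency="STRICT",
                                 remove_duplicates=False,
                                 picard_jar="picard.jar"):
    """Build the argument list for a Picard MarkDuplicates run.

    picard_jar is an assumed location; adjust it to your installation.
    """
    if validation_stringency not in VALID_STRINGENCIES:
        raise ValueError(f"Unknown validation stringency: {validation_stringency}")
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"INPUT={input_file}",
        f"OUTPUT={output_file}",
        f"METRICS_FILE={metrics_file}",
        f"VALIDATION_STRINGENCY={validation_stringency}",
        f"REMOVE_DUPLICATES={'true' if remove_duplicates else 'false'}",
    ]


# Same values as the sample-data run listed above.
command = build_markduplicates_command(
    "sample.sorted.bam",
    "dedupped_out.bam",
    "marked_dup_metrics.txt",
    validation_stringency="STRICT",
    remove_duplicates=True,
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to run the tool, provided that Java and Picard are available on the machine.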

Results 

The outputs from Picard-MarkDuplicates-2.7.1 are dedupped_out.bam and marked_dup_metrics.txt.
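The metrics file is plain text and can be inspected programmatically. The sketch below assumes the usual Picard metrics layout, in which a line starting with `## METRICS CLASS` is followed by a tab-separated header row and one data row per library; the function name is a hypothetical helper, and the column names are read from the file itself rather than hard-coded.

```python
from typing import Dict, List


def parse_duplication_metrics(text: str) -> List[Dict[str, str]]:
    """Return one dict per data row of the metrics table in a Picard metrics file.

    Assumes the table starts on the line after '## METRICS CLASS' and ends
    at the first blank line or the next line starting with '#'.
    """
    lines = text.splitlines()
    rows: List[Dict[str, str]] = []
    for i, line in enumerate(lines):
        if not line.startswith("## METRICS CLASS"):
            continue
        if i + 1 >= len(lines):
            break  # no header row follows, so there is nothing to parse
        header = lines[i + 1].split("\t")
        for data in lines[i + 2:]:
            if not data.strip() or data.startswith("#"):
                break
            rows.append(dict(zip(header, data.split("\t"))))
        break
    return rows


def read_duplication_metrics(path: str) -> List[Dict[str, str]]:
    """Read a metrics file from disk and parse its metrics table."""
    with open(path) as handle:
        return parse_duplication_metrics(handle.read())
```

Calling `read_duplication_metrics("marked_dup_metrics.txt")` on the downloaded output would return the table rows as dictionaries keyed by column name, with all values kept as strings.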

More information on Picard-MarkDuplicates-2.7.1 can be found in this manual.