C) Identify coding sequences

Identify coding sequences (app: Transcript decoder 1.0)

Description: TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed from RNA-Seq alignments to the genome using TopHat and Cufflinks. Documentation: http://transdecoder.github.io/.
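To make "candidate coding region" concrete, the sketch below scans a transcript in all six reading frames for stretches that begin with ATG and end at an in-frame stop codon. It is a simplified illustration only, not TransDecoder's actual algorithm, which applies further criteria when ranking candidates. The function names and the `min_codons` parameter are hypothetical and chosen for this example.

```python
# Simplified illustration only: this is NOT TransDecoder's algorithm.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Yield (strand, start, end, orf) for ATG...stop ORFs in all six frames.

    start and end are 0-based, end-exclusive offsets on the scanned strand.
    min_codons counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # offset of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None


for hit in find_orfs("CCATGAAACCCGGGTAGCC"):
    print(hit)
```

Raising `min_codons` discards short ORFs, which in real transcripts often arise by chance.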

Log into the Discovery Environment: https://de.iplantcollaborative.org/de/.
Open the Transcript decoder 1.0 app (Public Applications > NGS > Assembly Annotation > Transcript decoder 1.0).
1. Change 'Analysis Name' to Identify_Coding_Sequences, add a 'Description' (optional), and use the default 'output folder'.
Click on the Main settings tab.
1. Click on the 'transcript file input' field. Browse to the folder that holds the FASTA file containing the contig sequences (Sample data: Community Data > iplant_training > rna-seq_without_genome > C_identify_coding_sequences > BAtranscriptome_reduced.fa). Select the file, then click on OK.
2. The rest of the settings can be left at their default values.
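When using your own transcripts instead of the sample data, it can help to check the FASTA file before uploading it. The following is a minimal sketch in plain Python; the helper names `read_fasta` and `check_fasta` are hypothetical, and the checks shown (record count, duplicate identifiers, records without sequence) are assumptions about what is worth verifying, not requirements stated by the app.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Identifier = header text up to the first whitespace.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def check_fasta(lines):
    """Summarize basic problems in a FASTA file (hypothetical helper)."""
    records = list(read_fasta(lines))
    counts = Counter(name for name, _ in records)
    return {
        "n_records": len(records),
        "duplicate_ids": sorted(n for n, c in counts.items() if c > 1),
        "empty_ids": sorted(n for n, seq in records if not seq),
    }


example = [">a first contig", "ACGT", "AC", ">b", "", ">a", "GG"]
print(check_fasta(example))
```

To check a real file, pass an open file handle instead of a list, for example `check_fasta(open("BAtranscriptome_reduced.fa"))` for a local copy of the sample data.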
Click on "Launch Analysis".
Click on 'Analyses' from the DE workspace and monitor the 'Status' of the analysis (e.g., Idle, Submitted, Pending, Running, Completed, Failed).
1. Once launched, an analysis will continue whether the user remains logged in or not.
2. Email notifications report on the progress of the analysis; they can be switched off under 'Preferences'.
3. If the analysis fails or does not proceed in the anticipated timeline, check these tips for troubleshooting. (Using the sample data, the analysis should be complete in less than 15 minutes.)
4. To re-run an analysis, click the analysis "App" in the 'Analyses' window.
Access analysis results in one of two ways:
1. In the 'Analyses' window click on the analysis "Name" to open the output folder.
2. In the 'Data' window, click on your user name, then navigate to the folder that holds the output of the analysis. (Find the output for the sample at Community Data > iplant_training > rna-seq_without_genome > C_identify_coding_sequences > output_from_sample_data.)
The Transcript decoder output files include the coding sequences found in the transcript sequences (best_candidates_eclipsed_orfs_removed.cds) and their matching peptide sequences (best_candidates_eclipsed_orfs_removed.pep). The output also includes .gff and .bed files that define the locations of the various transcripts and their exons within the transcript sequences. There are two versions of each of these files: one for all the coding sequences found, and another for the “best_candidates.eclipsed_orfs_removed” group, in which coding sequences found within larger coding sequences are removed as duplicates.
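The idea behind the "eclipsed ORFs removed" group can be sketched as an interval filter: an ORF is dropped when it lies entirely within a longer ORF on the same transcript. This is an illustration of the description above, not TransDecoder's own code; it ignores strand and reading frame, and the function name and tuple layout are hypothetical.

```python
def remove_eclipsed(orfs):
    """Drop ORFs that lie entirely within a longer ORF on the same transcript.

    orfs: list of (transcript_id, start, end) tuples with end-exclusive
    coordinates. Strand and reading frame are ignored in this sketch.
    """
    kept = []
    # Process longest first, so any ORF that could contain another
    # has already been considered when the shorter one is examined.
    for orf in sorted(orfs, key=lambda o: o[2] - o[1], reverse=True):
        tid, start, end = orf
        eclipsed = any(
            k_tid == tid and k_start <= start and end <= k_end
            for k_tid, k_start, k_end in kept
        )
        if not eclipsed:
            kept.append(orf)
    return sorted(kept)


candidates = [
    ("t1", 0, 300),
    ("t1", 30, 120),   # inside t1:0-300, so it is eclipsed
    ("t1", 250, 400),  # overlaps t1:0-300 but is not contained in it
    ("t2", 30, 120),   # same coordinates, different transcript
]
print(remove_eclipsed(candidates))
```

Partially overlapping ORFs are kept in this sketch; only full containment counts as eclipsed.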