B) Reduce transcript redundancy

Reduce transcript redundancy (app: CD-HIT-est 4.6.1)

Description: CD-HIT stands for Cluster Database at High Identity with Tolerance. The program takes a FASTA-format sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output. In addition, it writes a cluster file that documents the sequence 'groupies' for each nr representative sequence. The idea is to reduce the overall size of the transcriptome by removing only 'redundant' (highly similar) sequences, so that little sequence information is lost. This is why the resulting file is called non-redundant (nr). The original cd-hit program clusters protein sequences into closely related families; the cd-hit-est program used here applies the same approach to nucleotide sequences, grouping closely related transcripts. Documentation: http://weizhong-lab.ucsd.edu/cd-hit/.
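
To make the idea of representatives and 'groupies' concrete, here is a toy sketch of greedy clustering in Python: sequences are visited from longest to shortest, and each one either joins the first representative it resembles closely enough or becomes a new representative. This is an illustration only, not CD-HIT's implementation. The similarity measure (toy_identity, built on Python's difflib) is a crude stand-in chosen for brevity, and the contig names and sequences are made up.

```python
from difflib import SequenceMatcher


def toy_identity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity: matched characters / shorter length."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Map each representative ID to the IDs of all sequences in its cluster."""
    clusters: dict[str, list[str]] = {}
    # Longest sequences are visited first, so they become the representatives.
    for seq_id in sorted(seqs, key=lambda s: len(seqs[s]), reverse=True):
        for rep_id in clusters:
            if toy_identity(seqs[rep_id], seqs[seq_id]) >= threshold:
                clusters[rep_id].append(seq_id)  # redundant: join this cluster
                break
        else:
            clusters[seq_id] = [seq_id]  # no close match: new representative
    return clusters


# Hypothetical toy contigs, not taken from the sample data.
toy = {
    "contig_a": "ACGTACGTACGTACGTACGTAC",
    "contig_b": "ACGTACGTACGTACGTACGA",
    "contig_c": "GGGGGGGGGGCCCCCCCCCC",
}
clusters = greedy_cluster(toy, threshold=0.94)
for rep, members in clusters.items():
    print(rep, members)
```

In this toy run contig_b is absorbed into the cluster represented by contig_a, while contig_c is dissimilar and stays as its own representative. The keys of the result correspond to the nr sequence file, and the member lists correspond to the cluster file.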

Log into the Discovery Environment: https://de.iplantcollaborative.org/de/.
Open the CD-HIT-est 4.6.1 app (Public Applications > NGS > Assembly Annotation > CD-HIT-est 4.6.1).
1. Change 'Analysis Name' to Reduce_Transcript_Redundancy, add a 'Description' (optional), and use the default 'output folder'.
Click on the Settings tab.
1. Click in the 'Transcript fasta file' field. Browse to the folder that holds the FASTA file containing the contig sequences (Sample data: Community Data > iplant_training > rna-seq_without_genome > B_reduce_transcript_redundancy > BAtranscriptome_300andup.fa). Select the FASTA file, then click on OK.
2. Rename the 'Output file name' to 'BAtranscriptome_reduced.fa'.
3. Click in the 'Global sequence identity' field and enter 0.94.
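
For readers who also have the cd-hit-est command-line program available, the three settings above correspond roughly to its -i (input), -o (output), and -c (sequence identity threshold) options. The Python sketch below only builds and prints such a command; the helper name build_cd_hit_est_command is hypothetical, and the exact options the DE app passes behind the scenes are not documented here, so treat this as an approximation rather than the app's actual invocation.

```python
import shlex


def build_cd_hit_est_command(fasta_in: str, fasta_out: str, identity: float) -> list[str]:
    """Return the argument list for a basic cd-hit-est run (hypothetical helper)."""
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be a fraction in the range (0, 1]")
    return [
        "cd-hit-est",
        "-i", fasta_in,       # input FASTA file with the transcript sequences
        "-o", fasta_out,      # output FASTA file of representative sequences
        "-c", str(identity),  # sequence identity threshold
    ]


command = build_cd_hit_est_command(
    "BAtranscriptome_300andup.fa", "BAtranscriptome_reduced.fa", 0.94
)
print(shlex.join(command))
```

On a machine where cd-hit-est is installed, the list could be handed to subprocess.run(command, check=True); within the Discovery Environment this step is handled by the app itself.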
Click on "Launch Analysis".
Click on 'Analyses' from the DE workspace and monitor the 'Status' of the analysis (e.g., Idle, Submitted, Pending, Running, Completed, Failed).
1. Once launched, an analysis will continue whether the user remains logged in or not.
2. Email notifications report the progress of the analysis; they can be switched off under 'Preferences'.
3. If the analysis fails or does not finish within the expected time, check these tips for troubleshooting. (Using the sample data, the analysis should be complete in less than 5 minutes.)
4. To re-run an analysis, click the analysis "App" in the 'Analyses' window.
Access analysis results in one of two ways:
1. In the 'Analyses' window click on the analysis "Name" to open the output folder.
2. In the 'Data' window, click on your user name, then navigate to the folder that holds the output of the analysis. (Find the output for the sample at Community Data > iplant_training > rna-seq_without_genome > B_reduce_transcript_redundancy > output_from_sample_data.)
The output will include a new, consolidated transcript sequence file in FASTA format (BAtranscriptome_reduced.fa) and a cluster file that lists how the original transcript sequences were grouped. The consolidated sequences are now ready to be used as the transcriptome reference for RNA-Seq studies.
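
Once the output files are downloaded, a short script can summarize the result. The Python sketch below counts FASTA records and reads cluster groupings. It assumes the cluster file uses the usual CD-HIT .clstr text layout (a '>Cluster N' line followed by one line per member, with the representative marked by a trailing '*'). The function names and the inline sample are illustrative, not taken from the sample data.

```python
import re

# Captures the sequence name between ">" and "..." on a member line.
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def count_fasta_records(lines) -> int:
    """Count FASTA records by counting header lines."""
    return sum(1 for line in lines if line.startswith(">"))


def parse_clstr(lines) -> dict[str, list[str]]:
    """Map each representative name to all sequence names in its cluster."""
    groups = []  # one [representative, members] pair per cluster
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            groups.append([None, []])
        elif line and groups:
            match = MEMBER_RE.search(line)
            if match:
                name = match.group(1)
                groups[-1][1].append(name)
                if line.endswith("*"):  # "*" marks the representative
                    groups[-1][0] = name
    return {rep: members for rep, members in groups if rep is not None}


# Made-up sample in the assumed .clstr layout.
sample_clstr = """\
>Cluster 0
0\t22nt, >contig_a... *
1\t20nt, >contig_b... at +/95.00%
>Cluster 1
0\t20nt, >contig_c... *
""".splitlines()

groupings = parse_clstr(sample_clstr)
for rep, members in groupings.items():
    print(f"{rep}: {len(members)} sequence(s)")
```

With real files, the same functions can be given open file handles, for example count_fasta_records(open("BAtranscriptome_reduced.fa")). Comparing that count with the count for the input file shows how many transcripts were merged into other clusters.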