R) Generate transcript lists

Generate transcript lists (app: Cut Columns)

Description: The Cut Columns app combines selected columns from a table or tab-delimited file into a new table. Here, Cut Columns is used to generate separate lists of transcript names for up-regulated and for down-regulated transcripts. Documentation: http://www.gnu.org/software/coreutils/manual/html_node/cut-invocation.html.

Log into the Discovery Environment: https://de.iplantcollaborative.org/de/.
Rename the output files from the previous section (Separate transcripts by type) so that each file has a different name (Sample data: Rename the files to numeric_filter_out_up.txt for the up-regulated transcripts and numeric_filter_out_down.txt for the down-regulated transcripts).
Open the Cut Columns app (Public Applications > General Utilities > Text and Tabular Data > Cut Columns).
1. Change 'Analysis Name' to Generate_Transcript_List_Up, add a 'Description' (optional), and use the default 'output folder'.
Click on the Input data tab.
1. Select the 'Select a tabular text data file' field. Enter the renamed output file from the previous section (Separate transcripts by type) for the up-regulated transcripts (Sample data: Community Data > iplant_training > rna-seq_without_genome > R_generate_transcript_lists > numeric_filter_out_up.txt).
Click on the Options tab.
1. In the 'Indicate the delimiter for the data file' field, leave the delimiter set to 'Tab'.
2. In the 'Enter comma-separated list of columns to extract (ie. c1,c3)' field, enter c2.
Click on "Launch Analysis".
Repeat this analysis with the renamed output file for the down-regulated transcripts (Sample data: Community Data > iplant_training > rna-seq_without_genome > R_generate_transcript_lists > numeric_filter_out_down.txt), changing the 'Analysis Name' to Generate_Transcript_List_Down.
Click on 'Analyses' from the DE workspace and monitor the 'Status' of the analysis (e.g., Idle, Submitted, Pending, Running, Completed, Failed).
1. Once launched, an analysis will continue whether the user remains logged in or not.
2. Email notifications report the progress of the analysis; they can be switched off under 'Preferences'.
3. If the analysis fails or does not proceed in the anticipated timeline, check these tips for troubleshooting. (Using the sample data, the analysis should be complete in less than 5 minutes.)
4. To re-run an analysis, click the analysis "App" in the 'Analyses' window.
Access analysis results in one of two ways:
1. In the 'Analyses' window click on the analysis "Name" to open the output folder.
2. In the 'Data' window, click on user name, then navigate to the folder that holds the output of the analysis. (Find the output for the sample at Community Data > iplant_training > rna-seq_without_genome > R_generate_transcript_lists > output_from_sample_data > cutWrapper_out.txt.)
The output files contain a list of either up-regulated or down-regulated transcripts.
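
To illustrate what the Options settings in this procedure do, the following is a minimal Python sketch of pulling one column out of a tab-delimited table. It is not the Cut Columns app's implementation; the function name cut_columns and the sample rows are hypothetical, and the placement of transcript names in column 2 is assumed from the c2 setting used here.

```python
def cut_columns(lines, columns, delimiter="\t"):
    """Return the requested 1-based columns from each delimited line."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # Mirror the c1, c2, ... numbering: column 1 is fields[0].
        # Simplification: columns past the end of a line are skipped.
        picked = [fields[c - 1] for c in columns if c <= len(fields)]
        out.append(delimiter.join(picked))
    return out


# Hypothetical rows; the real input would be numeric_filter_out_up.txt.
sample = [
    "1\ttranscript_A\t2.4\n",
    "2\ttranscript_B\t3.1\n",
]
transcript_names = cut_columns(sample, [2])
print("\n".join(transcript_names))
```

Passing [1, 3] instead of [2] would correspond to the c1,c3 example shown in the field label.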

Next Steps

Alternative analysis

Section Q
If the researcher wants further annotation for the transcripts that are up-regulated by a factor of 2 or more, they can repeat the numeric evaluation step (Section Q) with the threshold set to greater than or equal to 2. This alternative output can then be used to create a list of transcript names with Cut Columns, and that list can be used with the peptide FASTA file of coded peptides in the Select contigs App to produce a peptide sequence file for the 2x up-regulated transcripts. The selected peptide sequences can then be submitted to InterProScan, an annotation tool, with the InterProScan App (Public Applications > High-Performance Computing > InterProScan 5.8.49.0).
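
The selection step described here can be pictured with a short Python sketch that keeps only the FASTA records whose identifiers appear in a list of names. This is an illustration under assumptions, not the Select contigs App's actual behavior: the function names, the sample records, and the choice to match on the first whitespace-delimited word of each header are all hypothetical.

```python
def _record_id(header):
    """Return the first word of a FASTA header, or '' if the header is empty."""
    parts = header.split()
    return parts[0] if parts else ""


def select_fasta_records(fasta_lines, wanted_names):
    """Yield (header, sequence) for records whose ID is in wanted_names."""
    wanted = set(wanted_names)
    header, seq_parts = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record; emit it if wanted.
            if header is not None and _record_id(header) in wanted:
                yield header, "".join(seq_parts)
            header, seq_parts = line[1:], []
        elif line and header is not None:
            seq_parts.append(line)
    # Emit the final record, which has no following header to close it.
    if header is not None and _record_id(header) in wanted:
        yield header, "".join(seq_parts)


# Hypothetical peptide records and name list.
fasta = [">pep1 len=8", "MKTAY", "IAK", ">pep2", "MSDNE", ">pep3", "MLLKV"]
selected = list(select_fasta_records(fasta, ["pep1", "pep3"]))
for header, sequence in selected:
    print(f">{header}\n{sequence}")
```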

Section R
To create lists of mapped refseq_protein IDs for the up-regulated and down-regulated transcripts, repeat this step with the delimiter set to Pipe and enter c4 for the columns to extract. Remember that the resulting IDs are for the closest matching protein from another species represented in the refseq_protein database.
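
As a rough illustration of why a Pipe delimiter with c4 can isolate a protein ID, the Python sketch below splits one line on the pipe character and takes the fourth field. The sample line is hypothetical: it assumes the ID column has the form gi|number|ref|accession|, which is inferred from the c4 setting and not confirmed by the files themselves. The function name extract_field is likewise made up for this example.

```python
def extract_field(line, column, delimiter="|"):
    """Return the 1-based delimited field, or None if the line is too short."""
    fields = line.rstrip("\n").split(delimiter)
    return fields[column - 1] if column <= len(fields) else None


# Hypothetical row; the pipe-delimited layout of the ID is an assumption.
row = "transcript_A\tgi|0000000|ref|XP_0000000.1|\t2.4"
protein_id = extract_field(row, 4)
print(protein_id)
```

Because the split is applied to the whole line, any tab-separated text before the first pipe ends up in field 1, so the accession still lands in field 4.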

Annotations

Gene Ontology
Additional steps for this analysis could include taking a list of up-regulated or down-regulated refseq protein IDs and using them for Gene Ontology (GO) annotation and analysis. This may be easier to perform if the organism can be mapped against one main reference genome sequence (e.g. D. melanogaster) as opposed to a mixture of protein IDs from diverse species.

Blastp
Alternative annotation methods can be used to determine a putative identity or function for the assembled transcripts. Blastp could be used for peptide-to-peptide mapping against the refseq_protein database, but it is slower than Blat, even though blastp is multithreaded and Blat runs on a single thread. A tutorial for running blastp is available here.

Visualization

Atmosphere
RNA-Seq data can be visualized using iPlant's Atmosphere service. In place of a reference genome, use the transcriptome that was generated in previous sections.

CoGe (Documentation: https://genomevolution.org/CoGe/)
Requires uploading a transcriptome in the place of a reference genome.