De novo transcriptome analysis for non-model organisms using Discovery Environment and Atmosphere


Abstract

This workflow allows novice researchers to leverage advanced computational resources such as cloud computing to carry out pairwise comparative transcriptomics. It also serves as a primer for biologists to develop data science skills, e.g. executing bash commands and visualizing and managing large data sets. All command line code and further explanations of each command or step can be found on the wiki (https://pods.iplantcollaborative.org/wiki/x/JgyEAQ). The Discovery Environment and Atmosphere platforms are connected through the CyVerse Data Store. As such, once the initial raw sequencing data has been uploaded, there is no further need to transfer large data files over an Internet connection, which minimizes the time needed to conduct analyses. This protocol is designed to analyze only two experimental treatments or conditions. Differential gene expression analysis is conducted through pairwise comparisons and is not suitable for testing multiple factors. This workflow is also designed to be manual rather than automated. Each step must be executed and investigated by the user, yielding a better understanding of the data and analytical outputs, and therefore better results. Once complete, this protocol yields de novo assembled transcriptome(s) for underserved (non-model) organisms without the need to map to previously assembled reference genomes (which are usually not available for underserved organisms). These de novo transcriptomes are then used in pairwise differential gene expression analysis to investigate genes that differ between two experimental conditions. Differentially expressed genes are then functionally annotated to understand the genetic response of organisms to the experimental conditions. In total, the data derived from this protocol are used to test hypotheses about biological responses of underserved organisms.

Protocol

Note: The overall protocol has been numbered according to the folders that will be created and named in step 1.2 (Figures 1 and 2). This protocol represents a standard comparative de novo transcriptome analysis, and not every step detailed here will be necessary for all researchers. This workflow is documented thoroughly on a companion tutorial wiki, which also contains all additional files and links to documents of interest (https://pods.iplantcollaborative.org/wiki/x/JgyEAQ). Additional training materials can also be found on the wiki (https://pods.cyverse.org/wiki/) or in the documentation written by third-party developers for each analysis package (Table 1). Links to this material are included throughout this protocol for easy access. Best practices, i.e. suggestions for the best way to accomplish a task or points for users to consider, are communicated through notes in the protocol. A folder of example data input and analytical output is publicly available to users, and is organized as suggested in the protocol (https://de.iplantcollaborative.org/de/?type=data&folder=/iplant/home/shared/Trinity_transdecoder_trinotate_databases). The example data will help researchers and educators learn and teach de novo transcriptome assembly and analysis.

Example Data

  1. Fungi
    1. Plant Pathogen
      1. Isolate 10 genotype
      2. Isolate 11 genotype
  2. Invertebrates
    1. Swimmy Invertebrates
      1. Control
      2. Treatment
    2. Stationary Invertebrates
      1. Population 1
      2. Population 2
      3. Population 3
  3. Plants

1. Set up the project, upload raw sequencing reads, and assess reads using FASTQC

1.1) Get access to Atmosphere and the Discovery Environment.

1.1.1) Request a free CyVerse account by navigating to the registration page (https://user.iplantcollaborative.org/register/). Use an institutional email address to register for the account (e.g. person@institution.edu).

1.1.2) Fill in the required information and submit. 

1.1.3) Navigate to the main webpage (http://www.cyverse.org/), and sign in at the top toolbar. 

1.1.4) Navigate to the Apps & Services tab, and request access to Atmosphere. Access to the Discovery Environment is automatically granted.

1.2) Set up the project and move data to the Data Store.

1.2.1) Log into the Discovery Environment (https://de.iplantcollaborative.org/de). Select the “Data” tab to bring up a menu containing all the folders in the Data Store. 

1.2.2) Create a main project folder that will house all of the data associated with the project. Find the toolbar at the top of the data window and select File | New Folder. Do not use spaces or special characters (e.g. “!@#()[]{}:;$%^&*”) in folder names or in any input/output file names. Instead, use underscores or dashes, i.e. “_” or “-”, where appropriate.
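
The naming rule above can be checked before anything is uploaded. The following is a minimal sketch, not part of the original protocol: the function name sanitize_name is hypothetical, and it assumes that letters, digits, periods, underscores, and dashes are the only characters worth keeping.

```shell
# Replace each run of characters other than letters, digits, ".", "_" or "-"
# with a single underscore, then trim underscores left at either end.
# sanitize_name is a hypothetical helper, not a CyVerse tool.
sanitize_name() {
  printf '%s\n' "$1" |
    sed -e 's/[^A-Za-z0-9._-]\{1,\}/_/g' -e 's/^_*//' -e 's/_*$//'
}

# Example: a name containing spaces, parentheses, and "#".
sanitize_name 'Plant Pathogen (isolate #10)'
```

Running the helper over local file names before upload avoids having to rename files one by one in the Data window afterwards.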

1.2.3) Create five folders within the main project folder to organize analyses (Figure 1). Name the folders as follows, without commas or quotation marks: “1_Raw_Sequence,” “2_High_Quality_Sequence,” “3_Assembly,” “4_Differential_Expression,” “5_Annotated_Assembly.” Subfolders will be placed into each of these main project folders (Figure 2). [Place Figure 1 here]. [Place Figure 2 here].
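
For users who prefer to stage data locally first, the same five-folder layout can be mirrored on the local computer. This is only a sketch under assumptions: make_project_layout and the project name My_Project are hypothetical, and the folders in the Data Store itself are still created through the DE as described (or, for iCommands users, with imkdir).

```shell
# Create the five numbered analysis folders under a local project directory.
# make_project_layout is a hypothetical helper name.
make_project_layout() {
  project_dir=$1
  for folder in 1_Raw_Sequence 2_High_Quality_Sequence 3_Assembly \
                4_Differential_Expression 5_Annotated_Assembly; do
    mkdir -p "$project_dir/$folder" || return 1
  done
}

# Example: build the layout in a scratch directory and list it.
scratch=$(mktemp -d)
make_project_layout "$scratch/My_Project"
ls "$scratch/My_Project"
```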

1.3) Upload raw FASTQ sequence files into a subfolder entitled “A_Raw_Reads” inside the folder “1_Raw_Sequence” using one of the following three methods.

1.3.1) To use the Data Store simple upload feature, open the Data window by clicking on the Data button in the main DE desktop, and select Upload | Simple Upload from Desktop in the Data window toolbar. Select the Browse button to navigate to the raw FASTQ sequencing files on the local computer. This method is only suitable for files under 2 GB.

1.3.2) Select the Upload button at the bottom of the screen to submit the upload. A notification will register in the top right of the DE in the bell icon that the upload has been submitted. Another notification will register when the upload is complete. 

1.3.3) Alternatively, Cyberduck can be used to transfer larger files (https://pods.iplantcollaborative.org/wiki/x/pYcVAQ). Cyberduck must be installed and then run like a program on the local computer’s desktop. 

1.3.4) Lastly, download iCommands and install onto the local computer according to instructions (https://pods.iplantcollaborative.org/wiki/display/DS/Using+iCommands). 
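
Once iCommands is installed and configured, an upload is typically a single iput call. The sketch below only prints the command so it can be reviewed before running; iput_cmd, the local folder raw_reads, and the project name My_Project are hypothetical, and the destination path assumes the folder layout from step 1.2 under /iplant/home/<username>.

```shell
# Print the iput command that would upload a local folder of FASTQ files.
# Flags: -r recursive, -P show progress, -K verify checksums after transfer.
# iput_cmd is a hypothetical helper; paths must not contain spaces.
iput_cmd() {
  local_dir=$1; user=$2; project=$3
  printf 'iput -r -P -K %s /iplant/home/%s/%s/1_Raw_Sequence/A_Raw_Reads\n' \
    "$local_dir" "$user" "$project"
}

# Example with placeholder values.
iput_cmd raw_reads your_CyVerse_username My_Project
```

After the transfer, ils on the destination folder shows what actually arrived, which is worth checking before deleting anything locally.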

1.4) Assess uploaded, raw sequencing reads using the FASTQC app in the DE.

1.4.1) Select the “Apps” button on the main DE desktop to open a window containing all of the analysis apps available in the DE. 

1.4.2) Search and open the window for the FASTQC tool in the search toolbar at the top of the window. Open the multi-file version if there is more than one FASTQ file. Select File | New Folder to create a folder named “B_FASTQC_Raw_Reads” and select this folder as the output folder. 

1.4.3) Load the FASTQ read files into the tool window called “Select input data” and select “Launch Analysis.” 

1.4.4) Open the .html or .pdf file to view the results once the analysis is complete. FASTQC runs several analyses that test different aspects of the read files (Figure 3).

2. Trim and quality filter raw reads to yield high quality sequence

2.1) Search for the programmable Trimmomatic app in the DE and open it as before.

2.1.1) Upload the folder of raw FASTQ read files into the “Settings” section.

2.1.2) Select whether the sequencing files are single- or paired-end. 

2.1.3) Use the standard control file provided by selecting the Browse button and pasting /iplant/home/shared/Trinity_transdecoder_trinotate_databases into the “Viewing:” box. Select the file named Trimmomaticv0.33_control_file and launch the analysis. The file can be downloaded, the settings edited, and then uploaded into the second project folder to create a custom trimming script. 

2.1.4) Optional: if the FASTQC analysis identified adapter sequences the ILLUMINACLIP setting can be used to trim Illumina adapters. Select the appropriate adapter file in the folder /iplant/home/shared/Trinity_transdecoder_trinotate_databases as above. 

2.2) Quality trimming sequence reads using Sickle.

2.2.1) Search for and open the Sickle app in the DE. Select the trimmed FASTQ reads as input reads, and rename the output files. Include quality settings in the options. Typical settings are Quality format: one of illumina, sanger, or solexa (matching the encoding of the reads); Quality threshold: 20; Minimum length: 50.
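
For reference, the settings above correspond to the following Sickle command line for paired-end reads. This is a sketch rather than the exact command the DE app runs: sickle_pe_cmd and the output file names are hypothetical, and the default of sanger assumes reads with Phred+33 quality encoding.

```shell
# Print a paired-end Sickle command using the typical settings from the text:
# quality threshold 20 (-q) and minimum length 50 (-l).
#   $1 = forward reads, $2 = reverse reads, $3 = quality type (default: sanger)
# sickle_pe_cmd is a hypothetical helper.
sickle_pe_cmd() {
  r1=$1; r2=$2; qual_type=${3:-sanger}
  printf 'sickle pe -t %s -q 20 -l 50 -f %s -r %s -o trimmed_%s -p trimmed_%s -s trimmed_singles.fastq\n' \
    "$qual_type" "$r1" "$r2" "$r1" "$r2"
}

sickle_pe_cmd reads_1.fq reads_2.fq
```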

2.2.2) Move all output into the trimmed and filtered folder (2_High_Quality_Sequence). 

2.3) Assess the final reads using FASTQC and compare to previous FASTQC reports.

Select the .html file to bring up a webpage of all results. If the webpage cannot be viewed, open the folder of image files (.png) provided in the output instead.
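
Alongside the FASTQC reports, a quick way to see how much data survived trimming is to count reads before and after. The sketch below is an optional addition, not part of the original protocol; it assumes uncompressed FASTQ files with exactly four lines per read, and the function names are hypothetical.

```shell
# Count reads in an uncompressed FASTQ file (four lines per read).
count_reads() {
  lines=$(wc -l < "$1")
  echo $((lines / 4))
}

# Percentage of reads kept after trimming, to one decimal place.
#   $1 = raw FASTQ file, $2 = trimmed FASTQ file
percent_kept() {
  raw=$(count_reads "$1")
  trimmed=$(count_reads "$2")
  awk -v r="$raw" -v t="$trimmed" \
    'BEGIN { if (r > 0) printf "%.1f\n", 100 * t / r; else print "NA" }'
}

# Example (hypothetical file names):
#   percent_kept reads_1.fq trimmed_reads_1.fq
```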

3. De novo transcriptome assembly using Trinity in Atmosphere

3.1) Open the most current version of the Atmosphere instance

Navigate to the Trinity wiki page and select the link for the most recent version of the Trinity and Trinotate image.

  1. Trinity 2.1.1 and Trinotate 2.0.2

Alternatively, search “Trinity” in the Atmosphere image search tool to bring up all versions of the Trinity and Trinotate images. 

3.1.1) Select the “Log in to launch” button and then name the Atmosphere instance. 

3.1.2) Select an instance size of either “medium3” (CPU: 4, Mem: 32GB) or “large3” (CPU: 8, Mem: 64 GB). Launch the instance, and wait for it to build. In some rare cases CyVerse undergoes maintenance to update platforms. Existing instances are available during these updates, but it may not be possible to create new instances. Visit the CyVerse Status page to see the current state of any platform.

3.2) Open the instance once it is ready by clicking on the name and then selecting “Remote Desktop” on the bottom of the menu on the right.

Note: Allow Java and VNC Viewer if asked. Select the “Connect” button in the VNC Viewer window, and then select “Continue.” 

3.2.1) Log in to open a separate window that will be the new cloud computing instance. 

Initialize iCommands for Data Transfer

  • iinit
  • Host: data.iplantcollaborative.org
  • Port: 1247
  • User: your_CyVerse_username
  • Zone: iplant
  • Password: your_CyVerse_password

3.2.2) Move the trimmed and/or filtered FASTQ read files into the instance using one of the three methods described in steps 1.3.1-1.3.4. Either use the Internet browser on the instance to access the DE and download files just as on the local computer, or use the iCommands installed on these images to quickly transfer large data sets.

3.3) Running Trinity to assemble high quality reads.

3.3.1) Set up the analysis folder on the Atmosphere instance. Use the app available in the DE (/iplant/home/shared/Trinity_transdecoder_trinotate_databases) or copy and paste the commands below.

Run Trinity Assembly 

For Atmosphere instances that are “medium3” (CPU: 4, Mem: 32GB)

  • Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 4 --max_memory 30G

For Atmosphere instances that are “large3” (CPU: 8, Mem: 64 GB)

  • Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 8 --max_memory 60G

To test example data in Trinity on Atmosphere instances that are “large3” (CPU: 8, Mem: 64 GB)

  • cd /tools/trinityrnaseq-2.1.1/sample_data/test_Trinity_Assembly/
  • sudo Trinity --seqType fq --left reads.left.fq --right reads.right.fq --CPU 8 --max_memory 60G
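
The two instance sizes differ only in the --CPU and --max_memory values, so the choice can be captured in one small helper. This is a sketch: trinity_cmd is a hypothetical name, and it prints the command for review instead of launching Trinity.

```shell
# Print the Trinity command matching an Atmosphere instance size.
#   $1 = instance size (medium3 or large3), $2 = left reads, $3 = right reads
# trinity_cmd is a hypothetical helper; the values mirror the commands above.
trinity_cmd() {
  size=$1; left=$2; right=$3
  case "$size" in
    medium3) cpu=4; mem=30G ;;
    large3)  cpu=8; mem=60G ;;
    *) echo "unknown instance size: $size" >&2; return 1 ;;
  esac
  printf 'Trinity --seqType fq --left %s --right %s --CPU %s --max_memory %s\n' \
    "$left" "$right" "$cpu" "$mem"
}

trinity_cmd medium3 reads_1.fq reads_2.fq
```

Keeping --max_memory slightly below the instance's total memory, as the commands above do, leaves headroom for the operating system.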

3.3.2) Once the analysis folder and the Trinotate databases are established, run the Trinity assembler using the commands from above. There are several output files, but the most important is the final assembly file entitled “Trinity.fasta.” Rename this FASTA file to be unique to the organism and treatment of the assembled reads before moving it into the Data Store (folder 3_Assembly) to minimize potential confusion. 
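
Renaming the assembly can be scripted so that the naming pattern stays consistent across transcriptomes. The following sketch assumes a pattern of organism, then treatment, then “Trinity.fasta”; the function name and the example names are hypothetical.

```shell
# Rename Trinity's output so the file name records organism and treatment.
#   $1 = path to Trinity.fasta, $2 = organism, $3 = treatment
# Prints the new path on success. rename_assembly is a hypothetical helper.
rename_assembly() {
  src=$1
  dest="$(dirname "$src")/${2}_${3}_Trinity.fasta"
  mv "$src" "$dest" && printf '%s\n' "$dest"
}

# Example (hypothetical names; trinity_out_dir is Trinity's default output folder):
#   rename_assembly trinity_out_dir/Trinity.fasta Genus_species Control
```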

3.3.3) Output counts tables for differential gene expression analysis into a folder (4_Differential_Expression).

  1. Navigate to the Discovery Environment
  2. Open the Kallisto workflow
    1. Add the assembly
    2. Add raw reads for quantification

3.4) Assess the assembly using rnaQUAST

3.4.1) Move the Trinity output files into the folder “3_Assembly” in the DE and label the folder “A_Trinity_de_novo_assembly.” Give each transcriptome that was assembled a subfolder inside the “A_Trinity_de_novo_assembly” folder with unique names including the scientific name of organisms and treatments associated with each transcriptome. Create another subfolder called “B_rnaQUAST_Output” in the “3_Assembly” folder.

3.4.2) Open the app titled “rnaQUAST 1.2.0 (denovo based),” name the analysis, and select “B_rnaQUAST_Output” as the output folder.

3.4.2.1) Add the de novo assembly FASTA file(s) to the “Data Input” section. In the “Data Output” section, type a unique name for the de novo assembly. This will create a folder of rnaQUAST output files inside of the folder “B_rnaQUAST_Output.”

3.4.3) Select additional options in the “GenemarkS-T Gene Prediction,” “BUSCO,” and “Parameters” sections. 

3.4.3.1) Select prokaryote in the “GenemarkS-T Gene Prediction” section if the organism is not eukaryotic. 

3.4.3.2) To run BUSCO, select the Browse button, copy the path iplant/home/shared/iplantcollaborative/example_data/BUSCO.sample.data into the “Viewing:” box, and press Enter. Select the most specific BUSCO folder that is available for the organism. Note: BUSCO will assess the assembly for lineage-specific core genes and report the percentage of core genes found. There are general folders, e.g. eukaryote, and more specific lineages, e.g. arthropoda.

3.5) Search for “Transcript decoder” and run Transdecoder on the de novo Trinity assembly output FASTA file in the Discovery Environment. 

3.6) Move the output .pep file into the de novo assembly (3_Assembly) folder for use in step 5 (Annotation).

4. Pairwise differential expression using DESeq2 in the DE 

For this section, you will need a counts table. Many tools can generate one, including RSEM, kallisto, and salmon.

4.1) Open the DESeq2 app in the DE as described previously. Name the analysis and select the output folder as 4_Differential_Expression. 

4.2) In the “Inputs” section, select the counts table produced for the Trinity assembly (step 3.3.3) and specify which column of that table contains the contig names.

4.3) Input the column headers from the counts data table file to determine which columns are compared. Include the commas between each of the conditions.

Do not include the first column header that contains the contig names. For replicates, repeat the same name (e.g., Treatment1rep1, Treatment1rep2, Treatment1rep3 would become Treatment1, Treatment1, Treatment1). In the second line, provide the names of the two conditions to be compared (e.g., Treatment1, Treatment2). Match the column header names provided in the first line. Note: These column headers must be alphanumeric and cannot contain any special characters. 
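
The first line described above can be derived directly from the header of the counts table. This sketch rests on assumptions that may not hold for every table: the file is tab-delimited, the first column holds the contig names, and replicates are marked with a “rep” suffix followed by a number, as in the example. The function name conditions_line is hypothetical.

```shell
# Build the comma-separated condition line from a counts table header.
# Skips the first (contig name) column and strips a trailing "rep<N>" suffix
# so that replicates of one condition share the same name.
conditions_line() {
  head -n 1 "$1" | awk -F '\t' '{
    out = ""
    for (i = 2; i <= NF; i++) {
      name = $i
      sub(/rep[0-9]+$/, "", name)
      out = out (i > 2 ? ", " : "") name
    }
    print out
  }'
}

# Example (hypothetical file name):
#   conditions_line counts_table.txt
```

A header of contig, Treatment1rep1, Treatment1rep2, Treatment2rep1, Treatment2rep2 would give “Treatment1, Treatment1, Treatment2, Treatment2”, which can be pasted into the app and checked against the table by eye.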

5. Annotation using Trinotate

5.1) Run each part of Trinotate in the Atmosphere cloud computing instance.

5.1.1) Run the bash command for searching Trinity transcripts. Change the number of threads to match the number of CPUs on the instance, i.e. medium3 has 4 CPUs and large3 has 8 CPUs. Refer to step 3.1.2 for more details. Change the file name Trinity.fasta in the command to match the assembly FASTA file name.

BASH Commands

Note: BLAST+ searches will require the most time. It may be days before it completes. The cloud computer activity can be checked in Atmosphere without having to bring up the VNC Viewer.

5.1.2) Run the bash command for searching Transdecoder-predicted proteins. As before, change the number of threads and the file name as described in step 5.1.1.
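
The exact commands are kept on the companion wiki. As a rough guide, the two BLAST+ searches follow the pattern suggested in the Trinotate documentation: blastx for the assembled transcripts and blastp for the Transdecoder-predicted proteins. The sketch below prints both commands; trinotate_blast_cmd is a hypothetical helper, and the database name uniprot_sprot.pep and the output file names are assumptions that may differ on a given image.

```shell
# Print a BLAST+ search command in the style used for Trinotate.
#   $1 = blastx (transcripts) or blastp (predicted proteins)
#   $2 = query FASTA file, $3 = number of threads (CPUs on the instance)
# The database name uniprot_sprot.pep is an assumption.
trinotate_blast_cmd() {
  program=$1; query=$2; threads=$3
  case "$program" in
    blastx|blastp) ;;
    *) echo "program must be blastx or blastp" >&2; return 1 ;;
  esac
  printf '%s -query %s -db uniprot_sprot.pep -num_threads %s -max_target_seqs 1 -outfmt 6 -out %s.outfmt6\n' \
    "$program" "$query" "$threads" "$program"
}

# Example for a large3 instance (8 CPUs); file names are placeholders.
trinotate_blast_cmd blastx Trinity.fasta 8
trinotate_blast_cmd blastp Trinity.fasta.transdecoder.pep 8
```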

5.1.3) Run the bash command for HMMER and change the number of threads as above.

5.1.4) Run the bash command for SignalP and TMHMM if needed. SignalP predicts signal peptides and TMHMM predicts transmembrane protein motifs.
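
For orientation, the protein-level searches described above usually take the following form in the Trinotate documentation. This is a sketch and not the wiki's own commands: protein_domain_cmds is a hypothetical helper, the Pfam database file name Pfam-A.hmm and the output file names are assumptions, and the signalp options shown are those of SignalP version 4.

```shell
# Print the HMMER, SignalP, and TMHMM commands for a protein FASTA file.
#   $1 = Transdecoder .pep file, $2 = number of threads for hmmscan
# Database and output file names are assumptions.
protein_domain_cmds() {
  pep=$1; threads=$2
  printf 'hmmscan --cpu %s --domtblout TrinotatePFAM.out Pfam-A.hmm %s > pfam.log\n' \
    "$threads" "$pep"
  printf 'signalp -f short -n signalp.out %s\n' "$pep"
  printf 'tmhmm --short < %s > tmhmm.out\n' "$pep"
}

# Example for a medium3 instance (4 CPUs); the file name is a placeholder.
protein_domain_cmds Trinity.fasta.transdecoder.pep 4
```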

5.2) Loading results into the SQLite database

5.2.1) Once all of the above analyses are completed, run the bash command to load output files into a final SQLite annotation database. Remove any commands for analyses that were not run.

5.2.2) Export the SQLite database into a .xls file for viewing in popular table viewers.
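
The instruction to remove commands for analyses that were not run can be automated by checking which result files exist. The sketch below prints one Trinotate load command per non-empty result file in the current folder, followed by the report command that writes the .xls file. It assumes the database (Trinotate.sqlite by default) has already been initialized, and the result file names are assumptions that must match the names used when the searches were run; trinotate_load_cmds is a hypothetical helper.

```shell
# Print Trinotate load commands only for analyses whose output exists,
# then the command that exports the annotation report.
#   $1 = SQLite database file (default: Trinotate.sqlite)
# Result file names on the right are assumptions.
trinotate_load_cmds() {
  db=${1:-Trinotate.sqlite}
  while read -r loader file; do
    if [ -s "$file" ]; then
      printf 'Trinotate %s %s %s\n' "$db" "$loader" "$file"
    fi
  done <<'EOF'
LOAD_swissprot_blastp blastp.outfmt6
LOAD_swissprot_blastx blastx.outfmt6
LOAD_pfam TrinotatePFAM.out
LOAD_tmhmm tmhmm.out
LOAD_signalp signalp.out
EOF
  printf 'Trinotate %s report > trinotate_annotation_report.xls\n' "$db"
}

trinotate_load_cmds
```

Reviewing the printed list before running it makes it obvious when an expected analysis is missing from the final annotation.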
