Evolinc in the Discovery Environment

Introduction and Overview

Author(s): Dr. Upendra Kumar Devisetty, CyVerse/University of Arizona and Dr. Andrew D. L. Nelson, School of Plant Sciences, University of Arizona

Goal

It is becoming increasingly apparent that a surprisingly large proportion of eukaryotic transcriptomes consists of long non-coding RNAs (lncRNAs). The identity, function, and depth of evolutionary conservation of lncRNAs are largely unknown in many systems. We therefore designed Evolinc to streamline lncRNA identification and to help identify lncRNAs that are conserved at the genomic or transcriptomic level, thereby creating a set of conserved lncRNAs to probe for function.

Evolinc is a two-part workflow (Evolinc-I and Evolinc-II) to identify lncRNAs from an assembled transcriptome file (GTF output from Cuffmerge/Cuffcompare) and then determine the extent to which those lncRNAs are conserved in the genome and transcriptome of other species.

Overview

Evolinc-I: lncRNA identification

Billions of RNA-Seq reads from which lncRNAs can be identified are publicly available, and more are generated daily. What is needed are the computational resources to map and assemble these reads, and a workflow to reproducibly identify lncRNAs from the assembled transcripts. Evolinc-I, running within the Discovery Environment (DE), provides such a workflow for identifying lncRNAs in your species of interest.

Nomenclature

While we focus on long intergenic non-coding RNAs (lincRNAs) in this protocol, Evolinc-I also identifies transcripts that may be natural antisense non-coding RNAs (NAT-lncRNAs) or intra-genic/intronic long non-coding RNAs. These latter two classes of lncRNAs require an additional level of skepticism, as their identification is heavily dependent on the type of RNA-Seq used, as well as the transcript assembly software. This is explained further in the FAQs section below. In addition, it is difficult to adequately address the evolution of transcripts that overlap with protein-coding genes, as their conservation is linked to the protein-coding gene itself. Thus, for the bulk of this tutorial, we will focus on lincRNAs.

Evolinc-II: Comparative genomic and transcriptomic analysis of lincRNAs 

Your species of interest may have an overwhelmingly large dataset of uncharacterized lincRNAs. Determining where to start in your hunt for an interesting lincRNA can be difficult. One aspect examined with protein-coding genes is depth of conservation. LincRNA loci are typically poorly conserved. We propose that we can use the poor sequence conservation of lincRNA loci in general to filter large pools of lincRNAs down to a reasonable set of candidates for functional analysis. Depth of conservation, at the sequence level and at the transcriptional level, is the metric by which Evolinc-II helps you determine which lincRNAs to functionally characterize. With the understanding that conservation does not have to imply function, we think that this is the simplest and quickest way to identify target lincRNAs. As always, rigorous in vivo experimentation will be necessary to confirm any findings from these programs. 

Evolinc-I

Methodology

Prerequisites

  1. A CyVerse account (Register for a CyVerse account at https://user.cyverse.org/).

  2. An up-to-date Java-enabled web browser. (Firefox recommended. If you wish to work with your own large datasets and upload them using iCommands, Chrome is not suitable due to its issues in utilizing 64-bit Java.)

  3. Mandatory arguments

    1. Cuffcompare output file in gtf format 

    2. Reference genome file in fasta format 

    3. Reference genome annotation file in gff format 

  4. Optional arguments

    1. Transposable elements (TE) file in fasta format 

    2. Transcription start site (TSS) file in gff format

    3. Known long intergenic non-coding RNA (lincRNA) file in gff format
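Before uploading, it can help to confirm that each input at least resembles the format listed above. The following is a minimal sketch, not part of Evolinc: the function names are hypothetical, and the checks rely only on the general conventions that GTF/GFF records have nine tab-separated columns with integer start and end coordinates, and that FASTA files begin with a ">" header line.

```python
def looks_like_gff_or_gtf(lines, max_records=1000):
    """Return True if the first records have 9 tab-separated columns
    and integer start/end coordinates (columns 4 and 5).

    `lines` is any iterable of strings, e.g. an open file handle.
    This is a shallow sanity check, not a full format validator.
    """
    checked = 0
    for line in lines:
        # Skip comments, directives, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            return False
        if not (fields[3].isdigit() and fields[4].isdigit()):
            return False
        checked += 1
        if checked >= max_records:
            break
    # An input with no records at all is treated as a failure.
    return checked > 0


def looks_like_fasta(lines):
    """Return True if the first non-blank line is a FASTA header."""
    for line in lines:
        if line.strip():
            return line.startswith(">")
    return False


# Example use with real files (paths are placeholders):
# with open("sample_cuffcompare_out.gtf") as handle:
#     print(looks_like_gff_or_gtf(handle))
# with open("TAIR10_chr.fasta") as handle:
#     print(looks_like_fasta(handle))
```

Both functions accept any iterable of lines, so they can be tried on a short in-memory list before being pointed at a large genome file.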

Note

  • Genomes can be imported directly into the DE from the repository using the Import from URL (Upload menu > Import From URL) command. For more information, see Uploading and Importing Data Items within the DE. Typically, genomes come compressed (genome.gzip is common). Search through the apps using the term "unzip" to identify the appropriate tool to decompress your genome files. After importing your genome and genome annotation files, check that both files use the same naming convention for chromosomes/scaffolds. Some repositories use ">1" to refer to chromosome 1, while others use ">Chr1". Either convention is fine, but the genome.fasta file and the first column of the genome annotation.gff file must use the same one.

  • The transposable element dataset can be either from your species of interest or from a family of closely related species. For example, there is a maintained dataset of Brassicaceae transposable elements that can be used to screen putative A. thaliana lncRNAs. If you do not want to filter TE-containing transcripts from your dataset, do not include a TE.fasta file.

  • If you have not generated TSS data yourself, there are publicly available datasets of transcription start sites that may be useful, but only for a limited number of species.

  • If you would like to compare your dataset against multiple public datasets of known lncRNAs, merge them into one gff document. If you have a Linux system available, concatenate the files and use Bedtools to sort the gff file. As with genomes and genome annotation, make sure that all chromosome-naming schemes are the same. See the FAQs for additional information.
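The naming check and the merge step described in the notes above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Evolinc code: all function names are hypothetical, the sort is a plain lexicographic sort on sequence name followed by start and end coordinates (a stand-in for the concatenate-and-sort step, not a reproduction of Bedtools behavior), and comment or directive lines are dropped during the merge.

```python
def fasta_seq_ids(lines):
    """Collect sequence names (first word after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def gff_seq_ids(lines):
    """Collect the values found in the first column of GFF/GTF lines."""
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        ids.add(line.rstrip("\n").split("\t", 1)[0])
    return ids


def naming_mismatches(fasta_lines, gff_lines):
    """Return annotation sequence names that are absent from the genome."""
    return sorted(gff_seq_ids(gff_lines) - fasta_seq_ids(fasta_lines))


def merge_and_sort_gff(*gff_sources):
    """Concatenate several GFF line sources and sort the records by
    sequence name, then start, then end. Comment lines are dropped."""
    records = []
    for source in gff_sources:
        for line in source:
            if line.startswith("#") or not line.strip():
                continue
            records.append(line.rstrip("\n").split("\t"))
    records.sort(key=lambda f: (f[0], int(f[3]), int(f[4])))
    return ["\t".join(fields) for fields in records]


# Example use with real files (paths are placeholders):
# with open("TAIR10_chr.fasta") as fa, open("TAIR10_genes.gff") as gff:
#     print(naming_mismatches(fa, gff))
```

An empty list from naming_mismatches means every sequence name in the annotation also appears in the genome. A non-empty list, such as names like "1" when the genome uses "Chr1", points to the naming problem described above.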

Test/sample data

This tutorial uses Arabidopsis data (derived from several studies) that is stored in the Data Store at Community Data > iplantcollaborative > example_data > Evolinc.sample.data > Evolinc-I. The sample data include a Cuffcompare output file (sample_cuffcompare_out.gtf), a genomic DNA reference file (TAIR10_chr.fasta), a cDNA reference annotation file (TAIR10_genes.gff), a transposable elements file (TE_RNA_transcripts.fa), a TSS file (Sample_TSS_data.gff), and a known lincRNA file (Known_lincRNAs.gff). 

  • The Reference genome and cDNA annotation files are from the TAIR10 annotation. 

  • The Cuffcompare file is derived from an RNA-Seq study by Slutte et al. (2012). From this publication, we first acquired the SRA files for Arabidopsis thaliana (from WT flower tissue, Col-0 ecotype), reassembled them using TopHat and Cufflinks within the DE, and then ran Cuffcompare to generate the final output.

  • TSS data was kindly provided by Dr. Molly Megraw at Oregon State University. 

  • The known lincRNAs are derived from Liu, et al., 2012 (Plant Cell).

Starting an Evolinc-I job in the DE

  1. Open the DE Apps window and search for Evolinc-I v1.0.

  2. In the Analysis Name: Evolinc-I panel:

    1. Change the name for your analysis (optional).

    2. Enter any comments (optional).

    3. In the Select output folder field, click Browse and navigate to the folder of your choice. You can leave the default name iplant/home/username/analyses.

    4. To retain copies of the input files in your analysis results output folder, click the Retain Inputs checkbox.

  3. Click the Inputs panel:

    1. For the Cuffcompare output file, browse to select Sample_cuffcompare_out.gtf.

    2. For the Reference Genome file, browse to select TAIR10_chr.fasta.

    3. For the Reference Transcriptome annotation file, browse to select TAIR10_genes.gff.

    4. For the Transposable elements file (optional), browse to select TE_RNA_transcripts.fa.

    5. For the TSS file (optional), browse to select Sample_TSS.gff.

    6. For the Known lincRNA file (optional), browse to select Sample_known_lncRNAs.gff.

  4. Click the Output panel and enter a name for the output folder, or leave the default name.

  5. Click Launch Analysis.

Output from Evolinc-I

The Evolinc-I workflow analysis produces a folder named Evolinc-I_output (default), containing several output files and an output folder:

  • lincRNAs.fa - Final long intergenic ncRNA transcripts in fasta format (contains all isoforms for each lincRNA locus)

  • lincRNAs.bed - Final long intergenic ncRNA transcripts in bed format

  • lincRNA_updated.gtf - Final updated Cuffcompare output with the final long intergenic ncRNA transcripts

  • lincRNA_demographics.txt - A text file showing the demographics of the identified lincRNAs

  • final_Summary_table.tsv - A tab-separated file summarizing each identified lincRNA
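Once the analysis finishes, a quick look at lincRNAs.fa can confirm that the run produced a plausible set of transcripts. The following is a minimal sketch for that purpose. The function names are hypothetical, the transcript IDs in the test data are made up, and the sketch assumes only that the file is standard FASTA. It does not reproduce the contents of lincRNA_demographics.txt.

```python
def fasta_lengths(lines):
    """Map each FASTA record name (first word of the header) to its
    sequence length. `lines` is any iterable of strings."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def summarize(lengths):
    """Return the count, minimum, maximum, and mean of the lengths."""
    values = sorted(lengths.values())
    if not values:
        return {"count": 0, "min": None, "max": None, "mean": None}
    return {
        "count": len(values),
        "min": values[0],
        "max": values[-1],
        "mean": sum(values) / len(values),
    }


# Example use on the Evolinc-I output (path is a placeholder):
# with open("Evolinc-I_output/lincRNAs.fa") as handle:
#     print(summarize(fasta_lengths(handle)))
```

Because lincRNAs.fa contains all isoforms for each locus, the count reported here is a number of transcripts, not a number of loci.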