Evolinc in the Discovery Environment
Introduction and Overview
Author(s): Dr. Upendra Kumar Devisetty, CyVerse/University of Arizona and Dr. Andrew D. L. Nelson, School of Plant Sciences, University of Arizona
Goal
It is becoming increasingly apparent that a surprisingly large proportion of eukaryotic transcriptomes are comprised of long non-coding transcripts (or long non-coding RNAs). The identity, function, and the depth of evolutionary conservation of lncRNAs are relative unknowns in many systems. Thus, we have designed Evolinc to streamline lncRNA identification, as well as assist in identifying lncRNAs that are conserved at the genomic or transcriptomic level, thereby creating a set of conserved lncRNAs to probe for function.
Evolinc is a two-part workflow (Evolinc-I and Evolinc-II) to identify lncRNAs from an assembled transcriptome file (GTF output from Cuffmerge/Cuffcompare) and then determine the extent to which those lncRNAs are conserved in the genome and transcriptome of other species.
Overview
Evolinc-I: lncRNA identification
There are billions upon billions of RNA-Seq reads publicly available from which lncRNAs can be identified and more are being generated daily. Thus, what is needed are the computational resources to map and assemble these reads, and then a workflow to reproducibly identify the lncRNAs from these transcripts. Evolinc-I, working within the Discovery Environment (DE), is the perfect solution to identifying lncRNAs in your species of interest.
Nomenclature
While we focus on long intergenic non-coding RNAs (lincRNAs) in this protocol, Evolinc-I also identifies transcripts that may be natural antisense non-coding RNAs (NAT-lncRNAs) or intra-genic/intronic long non-coding RNAs. These latter two classes of lncRNAs require an additional level of skepticism, as their identification is heavily dependent on the type of RNA-Seq used, as well as the transcript assembly software. This is explained further in the FAQs section below. In addition, it is difficult to adequately address the evolution of transcripts that overlap with protein-coding genes, as their conservation is linked to the protein-coding gene itself. Thus, for the bulk of this tutorial, we will focus on lincRNAs.
Evolinc-II: Comparative genomic and transcriptomic analysis of lincRNAs
Your species of interest may have an overwhelmingly large dataset of uncharacterized lincRNAs. Determining where to start in your hunt for an interesting lincRNA can be difficult. One aspect examined with protein-coding genes is depth of conservation. LincRNA loci are typically poorly conserved. We propose that we can use the poor sequence conservation of lincRNA loci in general to filter large pools of lincRNAs down to a reasonable set of candidates for functional analysis. Depth of conservation, at the sequence level and at the transcriptional level, is the metric by which Evolinc-II helps you determine which lincRNAs to functionally characterize. With the understanding that conservation does not have to imply function, we think that this is the simplest and quickest way to identify target lincRNAs. As always, rigorous in vivo experimentation will be necessary to confirm any findings from these programs.
Evolinc-I
Methodology
Prerequisites
- A CyVerse account (Register for a CyVerse account at https://user.cyverse.org/).
- An up-to-date Java-enabled web browser. (Firefox recommended. If you wish to work with your own large datasets and upload them using iCommands, Chrome is not suitable due to its issues in utilizing 64-bit Java.)
- Mandatory arguments
- Cuffcompare output file in gtf format
- Reference genome file in fasta format
- Reference genome annotation file in gff format
- Optional arguments
- Transposable elements (TE) file in fasta format
- Transcription start site (TSS) file in gff format
- Known Long non-coding RNA (lincRNA) in gff format
Note
- Genomes can be imported directly into the DE from the repository using the Import from URL (Upload menu > Import From URL) command. For more information, see Uploading and Importing Data Items within the DE. Typically, genomes come compressed (genome.gzip is common). Search through the apps using the term "unzip" to identify the appropriate tool to decompress your genome files. After importing your genome and genome annotation files, check to make sure that the same naming convention is used in both files for chromosomes/scaffolds. Some repositories use ">1" to refer to chromose 1, while others use ">Chr1". Either case is fine, but both the genome.fasta file and the first column of the genome annotation.gff file should use the same system.
- The Transposable element dataset can either be from your species of interest or from a family of closely related species. For example, there is a maintained dataset of Brassicaceae transposable elements that can be used to screen putative A. thaliana lncRNAs. If you do not want to filter TE containing transcripts from your dataset, do not include a TE.fasta file.
- If you have not generated TSS data yourself, there are publicly available datasets of transcription start sites that may be useful, but only for a limited number of species.
- If you would like to compare your dataset against multiple public datasets of known IncRNAs, merge them into one gff document. If you have a Linux system available, concatenate the files and use Bedtools to sort the gff file. As with genomes and genome annotation, make sure that all chromosome-naming schemes are the same. See the FAQs for additional information.
Test/sample data
This tutorial uses Arabidopsis data (derived from several studies) that is stored in the Data Store at Community Data > iplantcollaborative > example_data > Evolinc.sample.data > Evolinc-I. The sample data include a Cuffcompare output file (sample_cuffcompare_out.gtf), a genomic DNA reference file (TAIR10_chr.fasta), a cDNA reference annotation file (TAIR10_genes.gff), a transposable elements file (TE_RNA_transcripts.fa), a TSS file (Sample_TSS_data.gff), and a known lincRNA file (Known_lincRNAs.gff).
- The Reference genome and cDNA annotation files are from the TAIR10 annotation.
- The Cuffcompare file is derived from a RNA-Seq study by Slutte, et al., (2012). From this publication, we first acquired the SRA files for Arabidopsis thaliana (from WT flower tissue, Col-0 ecotype), reassembled them using TopHat and Cufflinks within the DE, and then ran Cuffcompare to generate the final output.
- TSS data was kindly provided by Dr. Molly Megraw at Oregon State University.
- The known lincRNAs are derived from Liu, et al., 2012 (Plant Cell).
Starting an Evolinc-I job in the DE
- Open the DE Apps window and search for Evolinc-I v1.0.
- In the Analysis Name: Evolinc-I panel:
- Change the name for your analysis (optional).
- Enter any comments (optional).
- In the Select output folder field, click Browse and navigate to the folder of your choice. You can leave the default name iplant/home/username/analyses.
- To retain copies of the input files in your analysis results output folder, click the Retain Inputs checkbox.
- Click the Inputs panel:
- For the Cuffcompare output file, browse to select Sample_cuffcompare_out.gtf.
- For the Reference Genome file, browse to select TAIR10_chr.fasta.
- For the Reference Transcriptome annotation file, browse to select TAIR10_genes.gff.
- For the Transposable elements file (optional), browse to select TE_RNA_transcripts.fa.
- For the TSS file (optional), select Sample_TSS.gff.
Optional: For the Known lincRNA file, browse to select Sample_known_lncRNAs.gff.
Please note that the last three files in the Inputs panel are optional and will not affect the way in which Evolinc-I runs.
- Click the Output panel and enter the name of the output folder or you can leave the default name.
- Click Launch Analysis.
Output from Evolinc-I
The Evolinc-I workflow analysis produces a folder named Evolinc-I_output (default), containing several output files and an output folder:
- lincRNAs.fa - Final Long intergenic ncRNA transcripts in fasta format (contains all isoforms for each lincRNA locus)
- lincRNAs.bed - Final Long intergenic ncRNA transcripts in bed format
- lincRNA_updated.gtf - Final updated Cuffcompare output with the final Long intergenic ncRNA transcripts
lincRNA_demographics.txt
Output file 1: A text file showing the demographics of identified lincRNA
- final_Summary_table.tsv
Output file 2: A csv file showing the summary of each of the lincRNA generated
- lincRNA_piechart.png
Output file 3: Pie chart showing the percent of different kinds of lincRNAs
- lincRNAs.with.CAGE.support.annotated.fa - Final Long intergenic ncRNA transcripts that have overlapping with the TSS transcripts (generated only when TSS file is provided)
- lincRNAs.overlapping.known.lincs.fa - Final Long intergenic ncRNA transcripts in fasta format that have overlapping with the known lincRNA (generated only a known lincRNA file is provided)
- elapsed_time-evolinc-i.txt - A file showing the time taken by each of the different steps of Evolinc-I
- Other_lncRNA - A folder containing other files pertaining to non-lincRNAs:
- AOT.fa
- AOT.all.bed
- AOT_demographics.txt
- SOT.fa
- SOT.all.bed
- SOT_demographics.txt
- TE_containing_transcripts.fa
- TE_containing_transcripts.bed
Evolinc-II
Methodology
Prerequisites
- A CyVerse account (Register for a CyVerse account at https://user.cyverse.org/).
- An up-to-date, Java-enabled web browser. (Firefox recommended. If you wish to work with large datasets of your own and upload them using iCommands, since Chrome is not suitable due to its issues in utilizing 64-bit Java.)
- A tab-delimited "BLASTing" list, which tells the workflow which comparisons you want made and the order in which you want them made. An example is below.
- A folder containing all of your files within your DE directory. This folder should contain all of the files you are using in this analysis, including genomes, genome annotations, lincRNA.fasta files, BLASTing list, and Species list.
- A list of all the species you are analyzing (Species list), listed in 4-letter format and organized based on phylogenetic relationship. One species per line, separated by newlines and not carriage returns (a common PC issue). See example below.
- A BLAST e-value that you want to apply to all pair-wise comparisons. This value is naturally set at 1e-20, which we have found to be a good conservative value for comparisons across Brassicaceae.
Test/sample data
This tutorial uses data from the plant family Brassicaceae published in Nelson et al., 2016, stored in the Data Store, and available at Community Data > iplantcollaborative > example_data > Evolinc.II.sample.data. The sample data include a Sample_BLASTing_list.txt, 10 reference genomes, genome annotation files for these genomes, where available, a set of 100 lincRNAs described by Liu, et al., 2012, and an Atha_species_list.txt file of all the species used, ordered according to their known phylogeny. A description of how the files should be organized is below:
Nomenclature
Throughout the Evolinc-II documentation, we refer to a 4-letter format for naming files according to species. The 4-letter format corresponds to the first letter of the genus and the first three letters of the species. For example, Arabidopsis thaliana is Atha. Homo sapiens is Hsap, and so on. Abbreviating genus species in this manner simplifies things immensely and is crucial for Evolinc-II to make correct associations between file names and lincRNA orthologs.
Sample_BLASTing_list.txt: Getting the BLASTing list organized correctly is crucial. If the file names are incorrect, or if files are listed in the incorrect order, then Evolinc-II will fail. The BLASTing list should be tab-delimited, and organized with the subject (in BLAST terminology) genome in the first column, query lincRNA file in the second column, Query species ID in 4-letter format in the third column, and subject species ID in 4-letter format in the fourth column. The fifth column is where the subject genome annotation file goes, if you are including it. The sixth column is where the subject lincRNA.fasta file go, if you are including it. The fifth and sixth columns do not have to be populated, but only genome annotation can go in the fifth column, and only known lincRNAs can go in the sixth. For example:
Subject
GenomeQuery lincRNA
FileQuery Species
ID*Subject Species
ID*Subject Genome
Annotation FileSubject lincRNA.fasta
FileAtha_genome.fasta Atha_lincRNAs.fasta Atha Atha Atha_genes.gff Alyr_genome.fasta Atha_lincRNAs.fasta Atha Alyr Alyr_genes.gff Alyr_known_lincRNAs.fasta Crub_genome.fasta Atha_lincRNAs.fasta Atha Crub Crub_genes.gff Brap_genome.fasta Atha_lincRNAs.fasta Atha Brap Brap_known_lincRNAs.fasta Aara_genome.fasta Atha_lincRNAs.fasta Atha Aara * 4-letter format
This BLASTing list shows each of the different combinations you might run into when running Evolinc-II. Preferably, this file should either be generated in a text editor that allows you to see special characters (carriage returns, new lines, and tabs), or in the DE under File -> Create -> New Plain Text File. When creating the BLASTing list file, type in the name, press Tab, then the next name. If you have known lincRNAs to compare against but not a genome annotation file, make sure that there are two Tabs between the query ID and the name of the known lincRNA file (as shown in row 4 above). If you do not have a genome annotation file or a known lincRNA file, press Tab to create empty columns, but press Enter to proceed to the next line immediately after the query ID (as shown in row 3). If you have genome annotation files but not known lincRNAs, create a newline after the genome annotation file. Lastly, for reciprocal BLAST purposes, include a query genome - query lincRNA BLAST line, as shown in row 1. In addition to providing files necessary for the reciprocal BLAST step, it creates a file that tells the user how many other loci in the query genome share high similarity a query lincRNA. This is useful in determining if your lincRNA contains a highly repetitive element (such as TE, microRNA, or other RNA) that may influence your downstream decision-making process in regards to functional analysis.
Note that all files contain the 4-letter genus species format to start off, and then contain either genome, gene, known_lincRNA, etc. After the 4-letter ID, the file can be named anything you want. To keep things simple and prevent recording the wrong file name, we simplify by having all files of a similar type (i.e., genomes) to be named similarly. When creating the BLASTing file, this allows you to quickly scan for the file names to be of similar lengths. You do not have to provide a path length to the directory in which these files are stored. That is provided when you start the job. However, all of the files need to be in the same location. The rows in the BLASTing list can be in any order; using the example above, the Brap - Atha comparison could be listed first, and the Atha - Atha comparison last. All comparisons will be made, so the order is not important.
- Query lincRNA file: The query lincRNA file can be taken straight from Evolinc-I, or it can be obtained from a published or private resource. Depending on your downstream goals, you may want to screen your lincRNA file against known RNAs using the rFAM batch BLAST tool located here. RNAs stored in rFAM that typically show up in our datasets are snoRNAs, miRNAs, and occasionally rRNAs. These will skew the apparent conservation of lincRNA loci, as they exhibit on average higher sequence conservation than most lincRNAs in our experience. It may also be useful for the user to merge lincRNA isoforms derived from the same locus into one longer lincRNA, as this can also bias the final results.
- Folder of files: This folder can be anywhere within your directory on the Data Store. However, all files should be in the same folder and should be named exactly as they appear in the BLASTing list.
Species list: This file is a single column list of 4-letter format species IDs as mentioned above. The list needs to be arranged roughly according to phylogenetic relationships. If the exact relationships are not known, make an educated guess. For most published genomes, relationships are known at least at the level of genera.
Make sure the Species list does not contain any carriage returns.
For the example above, the species list would be:
- BLAST e-value: Lastly, an e-value is required for all pairwise comparisons. We would recommend empirically determining this, but start at 1e-20. We have found that lowering the e-value substantially (1e-5) doesn't pull out a large number of false positives, due to our inclusion of a reciprocal BLAST step. Typically low e-value hits do not BLAST back to the same locus, and are therefore removed. The only issue that may arise with low e-values is that time to completion of the job may go up as more sequences are processed at each step.
Starting an Evolinc-II job in the DE
- Open the DE Apps window and search for Evolinc-II.
- In the Analysis Name: Evolinc-II panel:
- Change the name for your analysis (optional).
- Enter any comments (optional).
- In the Select output folder field, click Browse and navigate to the folder of your choice, and then either enter the name of an output folder or leave the default name.
- To retain copies of the input files in your analysis results output folder, click the Retain Inputs checkbox.
- Click the Inputs panel:
- For the BLASTing list, browse to select the BLASTing_list.txt file. This file should contain 10 rows.
For query lincRNA, choose Atha_sample_lincRNAs.fasta.
For input files, navigate to the Evolinc.II.sample.data folder. This folder will appear empty, but do not worry, the files are there; you are simply telling Evolinc-II in which folder all of your files are located.
Choose the Sample_species_list.txt, which should have 10 entries.
Provide the query label, which is the 4-letter query ID. For this tutorial, it is simply Atha (for A. thaliana).
Under outputs, you can provide a name for the output folder. If not provided, it is given the generic ID "output".
Lastly, you can change the e-value used throughout Evolinc-II's comparisons. This is naturally set at 1e-20.
- Click the Output panel and either enter the name of the output folder or leave the default name.
- Click Launch Analysis.
Output from Evolinc-II
While there are many potentially interesting output files generated by Evolinc-II, only two are of immediate interest.
The first file is a bar graph depicting the percent of loci identified in other species that are orthologous in nature to the query lincRNAs.
In the example below, which was generated using the example A. thaliana dataset in the DE, we see that many of our query lincRNAs are identified in the genomes of other close relatives within Brassicaceae. This gives us an immediate snapshot of the level of conservation of our query lincRNAs, but it doesn't say much about what they are overlapping with in those other genomes.
Output file 1: Bar graph showing percent of loci identified in other species that are orthologous in nature to query lincRNAs
The second output file is a tab-delimited text file listing all of the query lincRNAs.
In the example below, which is a screenshot from the DE, you see the same species as above occupying a column. The query lincRNA IDs are listed to the left. If you read left to right, you see the species, in order of accepted phylogenetic relationship to A. thaliana (based on the species list you provided). If a homologous sequence was identified in a species, then at minimum there will be the term "Homolog" listed. This indicates that Evolinc-II successfully identified a genomic region with sufficient sequence similarity to your query lincRNA. If not, then the term "Not found" will be shown. As you search in species more distantly related to your query, you will recover fewer and fewer sequence homologs.
If Evolinc-II identifies a homologous region and that region overlaps with a known gene or known lincRNA, you will see those terms listed underneath the species ID instead of "Homolog". For instance, the query lincRNA At1NC000060 has sequence homologs in five of the six species listed. In addition, those homologous regions are expressed as a lincRNA in A. lyrata, and as a known gene in B. rapa. Depending on the level of annotation in the species you include in your analysis, Known_gene may indicate only protein-coding genes, or it may indicate a variety of gene types, such as lincRNAs, protein-coding genes, or transposable elements. In a later release of Evolinc-II, we will include in this table the gene ID of the overlapping gene to make back referencing easier.
Output file 2: Tab-delimited text file of all query lincRNAs
Troubleshooting tips
While Evolinc is still a work in progress (i.e., there are still bugs we are fixing), there are a few things that the user can do to make things run smoothly. In this section we will attempt to describe the most common causes of a failed analysis (or factors you should think about to get the most information out of your analyses). In the future, we will include questions that users have asked if they are not already addressed here. For each comment, we have listed which Evolinc app it will likely affect (Evolinc-I or Evolinc-II). If you have a question that is not addressed in this wiki, please feel free to email the authors, Upendra or Andrew.
- Concordance between genome.fasta, genome annotation.gtf, and cuffcompare/cuffmerge.gtf files (Both Evolinc-I and Evolinc-II):
- One common issue for anyone working with genome/transcriptome assembly is lack of concordance between genomes and their corresponding annotation files. Typically this is seen in differences in the way in which different files report the chromosome ID. For instance, one file might designate chromosome 1 as "Chr1", whereas another file might simply denote it as "1". We use Cufflink's gffread program to extract fasta sequences from the original cuffcompare/cuffmerge file. If there is no concordance between the three files, Evolinc-I will fail quickly. You can preview your genome, cuffmerge/cuffcompare, and genome annotation files in the DE. The first column of the .gtf files should match what comes immediately after the ">" in the genome.fasta file. If they do not, the easiest file to change is typically the genome annotation and cuffmerge/cuffcompare files, as they are smaller and easier to download. A simple find and replace in the command line or a text editor will fix this. We do not recommend using Excel or a similar spreadsheet program to perform this as they typically add quotes in unfriendly places after saving.
- In the future we will work on adding a check step in the beginning to make sure that these files agree with one another. We will update the FAQs when this happens.
- Extended lincRNA IDs with spaces (Evolinc-II):
- Another common issue comes from having long lincRNA IDs with spaces, for example "
LincRNA 1
". Please replace all spaces or any other special characters. For instance, we typically use an underscore. Thus, a less problematic ID would be "LincRNA_1
".
- Another common issue comes from having long lincRNA IDs with spaces, for example "
- Picking the right phylogenetic depth for comparative genomic analyses (Evolinc-II):
- The lincRNA loci appear to be conserved to varying degrees in different eukaryotic lineages. Most of what we know about lincRNA conservation comes from vertebrates and, to a lesser degree, plants. What we have seen in plants is that comparative genomic analyses of plant lincRNAs are most informative if restricted to the family level. However, informative comparisons can be made at much deeper nodes when starting with mammalian lincRNAs. Thus, we suggest picking a few genomes to start your analysis: a couple of close relatives and then a few that are much more diverged. Run that, and then start filling the analysis in with more species.
- Best practices for RNA-seq assembly prior to Evolinc (Evolinc-I):
- While this is not necessarily the appropriate place for listing all of the best practices for RNA-Seq assembly, it is important to mention that steps you make prior to Evolinc-I can have a profound effect on the number and quality of lincRNAs you get out. There are different strategies for lincRNA discovery, some conservative and some more exploratory. Many decisions made prior to Evolinc-I will determine whether the user is taking a conservative or exploratory approach. An excellent example of a conservative and thorough approach to identifying lincRNAs in humans can be see in Cabili et al., 2011 by John Rinn's group. In addition, a review of some of the best practices for RNA-Seq can be seen here. Both articles are open access and highly worth the read.
- In summary, the user should think about coverage and the number of tissues/independent RNA-Seq analyses in which a lincRNA has been identified. While read coverage is derived from Cufflinks and therefore should be addressed prior to Evolinc-I, the reproducibility of a lincRNA in your dataset can be partially addressed by performing Evolinc-I on tissues separately and then merging GTF files afterwards.
- To filter or not to filter lincRNAs against rFAM (Evolinc-II):
- We have found that a number of lincRNAs (the amount varies by species and manner in which RNA-Seq was performed) contain high-sequence similarity to (i.e., contain) known RNAs such as small nucleolar, microRNAs, and spliceosomal RNAs. The lincRNA molecules that contain these sequence motifs may indeed be lincRNAs; retaining them in your dataset prior to running Evolinc-II may skew your overall conservation rates, as these sequences tend to be more conserved. Because we are trying to identify lincRNAs that are conserved independently of these particular motifs, we tend to filter all lincRNAs identified by Evolinc-I using the batch search at rFAM. However, this is definitely just personal preference and whether you keep these RNAs or remove them depends on your ultimate goals.
Frequently Asked Questions
- How long do both apps run?
- It depends on the size of the input files. For example both apps take ~10 minutes to run on the test datasets.
- Do I need to use both apps?
- Not necessarily. It depends on your study. If you just want to identify lincRNA only and do some downstream analysis such as Differential Gene Expression using the identified lincRNA, then you can stop at Evolinc-I. If you want to do some evolutionary analysis of the identified lncRNA, then you will only proceed to Evolinc-II.
- I have already identified lincRNA. Can i still use Evolinc-II?
- You can certainly use Evolinc-II with the identified lincRNA from a different software/tool.
- Can I see the source code for the apps?
- Yes you certainly can. Since the manuscript is under preparation, we will only share the code upon request.
- Does the Docker image exist for both apps?
- Yes they do exist. Since the manuscript is under preparation, we will only share the Docker image upon request.