fRNAkenseq (HTseq-with-BAM-input) Manual
Before Beginning, Please Create a CyVerse Account
One of the unique features of fRNAkenseq is that it is well integrated with the computational infrastructure of CyVerse. This allows communication and data sharing between fRNAkenseq and other powered-by-CyVerse sequence analysis applications. Authentication into fRNAkenseq is authentication with CyVerse. Once you have created a CyVerse account you will be able to use fRNAkenseq, in addition to other Powered-by-CyVerse tools such as the syntenic analysis tool, CoGE. This data cross-talk is useful as it provides a single entry point to multiple types of analysis. Because of the way fRNAkenseq is configured, once you create a CyVerse account you will need to log into CoGE before fRNAkenseq.
CyVerse Account Creation:
https://user.cyverse.org/register/
CoGE Link:
https://genomevolution.org/coge/
INTRODUCTION TO RNA SEQ AND INFORMATICS ANALYSIS
Transcriptome bioinformatics begins with the sequencing of a cDNA library, the results of which are typically received back from a sequencing center in the form of a FASTQ file. During library preparation, mRNA is reverse transcribed to DNA. Once sequenced, each read of cDNA can be aligned to a reference genome to determine the gene from which it was most likely transcribed. Because the number of cDNA molecules associated with a given gene should be proportionate to the number of transcribed RNA molecules, this allows a reasonable measure of the relative level of expression for a gene. A typical FASTQ file contains anywhere from 10 – 100 million reads. Enrichment analysis can be thought of as consisting of three conceptual steps: (1) mapping each read to the gene it was transcribed from, (2) quantifying the number of reads aligning to a given gene to calculate the relative expression level of that gene and (3) determining if, across experimental conditions, there is a significant change in the relative expression for any gene(s). Accurate quantification, however, is complicated and requires normalization for library size in addition to specialized statistics that determine if a given gene is enriched in a certain condition. fRNAkenseq is designed to help researchers make sense of high dimensional transcriptome data and determine which genes are most likely to be truly enriched. It does this by providing enrichment predictions from several popular algorithms. The researcher can then choose candidates for validation from the intersection of the different gene lists predicted to be enriched by each algorithm. The mapping component of transcriptome analysis is conducted by the MapCount section of fRNAkenseq. The enrichment prediction step is completed in the DiffExpress portion of fRNAkenseq.
DESCRIPTION OF WORKFLOW AND ALGORITHMS
Utilizing fRNAkenseq, affectionately abbreviated fRNAk, is simple. To complete the first two steps of RNA seq analysis- mapping and transcript quantification - simply navigate from the main page to MapCount. Select the libraries for which you want to quantify gene expression. Choose the genome representing the organism your samples are from. This genome will be pulled from the databank of over 20,000 fasta and annotation pairs available to fRNAk. These genomes will be processed by fRNAk using BowTie2 in order to enable use of the TopHat2 mapping algorithm which requires index FASTA files (Langmead et al., 2012). Also, choose the number of processors to devote to the mapping algorithms in order to parallelize their operations.
MapCount:
Read mapping is accomplished by TopHat2. TopHat2 is used as a robust splice-aware mapper and produces .bam files easily processed by downstream algorithms (Dachwan, 2013). Transcript quantification in fRNAk is accomplished by two algorithms producing separate output files: Cufflinks and HTSeq-Count. Both Cufflinks and HTSeq-Count rely on the .bam output from TopHat2. The separate roles of the varying output files is described well in the Introduction section of fRNAk. Cufflinks estimates expression in the normalizing units of FPKM (Fragments per kilo-base per million reads mapped). FPKM normalizes for both gene length and library size. HTSeq-Count, however, provides output in raw counts. Both HTSeq-Count and Cufflinks quantify transcript abundance by “binning” mapped reads to their associated genes as described in an annotation file. MapCount executes all of these programs through the interface which would otherwise be run by command line. Two separate algorithms are needed for transcript quantification due to the assortment of algorithms in DiffExpress and their varying requirements for data input formats. Once processed by MapCount, libraries can be used as input for DiffExpress.
DiffExpress:
DiffExpress detects statistically significant changes in gene expression in multiple libraries across conditions. To execute DiffExpress, select the option from the main menu. On the DiffExpress page, library numbers will be displayed. These libraries have gone through MapCount. Select libraries to represent each condition and provide each group with a corresponding name. Name this run of DiffExpress and select fRNAkenseq crunch. An email will be sent upon completion of DiffExpress so that you may return and view your results. DiffExpress runs the following enrichment prediction algorithms: CuffDiff(Trapnell et al., 2014), BaySeq (Hardcastle,, 2010), EdgeR (Robinson et al., 2010) and DESeq2 (Love et al., 2014). Each utilizes slightly different statistical approaches to identify significant enrichment. The most accurate and sensitive algorithm depends on the landscape of the data. Therefore, in order to avoid false positives, it is best to consider predictions of multiple algorithms. fRNAk makes it easy to do this. After DiffExpress completes, fRNAk provides options to explore results. To begin accessing results from MapCount or DiffExpress, from the main menu select Visualize Data. To visualize expression at the gene or isoform level through CummeRbund, at the Visualize Data page select CummeRbund. Choose the gene(s) for which you would like to visualize expression levels as well as the type of graph. Note, however, that CummeRbund only considers output from CuffDiff. To approach your data from another perspective, select Pathway Level from the Visualize Data page. This will take the user to webCHIRP, a reactome-based text mining tool developed by Liang Sun to detect pathway-level features in data. To view raw data in tab delimited format, select the Data Level option. Links will be provided to the output of each algorithm in lists representing genes significantly enriched according to 0,1,2,3 and 4 algorithms.
Using fRNAk Step by Step
Step 1: Please upload the files using the file management system. Fastq files are massive. Therefore, fRNAk requires that they be gzipped prior to uploading. They must be named in the following format: 123_fastq.gz where “123” is any valid integer; there may also be an alphanumeric sequencing tag, i.e. AGCT following the library number. For example, “123_AGCT_fastq.gz” is a valid file name but “123_fastq.gz_AGCT” is not a valid file name. For a more thorough description of the various algorithms of the pipeline please visit the Introduction page on fRNAk.
Access Test Data Set:
I have prepared a small test data set, consisting of chicken breast tissue and chicken heart libraries: Simply click the link, download onto your machine and then upload into fRNAkenseq through the uploads page. Of the libraries in this test data set three are breast tissue (18600,18700, 18900) and three are heart (21100, 21200, 21300).
Step 2: Map your reads once the fastq files are loaded
fRNAkenseq MapCount
In order to run MapCount, a user merely navigates to the MapCount page and selects one or more libraries. If a library is paired end, a checkbox will appear allowing the user to select the option of stranded data. When you choose a genome, auto-complete will provide you with a list of genomes available from the CoGE database. Many of these are boutique genomes associated with specific projects. Because of this, the first option is always a “fRNAk approved” genome if it is available, ensuring that the draft is complete and high quality. The “fRNAk approved genomes” include a set of nine manually curated samples directly from NCBI of the following, highly used organisms: Unless a user specifically recognizes a genome or wishes to use their own personal draft uploaded through CoGE, it is recommended that they select the fRNAk approved genome when it exists for their species. Please allow several seconds for autocomplete with the current version. MapCount will not execute if the fastq files are not named properly or a valid genome is not selected.
DiffExpress Step 3:
In order to run DiffExpress, a user navigates to the DiffExpress page from the main menu. They then select libraries for the control and experimental groups and name the analysis, as well as choose a genome with which to execute DiffExpress. A user should select the same genome as that which was used to run MapCount on the libraries. Due to the high variability in transcriptome libraries, DiffExpress currently requires biological replicates. This means that two or more libraries must be selected for both the control and experimental conditions. If a user wishes to compare expression analyses between only two libraries, they may download the results directly from MapCount and use common statistical software packages such as JMP or EXCEL to conduct their analysis. IT IS RECOMMENDED not to do this, however, as it is impossible to calculate intra sample variability with such an approach.
Step 4:Downloading/Visualizing/Interpreting Data: fRNAk has several ways for a user to interact with and analyze their data. To reach data from any MapCount or DiffExpress run, select Visualize from the main menu. One is then presented the following page presenting three options with which to analyze data:
CummeRbund: CummeRbund is the final component of the Tuxedo pipeline for differential expression analysis, allowing visualization of genes at the isoform and gene level. CummeRbund is an R package and requires R command-line experience, but fRNAk runs it through the interface: Select the genes for which one wishes to view gene expression graphs and select whether one wishes to see plots of isoform or gene level expression data. Below is shown a sample graph for a chosen gene AHCY.
Pathway Analysis: Although currently for the Bird Genomics Community at the moment, users may also conduct pathway analysis using a tool developed by Liang Sun, GallusReactome Plus – a Gallus-oriented resource.
Retrieve Data:
Selecting Data Level Results allows a user to access flat files from a MapCount or DiffExpress Run. After select Data Level, a user will be taken to a page to select the analysis for which the wish to view results. Every MapCount run begins with a library_X where X is the library number. After Selecting a library, the user is then taken to a page with links to each of the relevant Cufflinks flat files, with a description of the content of each:
A user may also select a DiffExpress run from this same menu. Each DiffExpress Run is described by analysis_X where X is the name of the analysis ?chosen when the DiffExpress run is set up:
Links to tab delimited files with expression data for genes significantly enriched according to 0, 1, 2, 3 and all four enrichment analysis algoriths (cuffDiff,edgeR,BaySeq,DESeq2) is provided. Each of the flat files from CuffDiff is provided to the user, as well as a summary of the content of each file.
Column1: gene name
Column 2: cuffdiff_ctrl_FPKM -
The average FPKM for a given gene in all libraries for the control group
Column 3: cuffdiff_exp_FPKM -
The average FPMK for a given gene in all libraries for the experimental group
Column 4: cuffdiff_log2 -
The log2 fold-change between the control FPKM and the experimental FPKM
Column 5: Cuff_diff_test_stat -
The test statistic for the difference between the cuffdiff_ctrl_FPKM and the cuffdiff_exp_FPKM.
Column 6: cuffdiff_p_value -
The p-value associated with the test statistic
Column 7: cuffdiff_q_value -
The adjusted p-value, or q-value for the cuffdiff test statistic. The q-value is important because it is adjusted for multiple hypothesis testing.
Column 8: edger_logFC – logFC between control and experimental mean FPKM in the algorithm edgeR
Column 9: Log2-counts-per-million: Average counts per million for a given gene in edgeR, across all libraries.
Column 10: edger_Pvalue: p-value for enrichment for the given gene, according to the prediction of the algorithm edgeR
Column 11: edger_FDR – the equivalent of an adjust p-value for edgeR. For a clearer explanation of the interpretation of the edgeR FDR, please see the following link and read the post by Simon Anders, one of the authors: “http://seqanswers.com/forums/showthread.php?t=17011”
Column 12: deseq2_baseMean –The average normalize expression level for that gene across all libraries
Column 13: deseq2_log2FoldChange – The log2fold change between the expression level for that gene between the control and experimental conditions.
Column 14: deseq2_lfcSE: standard error estimate for the prediction provided by DESeq2 for the log2-fold-change.
Column 15: this is the test statistic for the deseq2 log2-fold-change
Column 16: deseq2_pvalue - this is the p-value for the test statistic of deseq2.
Column 17: deseq2_padj – this is the adjusted p-vale for deseq2 (essentially the desq2-equivalent to the edgeR FDR and the cuffdiff q-value)
After the deseq2 output, the number of columns for the BaySeq output will be variable. Each library will get a column showing its raw, un-normalized counts in the BaySeq output. This will be helpful to the user because, if the libraries are all approximately the same size, it can illustrate variation.
Following the raw library read-count, will be the BaySeq statistics. BaySeq relies on Bayesian statistics to execute differential expression analysis and, thus the BaySeq predictions require somewhat different interpretation. BaySeq tests the model that: either the control level of expression is greater than the experimental condition’s level of expression (1>2) or vice versa (1<2). This is the column: bayseq_Ordering. The BaySeq likelihood represents the likelihood of this decision (the column: bayseq_likelihood). The next column, “bayseq False Discovery Differential Expression” which can be abbrieviated by FDR is, again, similar to an adjusted p-value. The final column, “bayseq_Family Error Rate Differential Expression”, which can be abbreviated FWER, is a more stringent version of the FDR. Essentially, the difference between the FDR and FWER is the difference between the Benjamini-Hochberg correction and the Bonferroni correction – two similar methods for controlling false discoveries. See: https://wiki.nbic.nl/images/4/43/20110825_Vanzwet_statisticaltesting.pdf for a good explanation of this.
?INTERACTING WITH iPLANT:
?UPLOADING PERSONAL ANNOTATION
fRNAkenseq is built to interface with the computational infrastructure of the iPlant collaborative. This means that, like other powered-by-iPlant tools, fRNAkenseq is able to access many different data files associated with a given user’s iPlant ID. One such tool is CoGE. Currently, CoGE is a powered by iPlant tool used to conduct synteny analysis. CoGE allows users to work with 20,000 compiled annotation and FASTA files pairs of which are bundled together as a “genome” structure. fRNAkenseq communicates with CoGE in order to access these publicly available genomes as well as ones privately associated with a user’s iPlant account. A user can use CoGE to upload a FASTA and annotation file which are then formatted to CoGE’s bundled FASTA-annotation “genome” data structure.??In order to upload a personal genome for fRNAenstein, a user should go to CoGE at genomeevolution.org
Select “Load a New Genome” so that the next page appears.
Loading the Genome:
Describe the personal genome with the “Describe the Genome” box. Then, load the FASTA file by clicking the Add Data button. Select the method by which to load the FASTA file, if it is locally available, you will select “Upload”.
Selecting a Personal Genome Saved Locally
Select “Upload” and choose the FASTA file. After the FASTA file is uploaded successfully, you will be able to upload the corresponding annotation. When using fRNAkenseq, you will now be able to select this genome through the chosen name and its unique ID.
References
?Dachwan K, Pertea G.,Trapnell, C., Pimental, H., Kelly, R., Salzberg S.L. “TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions”. Genome Biology. 2013.??Robinson, M.D., McCarthy, D.J., Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010. Jan 1:28(1):139-140.??Trapnell, C.., Williams, B.A., Mortazavi, A., Kwan, G., van Baren., MJ..,Salzberg,S., Wold, B.J., Pachter, L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switch during cell differentiation. Nature Biotechnology. 28 511-515 (2010)??Love, M. Huber, W., Anders, S. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq with DESeq2”. Genome Biology. 2014, 15:550.??Langmead, B., Trapnell, C., Pop, M., Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences. Genome Biology 2009.??Langmead, B. Salzberg, S.L. Fast Gapped-Read alignment with Bowtie 2. Nat Methods. 2012 March 4:9(4)357-9.
?Hardcastle, T.J. Kelly, K.A. baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 2010. 11:422?