clusterReport

A utility to generate species-centric files and a report on the number of clusters produced by OrthoMCL v1.4. This app generates a log file with tables that contain counts of the sequences and clusters within and between species, and files of clusters for sequences that are uniquely present, uniquely absent, and shared for each species and pairwise species combination. See the explanation of output files below for details.

Notes:

Please visit Cluster Orthologs and Paralogs and Assemble Custom Gene Sets to see how this app fits into the larger workflow. Within that workflow, this app can be run directly on OrthoMCL v1.4 output, or on the output of the appendUnclustered app. Both options are shown below and are referred to throughout as 'without unclustered added' and 'with unclustered added', respectively. It is strongly recommended that you carefully inspect the example data for both options to understand how the reports and output files differ. As detailed in the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow, and in the appendUnclustered app documentation, output generated in the 'with unclustered added' example includes one single-sequence cluster for each sequence that was not clustered by OrthoMCL v1.4. Each of these is counted as a 'uniquely present' cluster by this app.
App adapted from a Perl script originally written by Chih-Horng Kuo.

Quick Start

To use clusterReport you will need the mcl directory produced by either the OrthoMCL v1.4 app or the appendUnclustered app, and the GG file created as part of the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow.

Test Data

Input test data for this app appears directly in the Discovery Environment in the Data window under Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 4_Concatenate_Multiple_Files_output

and

without unclustered added: Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 8_OrthoMCL_output -> Nov_14
with unclustered added: Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 9_appendUnclustered_output -> unclustered_Added

Output test data for this app appears directly in the Discovery Environment in the Data window under:

without unclustered added: Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 10_clusterReport_output -> without_unclustered_added
with unclustered added: Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 10_clusterReport_output -> with_unclustered_added

Input File(s)

Use Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 4_Concatenate_Multiple_Files_output -> GG_Combined.txt

and

without unclustered added: Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 8_OrthoMCL_output -> Nov_14 -> mcl/
with unclustered added: Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 9_appendUnclustered_output -> unclustered_Added/

as the test input file and mcl folder.

Parameters Used in App

There are no parameters for this app.

Output File(s) and Folder(s):

'logs' directory: Contains the job submission standard output and standard error files generated by CyVerse systems. Usually this will only be important for troubleshooting if your job does not run.
'cluster_Report' directory: Contains 1 log file and multiple species-centric files.
- 0_clusterReport.log contains 2 tables. Each uses the 2-letter abbreviation for each species created in the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow (Tax column). The table headers are:
  - Tax = 2-letter abbreviation for each species
  - nSeq = number of sequences
  - nGroup = number of clusters
  - +(UPre) = number of clusters where sequences from this species are uniquely present
  - -(UAbs) = number of clusters where sequences from this species are uniquely absent
  - The first table shows the number of sequences input into OrthoMCL for each species (nSeq column), the number of clusters produced by OrthoMCL for each species (nGroup column), the number of clusters that contain sequences only from that species (+(UPre) column), and the number of clusters that contain sequences from all other species but none from that species (-(UAbs) column). +(UPre) and -(UAbs) can be thought of as clusters where sequences from that species are either 'Uniquely Present' or 'Uniquely Absent'.
  - The second table contains the number of clusters that contain sequences from each pairwise combination of species, again labeled with the 2-letter abbreviation for each species (Tax). Note that in the 'with unclustered added' scenario, the number of +(UPre) clusters in the first table can rise significantly because each unclustered sequence is added as a single-sequence cluster.
- species-specific files: The remaining files are named for the input species. This explanation uses the files in Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 10_clusterReport_output -> without_unclustered_added as an example. Files are named using the 2-letter abbreviation for each species created in the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow. 'Shared' files contain all OrthoMCL-detected clusters that contain sequences from both of the named species, regardless of the copy number in other species. Individual species files are either + or -. + files contain all OrthoMCL-detected clusters that contain sequences from only that species. These are the 'Uniquely Present' clusters from the first table in 0_clusterReport.log. - files contain all OrthoMCL-detected clusters that contain sequences from every examined species except that species. These are the 'Uniquely Absent' clusters from the first table in 0_clusterReport.log.
- Each species-specific file contains one cluster per line. Each line follows the format: ortholog cluster#(#species in cluster:#sequences in cluster, comma delimited list of the # of sequences per species) tab delimited list of protein-encoding gene ids (each followed by its 2-letter species abbreviation). For example, consider the file Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 10_clusterReport_output -> without_unclustered_added -> Shared_NC_TG.group. Line 3 shows OrthoMCL cluster ID 1000. This cluster contains 4 sequences from 4 species, 1 per species. There are 4 sequence IDs: 1000(4:4,NC:1,PF:1,TA:1,TG:1) NC03976(NC) PF04397(PF) TA00754(TA) TG00634(TG).
- Notes
  - Remember that the full OrthoMCL output was made by the OrthoMCL v1.4 app. All '.group' files generated here are derived directly from the all_orthomcl.out file. See Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 8_OrthoMCL_output -> Nov_14 -> all_orthomcl.out for an example. This should be considered part of the overall output.
  - OrthoMCL cluster IDs are arbitrary.
  - If there is more than one sequence ID from a species in the same cluster, this indicates that paralogs were detected.
  - In the with unclustered added scenario, the added single-sequence clusters will be counted as Uniquely Present in the 0_clusterReport.log and (+) files.
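
The '.group' line format and the 'Uniquely Present', 'Uniquely Absent', 'Shared', and paralog definitions above can be sketched in Python as follows. This is an illustrative reading of the format based only on the single example line shown above, not the app's own code. The function and class names are hypothetical, and the sketch assumes that fields are separated by whitespace, that sequence IDs contain no whitespace, and that the two numbers before the first comma are in the order described above (number of species, then number of sequences).

```python
import re
from typing import NamedTuple

# Header layout assumed from the documented example:
#   1000(4:4,NC:1,PF:1,TA:1,TG:1)
HEADER_RE = re.compile(
    r"^(?P<cluster_id>[^(\s]+)\((?P<n_species>\d+):(?P<n_seqs>\d+),(?P<counts>[^)]*)\)$"
)
# Member layout assumed from the documented example: NC03976(NC)
MEMBER_RE = re.compile(r"^(?P<seq_id>.+)\((?P<species>[^()]+)\)$")


class Cluster(NamedTuple):
    cluster_id: str
    n_species: int
    n_sequences: int
    counts: dict   # species abbreviation -> number of sequences
    members: list  # (sequence ID, species abbreviation) pairs


def parse_group_line(line):
    """Parse one line of a '.group' file into a Cluster (hypothetical helper)."""
    fields = line.split()  # assumption: tab- or space-separated fields
    if not fields:
        raise ValueError("empty line")
    header = HEADER_RE.match(fields[0])
    if header is None:
        raise ValueError(f"unrecognized cluster header: {fields[0]!r}")
    counts = {}
    for item in header["counts"].split(","):
        species, _, n = item.partition(":")
        counts[species] = int(n)
    members = []
    for field in fields[1:]:
        member = MEMBER_RE.match(field)
        if member is None:
            raise ValueError(f"unrecognized sequence field: {field!r}")
        members.append((member["seq_id"], member["species"]))
    return Cluster(header["cluster_id"], int(header["n_species"]),
                   int(header["n_seqs"]), counts, members)


def species_present(cluster):
    """Species with at least one sequence in the cluster."""
    return {s for s, n in cluster.counts.items() if n > 0}


def is_uniquely_present(cluster, species):
    """True if the cluster contains sequences from this species only."""
    return species_present(cluster) == {species}


def is_uniquely_absent(cluster, species, all_species):
    """True if every examined species except this one is in the cluster."""
    return species_present(cluster) == set(all_species) - {species}


def is_shared(cluster, species_a, species_b):
    """True if both species are in the cluster, whatever else is present."""
    return {species_a, species_b} <= species_present(cluster)


def species_with_paralogs(cluster):
    """Species contributing more than one sequence to the cluster."""
    return sorted(s for s, n in cluster.counts.items() if n > 1)


# The example line quoted above from Shared_NC_TG.group.
EXAMPLE_LINE = "1000(4:4,NC:1,PF:1,TA:1,TG:1)\tNC03976(NC)\tPF04397(PF)\tTA00754(TA)\tTG00634(TG)"
example = parse_group_line(EXAMPLE_LINE)
print(example.cluster_id, example.counts)
print("shared by NC and TG:", is_shared(example, "NC", "TG"))
print("species with paralogs:", species_with_paralogs(example))
```

For the example line, the cluster is shared by NC and TG (consistent with its appearing in Shared_NC_TG.group) and no species contributes more than one sequence, so no paralogs are indicated. In the 'with unclustered added' scenario, every appended single-sequence cluster would satisfy is_uniquely_present for its species.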