OrthoMCL v1.4

OrthoMCL version 1.4 uses parsed BLASTp input to cluster homologs based on a Markov Clustering Algorithm. It is part of a larger Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow.

Notes:

Please visit Cluster Orthologs and Paralogs and Assemble Custom Gene Sets to see how this app fits into the larger workflow.
Note that this is a stable version of the OrthoMCL algorithm, but it is not the latest version. Version 2.0 was re-engineered for large-scale analyses involving hundreds of genomes. Unless you plan to analyze data on that scale, version 1.4 will likely meet your needs. Please visit the OrthoMCL Website for more information, and to see whether your species of interest has already been included in the freely available OrthoMCL DB.

Quick Start

To use OrthoMCL v1.4 you will need a GG file of organized gene IDs and a BPO file of parsed BLASTp output. Both are produced as part of the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow.
Resources: Cluster Orthologs and Paralogs and Assemble Custom Gene Sets, OrthoMCL Website, MCL Website

Test Data

Input test data for this app appears directly in the Discovery Environment in the Data window under
Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 4_Concatenate_Multiple_Files_output -> GG_Combined.txt

and
Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 7_parseBlastBpo_output -> parsedBLASToutput_bpo.txt

Output test data for this app appears directly in the Discovery Environment in the Data window under
Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 8_OrthoMCL_output

Input File(s)

Use GG_Combined.txt and parsedBLASToutput_bpo.txt from the directory above as test input.

Parameters Used in App

For your first attempt, default parameters are strongly encouraged unless you are an advanced user. See the information callouts in the app interface for parameter explanations. See the OrthoMCL Website for pertinent publications and v1.4-specific release notes. If you wish to alter parameters to try to optimize results, the Inflation Index (aka -I flag) should likely be your first stop. See this page and section 7 of this page to get started.
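To give a feel for what the Inflation Index controls, the sketch below illustrates the inflation step of the general MCL algorithm: each value in a column of transition probabilities is raised to the power of the inflation value, and the column is then rescaled to sum to 1. This is an illustration only, not code from OrthoMCL or MCL. The function name `inflate` and the example numbers are hypothetical and are not the app's defaults.

```python
def inflate(column, inflation):
    """Apply an MCL-style inflation step to one column of probabilities.

    Each entry is raised to the power `inflation`, then the column is
    rescaled so that it sums to 1 again.
    """
    powered = [p ** inflation for p in column]
    total = sum(powered)
    return [p / total for p in powered]


# Hypothetical transition probabilities from one gene to three neighbours.
column = [0.6, 0.3, 0.1]

# Illustrative inflation values only.
low = inflate(column, 1.5)
high = inflate(column, 3.0)

print("inflation 1.5:", [round(p, 3) for p in low])
print("inflation 3.0:", [round(p, 3) for p in high])
```

A higher inflation value strengthens the strongest connection at the expense of the weaker ones, which in MCL generally leads to more, smaller clusters; a lower value tends to give fewer, larger clusters.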

Output File(s)

This explanation is based on the Output test data shown above in the 'Test Data' section. The main output directory contains:

'logs' directory: Contains the job submission standard output and standard error files generated by CyVerse systems. Usually this will only be important for troubleshooting if your job does not run.
'OrthoMCL_homolog_clustering_workflow_example.conf' file: Contains a record of the input files and parameters used. Use this to verify that your run executed as you intended. Some of this information is duplicated in files below, but is presented at this level for ease of use.
'OrthoMCL_homolog_clustering_workflow_example.log' file: Contains a log record of OrthoMCL output. This contains valuable information for you to ensure that the correct number of species and sequences went into the analysis and to see how many clusters were generated. While some lines of the file may not be meaningful to you, it is worth your time to have a look at this file as part of examining your output. Some of this information is duplicated in files below, but is presented at this level for ease of use.
'parsedBLASToutput_bpo.txt_bpo.idx': Contains OrthoMCL's index of the input BPO file. Unless you have a specific interest, you can ignore this file.
'Nov_14' directory: Contains the main program output and other accessory files needed for downstream processing in the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow.
- 'all_orthomcl.out' file: This is the main output file for the program. Each line of the text file contains one OrthoMCL cluster, i.e., a set of genes detected as evolutionarily related orthologs and paralogs. Each line begins with an arbitrary ID followed by the number of genes and the number of taxa in that cluster, followed by a list of gene IDs, each with the 2-letter taxon abbreviation chosen in the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow. The number of lines in this file should match the number of clusters generated, as recorded in the OrthoMCL_homolog_clustering_workflow_example.log file. For example, consider cluster number 610 from the example all_orthomcl.out file:

ORTHOMCL610(4 genes,4 taxa): NC01972(NC) PF02630(PF) TA00120(TA) TG07927(TG)

This cluster, number 610, has 4 genes from 4 taxa, one from each of the species used as input.
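If you want to process all_orthomcl.out programmatically, a line such as the one above can be split into its parts with a short script. The following is a minimal sketch that assumes every line follows the layout of the example cluster shown here. The function names `parse_cluster_line` and `is_single_copy` are hypothetical helpers written for this page, not part of OrthoMCL.

```python
import re

# Assumed layout, based on the example line above:
#   ID(N genes,M taxa): GENE(TAXON) GENE(TAXON) ...
CLUSTER_RE = re.compile(r"^(\w+)\((\d+) genes,\s*(\d+) taxa\):\s*(.*)$")
MEMBER_RE = re.compile(r"([^\s()]+)\((\w+)\)")


def parse_cluster_line(line):
    """Split one cluster line into its ID, counts, and (gene, taxon) pairs."""
    match = CLUSTER_RE.match(line.strip())
    if match is None:
        raise ValueError("Unrecognized cluster line: %r" % line)
    cluster_id, n_genes, n_taxa, members = match.groups()
    return {
        "id": cluster_id,
        "n_genes": int(n_genes),
        "n_taxa": int(n_taxa),
        "genes": MEMBER_RE.findall(members),
    }


def is_single_copy(cluster):
    """Return True if the cluster has exactly one gene from each of its taxa."""
    taxa = [taxon for _gene, taxon in cluster["genes"]]
    return len(taxa) == len(set(taxa)) == cluster["n_taxa"]


example = ("ORTHOMCL610(4 genes,4 taxa): "
           "NC01972(NC) PF02630(PF) TA00120(TA) TG07927(TG)")
cluster = parse_cluster_line(example)
print(cluster["id"], cluster["n_genes"], cluster["n_taxa"])
print(cluster["genes"])
print("single copy:", is_single_copy(cluster))
```

To process a whole file, call `parse_cluster_line` on each non-empty line of all_orthomcl.out. The number of parsed clusters can then be compared with the cluster count recorded in the log file.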

- 'all_orthomcl.pat' file: A pattern image file that summarizes the number of clusters, genes and taxa. You may ignore this file for the larger workflow.
- 'mcl' folder: Contains MCL program output. These files are used in later steps of the larger workflow, to add unclustered sequences to OrthoMCL output if desired.
- 'mtx' folder: Contains edge and weight output for each pairwise combination of species. These can be ignored at this step.
- 'orthomcl.log' file: Contains a log record of OrthoMCL output. Can be ignored as all information here is also in OrthoMCL_homolog_clustering_workflow_example.log.
- 'orthomcl.rbh' file: Contains reciprocal best hit data used by OrthoMCL to cluster homologs.
- 'orthomcl.setting' file: Contains a summary of the inputs, outputs, and parameters used for the analysis. There is some overlap between this file and OrthoMCL_homolog_clustering_workflow_example.conf above.

Tool Source for App:

See the Downloads section of the OrthoMCL Website.