queryOrthoMCL 1.0

A utility to generate custom homolog sets based on the copy number of genes for each species. Search criteria are input in a text file created by the User.

Notes:

Please visit Cluster Orthologs and Paralogs and Assemble Custom Gene Sets to see how this app fits into the larger workflow. Based on this workflow, this app can be run directly on OrthoMCL v1.4 output, or on the output of the appendUnclustered app. Both input options are shown below and are referred to respectively as 'without unclustered added' and 'with unclustered added' throughout.
App adapted from a Perl script originally written by Chih-Horng Kuo.

Quick Start

To use queryOrthoMCL you will need the orthomcl.index and orthomcl.mclout files produced by either the OrthoMCL v1.4 or the appendUnclustered app, the GG file created as part of the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow, and a User-created text file of search criteria, described below.

Test Data

Input test data for this app appears directly in the Discovery Environment in the Data window under:
Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 4_Concatenate_Multiple_Files_output
Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 11_queryOrthoMCL_input
and, depending on the input option:

without unclustered added: Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 8_OrthoMCL_output -> Nov_14 -> mcl -> orthomcl.index and orthomcl.mclout
with unclustered added: Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 9_appendUnclustered_output -> orthomcl.index and orthomcl.mclout

Output test data for this app appears directly in the Discovery Environment in the Data window under:
Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 12_queryOrthoMCL_output

Input File(s)

Use Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 4_Concatenate_Multiple_Files_output -> GG_Combined.txt

Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 11_queryOrthoMCL_input -> minMax.txt

and

without unclustered added: Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 8_OrthoMCL_output -> Nov_14 -> mcl -> orthomcl.index and orthomcl.mclout

with unclustered added: Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 9_appendUnclustered_output -> orthomcl.index and orthomcl.mclout

Format of the minMax.txt file:

Each User must create their own minMax.txt for each unique query they wish to make. You can create the file in a plain-text editor (not word processing software such as Microsoft Word) and import it into the Discovery Environment, or create it within the Discovery Environment (Data Tab -> File -> Create -> New Plain Text File). Format the file with 3 tab-delimited fields per line and one line per species. The fields are:

2-letter species abbreviation - minimum number of genes for that species - maximum number of genes for that species

See the above file for an example used as part of the Cluster Orthologs and Paralogs and Assemble Custom Gene Sets workflow. Below is an example of the format.

NC,1,2
PF,1,2
TA,1,2
TG,1,2

Note: Some testing of the minMax.txt input file will help you understand the search criteria. For both the min and max values, only clusters that meet the criteria for every listed species will be returned. For example, if you search for min=1 and max=3, only clusters with between 1 and 3 genes for that species will be returned. If you search for min=0 and max=999999, clusters with any number of genes for that species will be returned (assuming that the species has fewer than 999999 genes).
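
The search criteria described in the note can be illustrated with a short Python sketch. This is not the app's code: the function names `parse_min_max` and `cluster_matches` are hypothetical, and the sketch assumes that both bounds are inclusive and that a species absent from a cluster counts as 0 genes. It also accepts either tabs or commas between fields, which is a convenience of the sketch rather than documented behavior of the app. Species that appear in a cluster but not in the criteria are simply ignored here.

```python
import re


def parse_min_max(text):
    """Parse minMax-style lines into {species: (min, max)}.

    Assumption: fields may be separated by a tab or a comma.
    """
    criteria = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        species, lo, hi = re.split(r"[\t,]", line)
        criteria[species] = (int(lo), int(hi))
    return criteria


def cluster_matches(counts, criteria):
    """Return True if every listed species has a gene count within [min, max].

    Assumption: bounds are inclusive, and a missing species counts as 0.
    """
    return all(
        lo <= counts.get(species, 0) <= hi
        for species, (lo, hi) in criteria.items()
    )


criteria = parse_min_max("NC\t1\t2\nPF\t1\t2\nTA\t1\t2\nTG\t1\t2\n")
print(cluster_matches({"NC": 2, "PF": 2, "TA": 2, "TG": 2}, criteria))
print(cluster_matches({"NC": 3, "PF": 2, "TA": 2, "TG": 2}, criteria))
```

Under these assumptions the first cluster satisfies the criteria and the second does not, because NC has 3 genes and the maximum for NC is 2.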

Parameters Used in App

There are no parameters for this app.

Output File(s)

Expect 2 output files:

A log file with a summary of the search criteria
A '.group' file with the clusters that meet the search criteria, one per line. Each line follows the format: ortholog cluster#(#species in cluster:#sequences in cluster, comma-delimited list of the # of sequences per species), followed by a tab-delimited list of protein-encoding gene IDs (each followed by its 2-letter species abbreviation). For example, consider the file Community Data -> iplantcollaborative -> example_data -> homolog_clustering -> 12_queryOrthoMCL_output -> Query.group. Line 3 shows OrthoMCL cluster ID 67. This cluster contains 8 sequences from 4 species, 2 per species. The 8 sequence IDs follow the summary:
67(4:8,NC:2,PF:2,TA:2,TG:2) NC03984(NC) NC04882(NC) PF03069(PF) PF04899(PF) TA00779(TA) TA01773(TA) TG00643(TG) TG01355(TG)
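
For downstream processing, a line in this format can be split into its parts with a few lines of Python. The following is a minimal sketch based only on the example line above. The function name `parse_group_line` and the returned dictionary keys are hypothetical, and the sketch assumes that gene entries are separated by whitespace (tabs or spaces) and that each entry has the form ID(species).

```python
import re

# Summary part: cluster ID, species count, sequence count, per-species counts.
GROUP_LINE = re.compile(r"^(\S+?)\((\d+):(\d+),([^)]*)\)\s*(.*)$")


def parse_group_line(line):
    """Split one '.group' line into its summary fields and gene list."""
    match = GROUP_LINE.match(line.strip())
    if match is None:
        raise ValueError("line does not look like a .group entry: %r" % line)
    cluster_id, n_species, n_sequences, per_species, members = match.groups()

    # "NC:2,PF:2" -> {"NC": 2, "PF": 2}
    counts = {}
    for item in per_species.split(","):
        species, number = item.split(":")
        counts[species] = int(number)

    # "NC03984(NC)" -> ("NC03984", "NC")
    genes = []
    for token in members.split():
        gene_id, _, rest = token.partition("(")
        genes.append((gene_id, rest.rstrip(")")))

    return {
        "cluster_id": cluster_id,
        "n_species": int(n_species),
        "n_sequences": int(n_sequences),
        "counts": counts,
        "genes": genes,
    }


example = ("67(4:8,NC:2,PF:2,TA:2,TG:2)\tNC03984(NC)\tNC04882(NC)\t"
           "PF03069(PF)\tPF04899(PF)\tTA00779(TA)\tTA01773(TA)\t"
           "TG00643(TG)\tTG01355(TG)")
parsed = parse_group_line(example)
print(parsed["cluster_id"], parsed["n_species"], parsed["n_sequences"])
print(parsed["counts"])
```

The cluster ID is kept as a string so that the sketch does not depend on IDs being purely numeric.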