SOFTPARSMAP

Prepared by: Liya Wang

Tool Name: Softparsmap

Homepage: http://www.cbu.uib.no/~steffpar/softparsmap/

Platforms: UNIX, WINDOWS

Implementation Language: JAVA

Documentation/Manual: http://www.cbu.uib.no/~steffpar/softparsmap/manual/manual.pdf

Purpose: Softparsmap is a Java package that uses ‘soft parsimony’ to root gene trees by mapping them onto a species tree.

Description:
Softparsmap outputs the rooting that minimizes the number of gene duplication and gene loss events. The parsimony is ‘soft’ because, prior to rooting, branches with support below a user-defined threshold are changed so that they agree with the species tree. In-paralogs can be removed prior to rooting, after rooting, or both, with the most recent complete sequence kept. Currently available features of Softparsmap are:

  • rooting free gene trees by minimizing duplications and losses while allowing weak edges to be collapsed
  • removing in-paralogs while rooting the gene tree or just before the result is saved
  • resolving uncertainties by inserting splits from the species tree and using out-groups
  • mapping rooted gene trees onto species trees
  • comparing gene trees using splits
  • a modular design: the package is divided into parts that can be replaced to meet specific needs
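
The weak-edge collapsing mentioned above can be illustrated with a minimal Python sketch. It is not Softparsmap's own code: the `Node` class, `collapse_weak_edges`, `to_newick`, and the threshold of 70 are hypothetical names and values chosen for illustration. The sketch covers only the collapsing step, which turns weakly supported branches into multifurcations; re-resolving those multifurcations against the species tree is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node used only for this illustration."""
    name: Optional[str] = None        # leaf label; None for internal nodes
    support: Optional[float] = None   # support of the branch above this node
    children: List["Node"] = field(default_factory=list)


def collapse_weak_edges(node: Node, threshold: float) -> Node:
    """Return a copy of the tree in which every internal branch with
    support below `threshold` is removed, so that the children of the
    removed node attach directly to its parent (a multifurcation)."""
    new_children = []
    for child in node.children:
        collapsed = collapse_weak_edges(child, threshold)
        is_weak = (
            bool(collapsed.children)
            and collapsed.support is not None
            and collapsed.support < threshold
        )
        if is_weak:
            new_children.extend(collapsed.children)  # drop the weak branch
        else:
            new_children.append(collapsed)
    return Node(node.name, node.support, new_children)


def _format(node: Node) -> str:
    if not node.children:
        return node.name
    inner = ",".join(_format(c) for c in node.children)
    label = "" if node.support is None else format(node.support, "g")
    return f"({inner}){label}"


def to_newick(node: Node) -> str:
    """Write the tree in Newick form with support as internal node labels."""
    return _format(node) + ";"


# Toy tree ((A,B)40,(C,D)95); the support values are made up.
tree = Node(children=[
    Node(support=40, children=[Node("A"), Node("B")]),
    Node(support=95, children=[Node("C"), Node("D")]),
])
print(to_newick(collapse_weak_edges(tree, 70)))  # (A,B,(C,D)95);
```

With the threshold set to 70, the branch with support 40 is removed and A and B join the root, while the branch with support 95 is kept.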

Inputs: Sequence Information, Gene Trees, Species Trees and Property File

  • Sequence information is supplied as a Darwin database file. Other formats can be supported by including a new specification in the property file. Each sequence in the file should contain the following information in the following format:

sequence_identifier
peptide_sequence
Genus_species
gi_number
completeness

  • Gene trees are supplied in Newick format, with bootstrap or branch support values given as internal node labels. Other variations on the Newick format can be supported by including a new specification in the property file. The tree files should be in the directory specified by the family group and should be named family{identifier}.tree, where {identifier} is an integer that uniquely identifies each tree.
  • Species tree information is supplied as the GenBank files names.dmp and nodes.dmp, which are available from NCBI Taxonomy. You can also use your own species tree information, as long as it is in the same format and all the species in your gene trees are present.
  • The property file describes where the input is located, where the output should go, and what format each is in. Default formats are available and are described in the example property file, prop.xml.
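
For readers preparing their own species tree files, the following Python sketch shows how the names.dmp and nodes.dmp layout can be read. It relies on the NCBI taxonomy dump convention that fields are separated by a tab, a vertical bar, and a tab, and that each line ends with a tab and a vertical bar. The function names are hypothetical, and the sample lines are shortened toy records (real nodes.dmp lines carry more columns, and the parent links here are simplified), so this is an illustration, not how Softparsmap reads the files.

```python
def split_dmp(line):
    """Split one line of an NCBI taxonomy .dmp file into its fields."""
    line = line.rstrip("\r\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return [f.strip() for f in line.split("\t|\t")]


def read_parents(nodes_lines):
    """Map each tax_id to its parent tax_id (None for the root)."""
    parent = {}
    for line in nodes_lines:
        if not line.strip():
            continue
        fields = split_dmp(line)
        tax_id, parent_id = fields[0], fields[1]
        # In nodes.dmp the root is listed as its own parent.
        parent[tax_id] = None if parent_id == tax_id else parent_id
    return parent


def read_scientific_names(names_lines):
    """Map each tax_id to its scientific name, ignoring synonyms."""
    names = {}
    for line in names_lines:
        if not line.strip():
            continue
        tax_id, name, _unique_name, name_class = split_dmp(line)[:4]
        if name_class == "scientific name":
            names[tax_id] = name
    return names


def lineage(tax_id, parent):
    """Return the tax_ids from `tax_id` up to the root."""
    path = []
    while tax_id is not None:
        path.append(tax_id)
        tax_id = parent[tax_id]
    return path


# Toy records for illustration only (columns truncated, hierarchy simplified).
NODES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "9605\t|\t1\t|\tgenus\t|\n",
    "9606\t|\t9605\t|\tspecies\t|\n",
]
NAMES = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "9605\t|\tHomo\t|\t\t|\tscientific name\t|\n",
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
]

parent = read_parents(NODES)
names = read_scientific_names(NAMES)
print([names[t] for t in lineage("9606", parent)])
```

Both readers accept any iterable of lines, so an open file handle can be passed in place of the toy lists. A check that every species in the gene trees has an entry in `names` is a natural next step before running the tool.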

Outputs: All input files except prop.xml are rewritten after running, including nodes.dmp, names.dmp, db.drw, java_index, all tree files and all info files

Literature: A. C. Berglund, P. J. Steffansson, M. J. Betts, and D. A. Liberles, Optimal gene trees from sequences and species trees using a soft interpretation of parsimony, Journal of Molecular Evolution (2006).

Related Tools: Multiple sequence alignments (MSA) were calculated using POA (Grasso and Lee 2004) and phylogenetic trees were built using MrBayes (Huelsenbeck and Ronquist 2001).

Discussions: There are two classes of ortholog prediction methods:

  • Phylogeny based: NOTUNG, Orthostrapper, RIO, Softparsmap, LOFT, Ensembl. These methods generally detect duplication and speciation events by comparing gene trees with the species tree. When tree reconciliation would otherwise associate a large number of duplication events with adaptive radiations, reconciliation methods that allow non-binary species trees (such as Softparsmap) should be used so that a large number of incorrect duplications are not inferred. Another advantage of Softparsmap is that it allows some level of uncertainty in both the gene trees and the species tree.
  • Pairwise based: BBH, COG, InParanoid, KOG, OrthoMCL, RSD, Multiparanoid, Roundup, OMA, eggNOG, Homologene; These methods take pairs of sequences, and from them construct a value that, in expectation, is additive under a stochastic model of site substitution. Most models assume a distribution of rates across sites, often based on a gamma distribution. Provided the (shape) parameter of this distribution is known, the method can correctly reconstruct the tree. However, if the shape parameter is not known then topologically different trees, with different shape parameters and associated positive branch lengths, can lead to exactly matching distributions on pairwise site patterns between all pairs of taxa. Thus, one could not distinguish between the two trees using pairs of sequences without some prior knowledge of the shape parameter. Importantly, this can happen for any choice of distinct shape parameters on the two trees, and thus the result is not peculiar to a particular or contrived selection of the shape parameters.

Comparisons: The following comparisons are collected from recent publications.

  • Softparsmap vs. NOTUNG: Softparsmap uses an algorithm that infers a rooted, binary gene tree given a rooted, possibly non-binary species tree and an unrooted, possibly non-binary gene tree as input. The method selects a rooted, binary resolution of the input gene tree by minimizing first duplications and then losses. This parsimony criterion is encoded in open-form expressions for the minimum number of duplications and losses; however, no algorithm is given for calculating these quantities. In addition, the expression for losses does not determine the species in which the losses occurred or the possible assignment of these losses to edges in the gene tree. Rather, it provides an estimated number of losses and is not guaranteed to find the optimal loss assignments. A similarity between Softparsmap and NOTUNG is that both methods use set-based mappings between the gene and species trees. In NOTUNG there are two such mappings, N and N_not, whose size is bounded above by k_S. The “M-mapping” proposed in Softparsmap is equivalent to N in NOTUNG, but Softparsmap has no equivalent to N_not. Instead, Softparsmap uses a set Z that is bounded by V_S and resembles the inefficient solution proposed and rejected when identifying duplications.
    Ref: Benjamin Vernot, Maureen Stolzer, Aiton Goldman, and Dannie Durand, Reconciliation with Non-Binary Species Trees, Journal of Computational Biology, Volume 15, Number 8, 2008.
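
To make the "minimize duplications" criterion concrete, here is a minimal Python sketch of the standard lowest-common-ancestor (LCA) mapping for the simple case of a binary gene tree and a binary species tree. It is neither the Softparsmap M-mapping nor the NOTUNG algorithm for non-binary species trees, and it ignores losses entirely. The species tree, the internal node names "mammals" and "vertebrates", and the function names are hypothetical and chosen for illustration.

```python
# Hypothetical species tree stored as child -> parent (root has parent None).
SPECIES_PARENT = {
    "human": "mammals",
    "mouse": "mammals",
    "mammals": "vertebrates",
    "fish": "vertebrates",
    "vertebrates": None,
}


def ancestors(node, parent):
    """Return the path from `node` up to the root, starting with `node`."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def lca(a, b, parent):
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def map_and_count(gene_tree, parent):
    """Return (species node the gene-tree root maps to, duplication count).

    A leaf is the name of the species the gene comes from; an internal
    node is a 2-tuple of subtrees (rooted, binary gene tree).
    """
    if isinstance(gene_tree, str):
        return gene_tree, 0
    left, right = gene_tree
    m_left, d_left = map_and_count(left, parent)
    m_right, d_right = map_and_count(right, parent)
    m = lca(m_left, m_right, parent)
    # A node is a duplication if it maps to the same species node as a child.
    dup = 1 if m in (m_left, m_right) else 0
    return m, d_left + d_right + dup


def best_rooting(candidates, parent):
    """Pick the candidate rooting with the fewest inferred duplications."""
    return min(candidates, key=lambda tree: map_and_count(tree, parent)[1])


# The three possible rootings of an unrooted tree with one gene per species.
CANDIDATES = [
    (("human", "mouse"), "fish"),
    (("human", "fish"), "mouse"),
    (("mouse", "fish"), "human"),
]

for tree in CANDIDATES:
    print(tree, map_and_count(tree, SPECIES_PARENT)[1])
print("best:", best_rooting(CANDIDATES, SPECIES_PARENT))
```

Only the first rooting agrees with the species tree and needs no duplications; each of the other two needs one. A fuller treatment would break ties by the number of losses, as the criterion described above requires.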