Tree Reconciliation Workflow

Purpose

  • This page describes the workflow that goes from gene clusters (families), expressed both as coding sequences and as the translated amino acid sequences in FASTA format, through to gene trees and reconciled gene/species trees.
  • It is assumed that the sequences are either full open reading frames (ORFs) or at least in-frame partial ORFs that can be translated into amino acid sequences.
  • It is further assumed that muscle, treebest, and primetv are installed.
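The in-frame assumption can be checked before any alignment is attempted. The following is a minimal sketch, assuming the standard genetic code and plain A/C/G/T input; the function names translate and is_translatable are hypothetical and are not part of muscle, treebest, or primetv.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame coding sequence; unknown codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("length %d is not a multiple of 3" % len(cds))
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )


def is_translatable(cds):
    """True if the sequence is in frame and has no internal stop codon."""
    try:
        protein = translate(cds)
    except ValueError:
        return False
    # A stop codon is tolerated only as the final codon.
    return "*" not in protein[:-1]
```

Sequences that fail this check are likely candidates for the length mismatch problems that show up later in the workflow, so it may be worth setting them aside before aligning.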

Steps

Multiple sequence alignments

Amino acids

The program muscle (http://www.ebi.ac.uk/Tools/muscle/index.html), which claims to be faster and more accurate than CLUSTALW, is used to align the amino acid sequences.

The command line incantation is like so:

muscle unaligned_AA.fa > aligned_AA.mfa 

NOTE: There are BioPerl wrappers for running muscle and for handling multiple sequence alignments, but they threw too many fatal exceptions due to a few weird sequences (length mismatch issues). I used direct system calls and file I/O in the script described below.
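For illustration, the direct system call approach might look like the sketch below. This is not the script referred to in the note; it is a Python stand-in, and the helper names run_to_file and align_family are hypothetical. The argument list mirrors the muscle command line shown above, and it may need adjusting for the muscle release that is installed.

```python
import subprocess


def run_to_file(command, output_path):
    """Run a command, write its stdout to output_path, and raise on failure."""
    with open(output_path, "w") as out:
        result = subprocess.run(
            command, stdout=out, stderr=subprocess.PIPE, text=True
        )
    if result.returncode != 0:
        raise RuntimeError(
            "%s failed: %s" % (command[0], result.stderr.strip())
        )
    return output_path


def align_family(unaligned_aa, aligned_aa):
    """Equivalent of: muscle unaligned_AA.fa > aligned_AA.mfa"""
    # Assumption: muscle is on the PATH and accepts this argument form.
    return run_to_file(["muscle", unaligned_aa], aligned_aa)
```

Because a failure surfaces as an ordinary exception for one family, a calling loop can log the offending file and carry on with the remaining families rather than aborting the whole run.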

Nucleotides

treebest (http://treesoft.sourceforge.net/treebest.shtml) accepts only DNA alignments as input, so we now use the aligned amino acid sequences to back-translate to a nucleotide sequence alignment. treebest has a back-translation function (backtrans) for this. Deriving the DNA alignment from the protein alignment makes sure the codons align properly.

The command line incantation:

treebest backtrans aligned_AA.mfa unaligned_DNA.fa > aligned_DNA.mfa 
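To show what back-translation does conceptually, here is a minimal sketch that threads the codons of one unaligned coding sequence onto the matching gapped protein row. It is not treebest's implementation, which may treat stop codons, ambiguity codes, and mismatches differently; the function name thread_codons is hypothetical, and gaps are assumed to be written as "-".

```python
def thread_codons(aligned_protein, cds):
    """Map an ungapped CDS onto a gapped protein row, one codon per residue."""
    residues = sum(1 for ch in aligned_protein if ch != "-")
    if len(cds) < 3 * residues:
        raise ValueError(
            "CDS has %d bases but %d residues need %d"
            % (len(cds), residues, 3 * residues)
        )
    pieces = []
    pos = 0
    for ch in aligned_protein:
        if ch == "-":
            # One protein gap becomes a three-base gap, keeping the frame.
            pieces.append("---")
        else:
            pieces.append(cds[pos:pos + 3])
            pos += 3
    # Bases beyond the last residue (e.g. a trailing stop codon) are dropped.
    return "".join(pieces)
```

For example, thread_codons("M-A", "ATGGCC") gives "ATG---GCC": every gap in the protein alignment is three bases wide in the DNA alignment, so codon boundaries line up across all sequences in the family.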

Building the gene tree

We also use treebest for this.

STILL WORKING ON THIS