Tree Reconciliation Workflow
Purpose
- This page describes the workflow that goes from gene clusters (families), supplied in FASTA format as both coding sequences and their translated amino acid sequences, through to gene trees and reconciled gene/species trees.
- It is assumed that the sequences are either full open reading frames (ORFs) or at least in-frame partial ORFs that can be translated into amino acid sequences.
- It is further assumed that muscle, treebest, and primetv are installed.
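The in-frame assumption can be checked before any alignment is attempted. The following is a minimal sketch, not part of the original workflow: it assumes the standard genetic code, and the function names translate and check_cds are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the sequences use the standard code, not a mitochondrial or other variant).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence whose length is a multiple of 3."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    # Codons containing ambiguous bases (e.g. N) are translated as X.
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )


def check_cds(cds):
    """Return a list of problems that would break the in-frame assumption."""
    problems = []
    remainder = len(cds) % 3
    if remainder:
        problems.append("length is not a multiple of 3")
    # Translate only the complete codons, then look for premature stops.
    protein = translate(cds[:len(cds) - remainder])
    if "*" in protein[:-1]:
        problems.append("internal stop codon")
    return problems
```

A sequence that yields an empty list is a reasonable candidate for the steps below; anything else is worth inspecting by hand first.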
Steps
Multiple sequence alignments
Amino acids
The program muscle (http://www.ebi.ac.uk/Tools/muscle/index.html), which claims to be faster and more accurate than CLUSTALW, is used to align the amino acid sequences.
The command line incantation is like so:
muscle unaligned_AA.fa > aligned_AA.mfa
NOTE: There are BioPerl wrappers for running muscle and for handling multiple sequence alignments, but they threw too many fatal exceptions due to a few weird sequences (length mismatch issues). I used direct system calls and file I/O in the script described below.
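For readers who prefer to script the call, a direct system call with the output redirected to a file might look roughly like the sketch below. This is an illustration in Python, not the script referred to in the note; run_to_file and muscle_command are hypothetical helper names, and the muscle arguments simply mirror the command line shown above (argument syntax may differ between muscle versions).

```python
import subprocess
from pathlib import Path


def run_to_file(cmd, out_path):
    """Run cmd (a list of arguments) and write its standard output to out_path."""
    out_path = Path(out_path)
    with out_path.open("w") as out:
        # check=True raises CalledProcessError if the program exits non-zero,
        # so a failed alignment is not silently passed on to the next step.
        subprocess.run(cmd, stdout=out, check=True)
    return out_path


def muscle_command(unaligned_fasta):
    """Build the argument list for the muscle call used on this page."""
    # Assumption: same arguments as the command line above, nothing added.
    return ["muscle", str(unaligned_fasta)]


# Example (requires muscle on the PATH):
# run_to_file(muscle_command("unaligned_AA.fa"), "aligned_AA.mfa")
```

Passing the command as a list avoids shell quoting problems with unusual file names, and the explicit file handle takes the place of the shell redirect.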
Nucleotides
We can now use the aligned amino acid sequences to back-translate to a nucleotide sequence alignment. treebest (http://treesoft.sourceforge.net/treebest.shtml) has a backtrans function for this. Back-translating from the protein alignment makes sure the codons stay aligned, since gaps are only ever placed between whole codons. The DNA alignment is needed because treebest accepts only DNA alignments as input.
The command line incantation:
treebest backtrans aligned_AA.mfa unaligned_DNA.fa > aligned_DNA.mfa
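To make the idea concrete, here is a minimal sketch of what back-translation does for a single sequence. It is an illustration only, not treebest's implementation; the function name backtranslate is made up, and the sketch assumes the unaligned DNA contains exactly one codon per aligned residue (no trailing stop codon).

```python
def backtranslate(aligned_aa, cds, gap="-"):
    """Thread an unaligned coding sequence onto a gapped protein sequence.

    Each residue consumes the next codon from cds; each gap column
    becomes three gap characters, so codons are never split.
    """
    n_residues = sum(1 for ch in aligned_aa if ch != gap)
    if len(cds) != 3 * n_residues:
        # This is the kind of length mismatch mentioned in the note above.
        raise ValueError(
            f"expected {3 * n_residues} nucleotides, got {len(cds)}"
        )
    codons = []
    pos = 0
    for ch in aligned_aa:
        if ch == gap:
            codons.append(gap * 3)
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)
```

For example, the aligned protein M-A with the coding sequence ATGGCC gives ATG---GCC: the single gap column in the protein becomes a three-column gap in the DNA.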
Building the gene tree
We also use treebest for this.
STILL WORKING ON THIS