PHYLIP_NJ_Example
Purpose
This is an example workflow that demonstrates how to use PHYLIP components to infer and plot a phylogenetic tree from a multiple sequence alignment, using the neighbor-joining method. It is also to demonstrate how to run this program in non-interactive mode, the first step to programmatic wrapping. This work flow covers the fundamentals; using PHYLIP in this way is more for prototyping or for one-off analyses. Integration into a high level application or batch processing will involve more abstraction.
Prerequisites
Access to a unix/linux shell
A multiple sequence alignment file from CLUSTALW (or some other source).
This work flow assumes that you have the BioPerl libraries and the PHYLIP binary executables compiled/installed on you system.
The DNA sequences
Original data
We are working with actin coding sequences described in the CLUSTALW workflow example.
View the "raw" FASTA file
Multiple sequence alignment format
The input format for DNA sequences is described in the PHYLIP documentation.
The starting point is aligned sequences in CLUSTALW format.
One easy way to convert this to PHYLIP (aka Felsenstein) format is with a BioPerl script:
#!/usr/bin/perl -w use strict; use Bio::AlignIO; my $infile = shift || die "Usage ./clustal2phylip.pl clustal_file\n"; -e $infile || die "Input file $infile does not exist!\n"; my $in = Bio::AlignIO->new( -file => $infile, -format => 'clustalw'); my $out = Bio::AlignIO->new( -format => 'phylip'); my $aln = $in->next_aln; $out->write_aln($aln);This would be run with this incantation
$ ./clustal2phylip.pl actin.aln >actin.felBelow is an excerpt from actin.fel:
6 1134 c_elegans ATGTGTGACG ACGA---GGT TGCCGCTCTT GTTGTAGACA ATGGATCCGG AATGTGCAAG c_briggsae ATGTGTGACG ACGA---GGT TGCAGCTCTC GTAGTGGACA ATGGCTCCGG AATGTGCAAG b_xylophil ATGTGTGACG AAGA---AGT TGCCGCTCTT GTTGTGGACA ATGGCTCCGG TATGTGCAAA c_oncophor ATGTGTGACG ACGA---GGT TGCTGCTCTT GTGGTTGACA ATGGATCCGG AATGTGCAAA p_magellan ATGTGTGACG ACGA---GGT AGCAGCTTTA GTAGTAGACA ATGGCTCCGG TATGTGCAAG s_salar ATGTGTGACG ACGACGAGAC TACTGCTCTT GTGTGCGACA ACGGCTCCGG CCTTGTGAAG GCCGGATTCG CCGGAGACGA CGCTCCACGC GCCGTGTTCC CATCCATTGT CGGAAGACCA GCCGGATTTG CCGGAGACGA TGCTCCACGC GCCGTCTTCC CATCCATCGT TGGACGCCCA GCCGGTTTCG CCGGAGATGA TGCCCCACGT GCCGTCTTCC CCTCCATTGT CGGAAGACCC GCCGGATTTG CCGGAGATGA CGCTCCTCGA GCTGTCTTCC CCTCCATCGT CGGCCGACCC GCCGGGTTCG CCGGAGACGA TGCTCCACGC GCTGTGTTCC CCTCCATTGT TGGAAGGCCC GCTGGCTTCG CCGGTGATGA CGCACCCAGG GCTGTCTTCC CCTCCATCGT TGGTCGCCCTCALCULATING THE DISTANT MATRIX
DNADIST Menu-driven interface
PHYLIP programs all use a command line menu
Below is the menu for DNADIST
Nucleic acid sequence Distance Matrix program, version 3.69 Settings for this run: D Distance (F84, Kimura, Jukes-Cantor, LogDet)? F84 G Gamma distributed rates across sites? No T Transition/transversion ratio? 2.0 C One category of substitution rates? Yes W Use weights for sites? No F Use empirical base frequencies? Yes L Form of distance matrix? Square M Analyze multiple data sets? No I Input sequences interleaved? Yes 0 Terminal type (IBM PC, ANSI, none)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes Y to accept these or type the letter for one to changeWe will use the default parameters
1) Call the program$ dnadist2) Enter the file name at the prompt
dnadist: can't find input file "infile" Please enter a new file name>actin.fel3) Enter 'Y' at the menu to calculate the matrix using the default parameters
4) Save your work. The matrix will be saved to 'outfile', which you should rename to something else to clear the path for subsequent PHYLIP programs and also to save an audit trail that will not be accidentally overwritten.$ mv outfile actin.distThe resulting distance matrix looks like this:
6 c_elegans 0.000000 0.058113 0.108458 0.148962 0.154771 0.209534 c_briggsae 0.058113 0.000000 0.110660 0.168952 0.149338 0.201134 b_xylophil 0.108458 0.110660 0.000000 0.163680 0.130987 0.185266 c_oncophor 0.148962 0.168952 0.163680 0.000000 0.195618 0.206068 p_magellan 0.154771 0.149338 0.130987 0.195618 0.000000 0.175117 s_salar 0.209534 0.201134 0.185266 0.206068 0.175117 0.000000DNADIST Non-interactive mode
As with all PHYLIP programs, you can bypass the menu by creating a text file with the commands and options listed in the sequence you would enter them via the prompts and menu:
actin.fel YThen use this incantation:
$ dnadist < dnadist_commands.txt $ mv outfile actin.dist
Making the Neighbor Joining Tree
The PHYLIP implementation of the Neighbor-Joining algorithm is in the program NEIGHBOR
NEIGHBOR Menu-driven interface
Neighbor-Joining/UPGMA method version 3.69
Settings for this run:
N Neighbor-joining or UPGMA tree? Neighbor-joining
O Outgroup root? No, use as outgroup species 1
L Lower-triangular data matrix? No
R Upper-triangular data matrix? No
S Subreplicates? No
J Randomize input order of species? No. Use input order
M Analyze multiple data sets? No
0 Terminal type (IBM PC, ANSI, none)? ANSI
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
Y to accept these or type the letter for one to change
we are using the defaults here
1) Launch NEIGHBOR$ neighbor2) Input sequence
neighbor: can't find input file "infile" Please enter a new file name> actin.dist3) Enter 'Y'
4) Save your work, this time there is an outfile and a treefile$ mv outfile actin.nj.out $ mv outtree actin.nj.tree
The output file looks like this:
6 Populations Neighbor-Joining/UPGMA method version 3.69 Neighbor-joining method Negative branch lengths allowed +-c_briggsae ! ! +--b_xylophil 1-3 ! ! +-----c_oncophor ! +-4 ! ! +---p_magellan ! +-2 ! +------s_salar ! +-c_elegans remember: this is an unrooted tree! Between And Length ------- --- ------ 1 c_briggsae 0.03010 1 3 0.02968 3 b_xylophil 0.05082 3 4 0.00966 4 c_oncophor 0.09688 4 2 0.01641 2 p_magellan 0.06789 2 s_salar 0.10723 1 c_elegans 0.02801Note that the tree is unrooted.
The tree file looks like this:
(c_briggsae:0.03010,(b_xylophil:0.05082,(c_oncophor:0.09688, (p_magellan:0.06789,s_salar:0.10723):0.01641):0.00966):0.02968,c_elegans:0.02801);This is Newick format, with branch lengths