PHYLIP_NJ_Example
Purpose
This is an example workflow that demonstrates how to use PHYLIP components to infer and plot a phylogenetic tree from a multiple sequence alignment, using the neighbor-joining method. It is also to demonstrate how to run this program in non-interactive mode, the first step to programmatic wrapping. This work flow covers the fundamentals; using PHYLIP in this way is more for prototyping or for one-off analyses. Integration into a high level application or batch processing will involve more abstraction.
Prerequisites
- Access to a unix/linux shell
- A multiple sequence alignment file from CLUSTALW (or some other source).
- This work flow assumes that you have the BioPerl libraries and the PHYLIP binary executables compiled/installed on you system.
The DNA sequences
Original data
- We are working with actin coding sequences described in the CLUSTALW workflow example.
- View the "raw" FASTA file
Multiple sequence alignment format
The input format for DNA sequences is described in the PHYLIP documentation.
- The starting point is aligned sequences in CLUSTALW format.
- One easy way to convert this to PHYLIP (aka Felsenstein) format is with a BioPerl script:
#!/usr/bin/perl -w use strict; use Bio::AlignIO; my $infile = shift || die "Usage ./clustal2phylip.pl clustal_file\n"; -e $infile || die "Input file $infile does not exist!\n"; my $in = Bio::AlignIO->new( -file => $infile, -format => 'clustalw'); my $out = Bio::AlignIO->new( -format => 'phylip'); my $aln = $in->next_aln; $out->write_aln($aln);
- This would be run with this incantation
$ ./clustal2phylip.pl actin.aln >actin.fel
- Below is an excerpt from actin.fel:
6 1134 c_elegans ATGTGTGACG ACGA---GGT TGCCGCTCTT GTTGTAGACA ATGGATCCGG AATGTGCAAG c_briggsae ATGTGTGACG ACGA---GGT TGCAGCTCTC GTAGTGGACA ATGGCTCCGG AATGTGCAAG b_xylophil ATGTGTGACG AAGA---AGT TGCCGCTCTT GTTGTGGACA ATGGCTCCGG TATGTGCAAA c_oncophor ATGTGTGACG ACGA---GGT TGCTGCTCTT GTGGTTGACA ATGGATCCGG AATGTGCAAA p_magellan ATGTGTGACG ACGA---GGT AGCAGCTTTA GTAGTAGACA ATGGCTCCGG TATGTGCAAG s_salar ATGTGTGACG ACGACGAGAC TACTGCTCTT GTGTGCGACA ACGGCTCCGG CCTTGTGAAG GCCGGATTCG CCGGAGACGA CGCTCCACGC GCCGTGTTCC CATCCATTGT CGGAAGACCA GCCGGATTTG CCGGAGACGA TGCTCCACGC GCCGTCTTCC CATCCATCGT TGGACGCCCA GCCGGTTTCG CCGGAGATGA TGCCCCACGT GCCGTCTTCC CCTCCATTGT CGGAAGACCC GCCGGATTTG CCGGAGATGA CGCTCCTCGA GCTGTCTTCC CCTCCATCGT CGGCCGACCC GCCGGGTTCG CCGGAGACGA TGCTCCACGC GCTGTGTTCC CCTCCATTGT TGGAAGGCCC GCTGGCTTCG CCGGTGATGA CGCACCCAGG GCTGTCTTCC CCTCCATCGT TGGTCGCCCT
Alternate method
- The above script is but one example of the many ways to get to a phylip format multiple sequence alignment file.
- Another example is the script run_clustal.pl to do the alignment on the original fasta file and write the output directly to phylip format.
$ ./run_clustal actin.fa actin.fel phylip
- PHYLIP formatted multiple sequence alignment data can also be directly output from the CLUSTALW program (menu/keystroke: 2. Multiple Alignments -> 9. Output format options -> 4. Toggle PHYLIP format output)
Calculating the distant matrix
The PHYLIP program DNADIST is used to pre-process the sequence data into a DNA distance matrix.
DNADIST Menu-driven interface
- PHYLIP programs all use a command line menu
- Below is the menu for DNADIST
Nucleic acid sequence Distance Matrix program, version 3.69 Settings for this run: D Distance (F84, Kimura, Jukes-Cantor, LogDet)? F84 G Gamma distributed rates across sites? No T Transition/transversion ratio? 2.0 C One category of substitution rates? Yes W Use weights for sites? No F Use empirical base frequencies? Yes L Form of distance matrix? Square M Analyze multiple data sets? No I Input sequences interleaved? Yes 0 Terminal type (IBM PC, ANSI, none)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes Y to accept these or type the letter for one to change
- We will use the default parameters
1) Call the program2) Enter the file name at the prompt$ dnadist
3) Enter 'Y' at the menu to calculate the matrix using the default parametersdnadist: can't find input file "infile" Please enter a new file name>actin.fel
4) Save your work. The matrix will be saved to 'outfile', which you should rename to something else to clear the path for subsequent PHYLIP programs and also to save an audit trail that will not be accidentally overwritten.$ mv outfile actin.dist
- The resulting distance matrix looks like this:
6 c_elegans 0.000000 0.058113 0.108458 0.148962 0.154771 0.209534 c_briggsae 0.058113 0.000000 0.110660 0.168952 0.149338 0.201134 b_xylophil 0.108458 0.110660 0.000000 0.163680 0.130987 0.185266 c_oncophor 0.148962 0.168952 0.163680 0.000000 0.195618 0.206068 p_magellan 0.154771 0.149338 0.130987 0.195618 0.000000 0.175117 s_salar 0.209534 0.201134 0.185266 0.206068 0.175117 0.000000
Distance matrices in PHYLIP
Documentation of inputs for distance-based methods in PHYLIP can be found here
DNADIST Non-interactive mode
- As with all PHYLIP programs, you can bypass the menu by creating a text file with the commands and options listed in the sequence you would enter them via the prompts and menu:
actin.fel Y
- Then use this incantation:
$ dnadist < dnadist_commands.txt $ mv outfile actin.dist
Making the Neighbor Joining Tree
The PHYLIP implementation of the Neighbor-Joining algorithm is in the program NEIGHBOR
NEIGHBOR Menu-driven interface
Neighbor-Joining/UPGMA method version 3.69 Settings for this run: N Neighbor-joining or UPGMA tree? Neighbor-joining O Outgroup root? No, use as outgroup species 1 L Lower-triangular data matrix? No R Upper-triangular data matrix? No S Subreplicates? No J Randomize input order of species? No. Use input order M Analyze multiple data sets? No 0 Terminal type (IBM PC, ANSI, none)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Y to accept these or type the letter for one to change
- we are using the defaults here
1) Launch NEIGHBOR2) Input sequence$ neighbor
3) Enter 'Y'neighbor: can't find input file "infile" Please enter a new file name> actin.dist
4) Save your work, this time there is an outfile and a treefile$ mv outfile actin.nj.out $ mv outtree actin.nj.tree
cleanup
For low-level use of PHYLIP programs, be sure to remove or rename 'outfile' and 'outtree', or you will be prompted to overwrite them interactively, which screws up the non-interactive use described below
- The output file looks like this:
6 Populations Neighbor-Joining/UPGMA method version 3.69 Neighbor-joining method Negative branch lengths allowed +-c_briggsae ! ! +--b_xylophil 1-3 ! ! +-----c_oncophor ! +-4 ! ! +---p_magellan ! +-2 ! +------s_salar ! +-c_elegans remember: this is an unrooted tree! Between And Length ------- --- ------ 1 c_briggsae 0.03010 1 3 0.02968 3 b_xylophil 0.05082 3 4 0.00966 4 c_oncophor 0.09688 4 2 0.01641 2 p_magellan 0.06789 2 s_salar 0.10723 1 c_elegans 0.02801
- Note that the tree is unrooted.
- The tree file looks like this:
(c_briggsae:0.03010,(b_xylophil:0.05082,(c_oncophor:0.09688, (p_magellan:0.06789,s_salar:0.10723):0.01641):0.00966):0.02968,c_elegans:0.02801);
- This is Newick format, with branch lengths
NEIGHBOR Non-interactive mode
- Because we are using defaults, the sequence of inputs is simple
actin.dist Y
- Which is run like so:
$ neighbor < neighbor_commands.txt $ mv outfile actin.nj.out $ mv outtree actin.nj.tree
Viewing the tree
We will use the PHYLIP program DRAWTREE to render the tree
- Note if you have not done so already, copy one of the font files from the source code distribution (font1, font2, etc) to 'fontfile' in your working directory.
- The menu for DRAWTREE is shown below.
Unrooted tree plotting program version 3.69 Here are the settings: 0 Screen type (IBM PC, ANSI)? ANSI P Final plotting device: Postscript printer V Previewing device: X Windows display B Use branch lengths: Yes L Angle of labels: branch points to Middle of label R Rotation of tree: 90.0 I Iterate to improve tree: Equal-Daylight algorithm D Try to avoid label overlap? No S Scale of branch length: Automatically rescaled C Relative character height: 0.3333 F Font: Times-Roman M Horizontal margins: 1.65 cm M Vertical margins: 2.16 cm # Page size submenu: one page per tree Y to accept these or type the letter for one to change
- The options are documented here. We work out the display parameters we like through trial and error, end end up with a file like so:
outtree 0 P W 1024 768 V N R 45 L F 0 Y
- annotated:
Toggle off onscreen displayPlotting device (file format), windows bitmap, 1024 x 768 pixels0
Set preview to 'None'P W 1024 768
Set angle of rotation to 45 degreesV N
Set label angles to fixed, horizontalR 45
and 'Y' to run.L F 0
$ drawtree < drawtree-commands.txt $ mv plotfile actin.nj.tree.bmp
- and this is the tree, rendered as an unrooted network
Wrappers for PHYLIP
- There is limited support for running some PHYLIP components via BioPerl.
- Below is a generic perl script to run a PHYLIP program using a command file.
#!/usr/bin/perl -w # run_phylip.pl use strict; my $program = shift; my $command = shift; my $outfile = shift; my $outtree = shift; $program && $command && $outfile || die "Usage: ./run_phylip.pl program command_file output_file [output_tree]\n"; # get rid of previous work unlink "infile" if -e "infile"; unlink "intree" if -e "intree"; unlink "outtree" if -e "outtree"; unlink "outfile" if -e "outfile"; # run the specified program with command file system "$program < $command > $program.out"; system "mv outfile $outfile"; system "mv outtree $outtree" if $outtree; print "Done. Outfile saved as $outfile. Program output saved as '$program.out'\n";