PHYLIP_NJ_Example

PHYLIP_NJ_Example

Purpose

This is an example workflow that demonstrates how to use PHYLIP components to infer and plot a phylogenetic tree from a multiple sequence alignment, using the neighbor-joining method. It is also to demonstrate how to run this program in non-interactive mode, the first step to programmatic wrapping. This work flow covers the fundamentals; using PHYLIP in this way is more for prototyping or for one-off analyses. Integration into a high level application or batch processing will involve more abstraction.

Prerequisites

  • Access to a unix/linux shell

  • A multiple sequence alignment file from CLUSTALW (or some other source).

  • This work flow assumes that you have the BioPerl libraries and the PHYLIP binary executables compiled/installed on you system.

The DNA sequences

Original data

Multiple sequence alignment format

The input format for DNA sequences is described in the PHYLIP documentation.

  • The starting point is aligned sequences in CLUSTALW format.

  • One easy way to convert this to PHYLIP (aka Felsenstein) format is with a BioPerl script:

    #!/usr/bin/perl -w use strict; use Bio::AlignIO; my $infile = shift || die "Usage ./clustal2phylip.pl clustal_file\n"; -e $infile || die "Input file $infile does not exist!\n"; my $in = Bio::AlignIO->new( -file => $infile, -format => 'clustalw'); my $out = Bio::AlignIO->new( -format => 'phylip'); my $aln = $in->next_aln; $out->write_aln($aln);
  • This would be run with this incantation

    $ ./clustal2phylip.pl actin.aln >actin.fel
  • Below is an excerpt from actin.fel:

    6 1134 c_elegans ATGTGTGACG ACGA---GGT TGCCGCTCTT GTTGTAGACA ATGGATCCGG AATGTGCAAG c_briggsae ATGTGTGACG ACGA---GGT TGCAGCTCTC GTAGTGGACA ATGGCTCCGG AATGTGCAAG b_xylophil ATGTGTGACG AAGA---AGT TGCCGCTCTT GTTGTGGACA ATGGCTCCGG TATGTGCAAA c_oncophor ATGTGTGACG ACGA---GGT TGCTGCTCTT GTGGTTGACA ATGGATCCGG AATGTGCAAA p_magellan ATGTGTGACG ACGA---GGT AGCAGCTTTA GTAGTAGACA ATGGCTCCGG TATGTGCAAG s_salar ATGTGTGACG ACGACGAGAC TACTGCTCTT GTGTGCGACA ACGGCTCCGG CCTTGTGAAG GCCGGATTCG CCGGAGACGA CGCTCCACGC GCCGTGTTCC CATCCATTGT CGGAAGACCA GCCGGATTTG CCGGAGACGA TGCTCCACGC GCCGTCTTCC CATCCATCGT TGGACGCCCA GCCGGTTTCG CCGGAGATGA TGCCCCACGT GCCGTCTTCC CCTCCATTGT CGGAAGACCC GCCGGATTTG CCGGAGATGA CGCTCCTCGA GCTGTCTTCC CCTCCATCGT CGGCCGACCC GCCGGGTTCG CCGGAGACGA TGCTCCACGC GCTGTGTTCC CCTCCATTGT TGGAAGGCCC GCTGGCTTCG CCGGTGATGA CGCACCCAGG GCTGTCTTCC CCTCCATCGT TGGTCGCCCT

    CALCULATING THE DISTANT MATRIX

    DNADIST Menu-driven interface

  • PHYLIP programs all use a command line menu

  • Below is the menu for DNADIST

    Nucleic acid sequence Distance Matrix program, version 3.69 Settings for this run: D Distance (F84, Kimura, Jukes-Cantor, LogDet)? F84 G Gamma distributed rates across sites? No T Transition/transversion ratio? 2.0 C One category of substitution rates? Yes W Use weights for sites? No F Use empirical base frequencies? Yes L Form of distance matrix? Square M Analyze multiple data sets? No I Input sequences interleaved? Yes 0 Terminal type (IBM PC, ANSI, none)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes Y to accept these or type the letter for one to change
  • We will use the default parameters
    1) Call the program

    $ dnadist

    2) Enter the file name at the prompt

    dnadist: can't find input file "infile" Please enter a new file name>actin.fel

    3) Enter 'Y' at the menu to calculate the matrix using the default parameters
    4) Save your work. The matrix will be saved to 'outfile', which you should rename to something else to clear the path for subsequent PHYLIP programs and also to save an audit trail that will not be accidentally overwritten.

    $ mv outfile actin.dist
  • The resulting distance matrix looks like this:

    6 c_elegans 0.000000 0.058113 0.108458 0.148962 0.154771 0.209534 c_briggsae 0.058113 0.000000 0.110660 0.168952 0.149338 0.201134 b_xylophil 0.108458 0.110660 0.000000 0.163680 0.130987 0.185266 c_oncophor 0.148962 0.168952 0.163680 0.000000 0.195618 0.206068 p_magellan 0.154771 0.149338 0.130987 0.195618 0.000000 0.175117 s_salar 0.209534 0.201134 0.185266 0.206068 0.175117 0.000000

    DNADIST Non-interactive mode

  • As with all PHYLIP programs, you can bypass the menu by creating a text file with the commands and options listed in the sequence you would enter them via the prompts and menu:

    actin.fel Y
  • Then use this incantation:

    $ dnadist < dnadist_commands.txt $ mv outfile actin.dist

Making the Neighbor Joining Tree

The PHYLIP implementation of the Neighbor-Joining algorithm is in the program NEIGHBOR

NEIGHBOR Menu-driven interface

Neighbor-Joining/UPGMA method version 3.69 Settings for this run: N Neighbor-joining or UPGMA tree? Neighbor-joining O Outgroup root? No, use as outgroup species 1 L Lower-triangular data matrix? No R Upper-triangular data matrix? No S Subreplicates? No J Randomize input order of species? No. Use input order M Analyze multiple data sets? No 0 Terminal type (IBM PC, ANSI, none)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Y to accept these or type the letter for one to change
  • we are using the defaults here
    1) Launch NEIGHBOR

    $ neighbor

    2) Input sequence

    neighbor: can't find input file "infile" Please enter a new file name> actin.dist

    3) Enter 'Y'
    4) Save your work, this time there is an outfile and a treefile

    $ mv outfile actin.nj.out $ mv outtree actin.nj.tree
  • The output file looks like this:

    6 Populations Neighbor-Joining/UPGMA method version 3.69 Neighbor-joining method Negative branch lengths allowed +-c_briggsae ! ! +--b_xylophil 1-3 ! ! +-----c_oncophor ! +-4 ! ! +---p_magellan ! +-2 ! +------s_salar ! +-c_elegans remember: this is an unrooted tree! Between And Length ------- --- ------ 1 c_briggsae 0.03010 1 3 0.02968 3 b_xylophil 0.05082 3 4 0.00966 4 c_oncophor 0.09688 4 2 0.01641 2 p_magellan 0.06789 2 s_salar 0.10723 1 c_elegans 0.02801
  • Note that the tree is unrooted.

  • The tree file looks like this:

    (c_briggsae:0.03010,(b_xylophil:0.05082,(c_oncophor:0.09688, (p_magellan:0.06789,s_salar:0.10723):0.01641):0.00966):0.02968,c_elegans:0.02801);
  • This is Newick format, with branch lengths

NEIGHBOR Non-interactive mode