PHYLIP_Documentation
Metadata Considerations
- PHYLIP input formats are very simple and only allow 10 characters for each species/taxon label.
- To work with taxon labels longer than 10 characters without data loss, we will need to:
1) track metadata to link truncated or simplified taxon labels with the original, full-length labels
2) ensure that the 10-character taxon labels are unique within the file
Note: we can modify PHYLIP to use longer taxon names! See here
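The two requirements above (a lookup from short labels back to the originals, and uniqueness of the short labels) can be sketched in Python. This is only an illustration under stated assumptions: the function name `make_short_labels`, the numeric-suffix scheme for resolving collisions, and the sample taxon names are all hypothetical and not part of PHYLIP.

```python
def make_short_labels(names, width=10):
    """Map each full taxon name to a unique label of at most `width` characters."""
    mapping = {}   # full name -> short label
    used = set()   # short labels already handed out
    for name in names:
        if name in mapping:
            continue  # same taxon listed twice: keep its existing label
        label = name[:width]
        counter = 1
        # On a collision, replace the tail of the label with a counter
        # (hypothetical scheme; any scheme works if the mapping is saved).
        while label in used:
            suffix = str(counter)
            label = name[:width - len(suffix)] + suffix
            counter += 1
        used.add(label)
        mapping[name] = label
    return mapping


# Made-up taxon names, for illustration only.
names = ["Homo_sapiens_sapiens", "Homo_sapiens_neanderthalensis", "Pan"]
mapping = make_short_labels(names)

# The reverse mapping is the metadata needed to restore the full labels
# in PHYLIP output.
reverse = {short: full for full, short in mapping.items()}
```

When writing the PHYLIP file, each short label would still need to be padded with blanks to the full name width, for example with `label.ljust(10)`.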
File and program documentation
- Documentation for PHYLIP programs and their inputs/outputs comes as HTML files with the distribution and is also available from the PHYLIP website.
Some topical examples of file format specs and program documentation:
- The formats we will need to worry about most for the first iteration are the inputs for the CONTRAST method. The format specifications are documented here.
- Format definitions taken from the CONTRAST documentation (© Copyright 1991-2008 by the University of Washington. Written by Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.):
The tree data are in Newick format:
The character data format below allows both interspecies and intraspecies variation:
The character data format below is for interspecies variation (character states are either single observation or mean value for the species/population):
- Descriptions of some PHYLIP input files for DNADIST, NEIGHBOR and CONTRAST, and an introduction to using PHYLIP programs, can be found in PHYLIP_CONTRAST_Example and PHYLIP_NJ_Example.
Inventory of sample data for PHYLIP
DNA Sequence
actin.fel, a multiple sequence alignment file in interleaved format. This file is suitable for input into molecular sequence-based methods, such as DNADIST, DNAML, and DNAPARS. The format specification is described here.
Distance Matrices
actin.dist, a DNA distance matrix, calculated using DNADIST, suitable for input into distance-based methods such as NEIGHBOR. Details on the generation of this file are covered in PHYLIP_NJ_Example. Format specification is here.
Continuous Character Data
PDAP.fel, a two-character data set for 49 mammalian species. This file is suitable for use in CONTRAST. The format specification is described above and here.
50K.continuous.fel, a synthetic data set for 50000 species, suitable for use in CONTRAST. Provenance of the original data is described in 50K_Synthetic_Data. Conversion to PHYLIP format is described in PHYLIP_CONTRAST_Example.
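As a rough illustration of the layout these continuous-character files follow (a header line giving the number of species and the number of characters, then one line per species with a fixed-width name field followed by the values), here is a minimal Python sketch. The function name `format_contrast`, the five-decimal number formatting, and the sample rows are assumptions made for the example, not taken from the files listed above.

```python
def format_contrast(rows, name_width=10):
    """Return PHYLIP continuous-character text for (label, values) rows."""
    if not rows:
        raise ValueError("no species given")
    n_chars = len(rows[0][1])
    # Header line: number of species, then number of characters.
    lines = [f"{len(rows):5d}{n_chars:5d}"]
    for label, values in rows:
        if len(label) > name_width:
            raise ValueError(f"label too long for name field: {label}")
        if len(values) != n_chars:
            raise ValueError(f"{label} has {len(values)} values, expected {n_chars}")
        # Name padded with blanks to the fixed width, then the values.
        fields = " ".join(f"{v:.5f}" for v in values)
        lines.append(label.ljust(name_width) + fields)
    return "\n".join(lines) + "\n"


# Made-up species and values, for illustration only.
rows = [("Alpha", [1.5, 2.25]), ("Beta", [0.5, 3.0]), ("Gamma", [2.0, 1.0])]
text = format_contrast(rows)
```

Passing `name_width=30` would match a PHYLIP build modified for longer names; stock PHYLIP expects 10.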
Tree Data
PHYLIP tree data format is the same as Newick format.
actin.nj.treefile, the tree resulting from the phylogenetic analysis described in PHYLIP_NJ_Example. This format is also suitable as input data for PHYLIP programs that consume trees.
50K_final_newick.tre, a 50000-taxon synthetic tree provided by Brian Omeara. This tree has been tested with CONTRAST.
crimson50Knewick.tre, a 50000-taxon synthetic tree provided by Val Tannen.
mult_treefile_example.fel, an example of a PHYLIP treefile with multiple trees.
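A treefile with multiple trees is a sequence of Newick strings, each terminated by a semicolon, so it can be split into individual trees with a few lines of Python. This is a sketch under two assumptions: labels are unquoted (so they contain no semicolons or whitespace), and the sample trees below are invented rather than taken from mult_treefile_example.fel. The function name `split_trees` is hypothetical.

```python
def split_trees(text):
    """Split treefile text into individual Newick strings, one per tree."""
    trees = []
    for chunk in text.split(";"):
        # A tree may wrap over several lines, so drop all whitespace.
        # This assumes unquoted labels, which cannot contain blanks.
        newick = "".join(chunk.split())
        if newick:
            trees.append(newick + ";")
    return trees


# Two made-up trees; the second one wraps across lines.
sample = "((A:1.0,B:1.0):0.5,C:1.5);\n((A:1.0,C:1.0):0.5,\nB:1.5);\n"
trees = split_trees(sample)
```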
Shore bird data set
Considerations
- The fifth character in the trait data is non-numeric. I convert it to numeric in the script below, but it would likely be best to emit a warning as well.
- The tree has a few minor polytomies. A warning would be good, but I think the analysis should run anyway.
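One way to produce the polytomy warning is to count the children of every internal node while scanning the Newick string. The Python sketch below assumes unquoted labels and no bracketed comments in the tree; the function name `find_polytomies` and the sample tree are hypothetical. Note that a three-way split at the root (depth 0) is the usual way to write an unrooted tree, so it may not deserve a warning.

```python
import sys


def find_polytomies(newick):
    """Return (depth, n_children) for every node with more than two children."""
    stack = []   # child counts of the nodes that are currently open
    found = []
    for ch in newick:
        if ch == "(":
            stack.append(1)        # a new node starts with its first child
        elif ch == ",":
            if not stack:
                raise ValueError("comma outside parentheses")
            stack[-1] += 1         # one more child for the innermost node
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')'")
            n_children = stack.pop()
            if n_children > 2:
                # len(stack) is now the depth of the closed node (root = 0).
                found.append((len(stack), n_children))
    if stack:
        raise ValueError("unbalanced '('")
    return found


# Made-up tree: the clade (A,B,C) is a three-way polytomy.
tree = "((A:1,B:1,C:1):1,(D:1,E:1):1);"
polytomies = find_polytomies(tree)
for depth, n_children in polytomies:
    print(f"warning: node at depth {depth} has {n_children} children",
          file=sys.stderr)
```

The check only reports; the analysis can still be run afterwards.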
Files
shorebirds.txt
shorebirds.fel
shorebirds.tree.fel
Conversion
- The script below consumes shorebirds.txt and converts it to correct PHYLIP format.
- Note that the Perl script is not the important part; the PHYLIP file format is.
- Note also that the Perl script is specific to this file and is not generally applicable.
perl fix_shorebirds.pl shorebirds.txt >shorebirds.fel
Note: I am using a 30-character limit for species names (see here).
#!/usr/bin/perl -w
# Convert shorebirds.txt into a phylip file for CONTRAST
use strict;

my (%seen,$idx,%idx,@output,$traits);
my $treefile = 'shorebirds.tree.fel';

while (<>) {
  next if /Species/; # No header line!
  my ($taxon,undef,@traits) = split;
  $taxon && @traits > 0 || next;
  $traits ||= @traits;
  $traits == @traits || die "$taxon has ".scalar(@traits)." traits. Should have $traits\n";

  # The last trait value is actually discrete and non-numeric
  # we convert it here to numeric (maybe should delete it?)
  $idx{$traits[-1]} ||= ++$idx;
  $traits[-1] = $idx{$traits[-1]};

  my $label = (length $taxon) < 30 ? sprintf('%-30s',$taxon) : substr $taxon, 0, 30;
  if ($seen{$label}++) {
    $label =~ s/\S$/$seen{$label}/;
  }
  (my $unpadded_label = $label) =~ s/\s+$//;

  s/$taxon\s+/$label/;
  `perl -i -pe 's/$taxon/$unpadded_label/' $treefile`;

  push @output, $label . join("\t",@traits);
}

print " " . scalar(@output) . " $traits\n";
print join("\n", @output), "\n";
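The recoding of the non-numeric trait can also be written so that it emits the warning suggested under Considerations. The Python sketch below mirrors the script's approach of assigning integer codes in first-seen order. The function name `encode_non_numeric` and the sample state names are hypothetical and are not the actual values in shorebirds.txt.

```python
import sys


def encode_non_numeric(column):
    """Replace non-numeric values with integer codes in first-seen order."""
    codes = {}     # state name -> integer code
    encoded = []
    for value in column:
        try:
            encoded.append(float(value))   # numeric values pass through
        except ValueError:
            if value not in codes:
                codes[value] = len(codes) + 1
            encoded.append(codes[value])
    if codes:
        print(f"warning: recoded {len(codes)} non-numeric states: {codes}",
              file=sys.stderr)
    return encoded, codes


# Made-up discrete states, for illustration only.
column = ["low", "high", "low", "medium"]
encoded, codes = encode_non_numeric(column)
```

Keeping the returned `codes` dictionary alongside the output file records how the numeric values map back to the original states.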