DNAML_bootstrap_Workflow
Purpose
The purpose of this document is to demonstrate how doing bootstrap replicate analysis with PHYLIP's SEQBOOT, DNAML and CONSENSE could be decomposed for use on Condor.
Assumptions:
- Perl installed
- PHYLIP installed
Alignment file
The starting point is an alignment of nine mitochondrial genomes. The alignments were done with MUSCLE.
The alignment file was then converted into PHYLIP format.
DNAML
The program used for the basic phylogenetic inference operation is DNAML. It is an implementation of the maximum likelihood method described here
PHYLIP programs all have interactive menus but, for convenience, I use a little perl wrapper that accepts as arguments the program name (DNAML), a command file to be passed via STDIN and output file name(s)
#!/usr/bin/perl -w #run_phylip.pl use strict; my $program = shift; my $command = shift; my $outfile = shift; my $outtree = shift; $program && $command && $outfile || die "Usage: ./run_phylip.pl program command_file output_file [output_tree]\n"; # get rid of previous work unlink "infile" if -e "infile"; unlink "intree" if -e "intree"; unlink "outtree" if -e "outtree"; unlink "outfile" if -e "outfile"; # run the specified program with command file system "$program < $command > $program.out"; system "mv outfile $outfile"; system "mv outtree $outtree" if $outtree; print "Done. Outfile saved as $outfile. Program output saved as '$program.out'\n";
- The command file is simple, it just says the name of the alignment file and 'Y' to proceed
fish_aligned.fel Y
- The DNAML run would be invoked as follows (using time)
$ time ./run_phylip.pl dnaml command.txt results.txt results.tre Done. Outfile saved as results.txt. Program output saved as 'dnaml.out' real 0m6.088s user 0m6.059s sys 0m0.029s
- So a single run takes about six seconds.
Bootstrap replicates
The thing that makes it a bit more complicated is the requirement to do bootstrap replicate analysis.
- Consider doing doing 2000 replicates, that would be about 12,000 seconds or > 3 hours. This is the part that could be paralellized
- The program used to create the 2000 replicate data sets is SEQBOOT. The command file for this is:
fish_aligned.fel R 2000 Y 777
This encodes a command to do 2000 bootstrap replicates using 777 as a random number string.
It is run with the incantation:
$ ./run_phylip.pl seqboot command.txt replicates.txt
dividing up the work (This is the bit where Condor would come in).
The replicate data file uses this header, we can use it as a delimiter for splitting
9 17913 C_lavaretu ---------- ---------- ---------- ---------- ---------- ---------- Ssalar CCTTCA---- ---------- ---------- ------GTAA AATTTAAACA AAAAAATTGT Ogorbuscha CCCCATATTT TTCCACCAAT CTTTAATCCG CGAATTATTT TTCCCAAACT TTTTGATTGT Omasou CCCCTAA--- ---------- ---------- ------A-AA AATTTAAACT TTTTGATTGT Onerka CCCCACA--- ---------- ---------- ------A-AA AACCCAAACT TTTTGATTGT Oketa CCCCATA--- ---------- ---------- ------ATAA AATTTAAACT TTTTGATTGT Otshawytsc CCCCGTA--- ---------- ---------- ------ATAA AATTTAAACT TTTTGATTGT Omykiss ---------- ---------- ---------- ---------- ---------- ---------- Oclarki CCCCATA--- ---------- ---------- ------ATAA AATTTAAACA ATTTGATTGT ... etc
Now we have to split up the one big file with 2000 data sets to 2000 data files. The script below will do this:
#!/usr/bin/perl -w # split.pl -- split big replicate file into one replicate/file use strict; -d 'split' || mkdir('split'); my $idx; open OUT, ">./split/".++$idx or die $!; while (<>) { if (/^\s+\d+/) { close OUT; open OUT, ">./split/".++$idx or die $!; print OUT $_; print STDERR "Replicate $idx \r"; } else { print OUT $_; } }
- We now have a "split" sub-directory with 2000 replicate files in it. This would be deployed in parallel and the tree files would need to be re-consolidated after.
After Condor
We will now have run 2000 instances of DNAML on the 2000 files. There will be 2000 trees (each named 'outtree' by DNAML).
- The trees can be consolidated into one file that would look like this (this example has 10 trees):
(Ssalar:0.04898,(Omasou:0.03914,((Otshawytsc:0.00409, Oketa:0.00432):0.02963,((Ogorbuscha:0.04560,Onerka:0.03225):0.01396, (Omykiss:0.02449,Oclarki:0.02578):0.01332):0.00409):0.00670):0.03735, C_lavaretu:0.10258); (Ssalar:0.04777,(Omasou:0.03462,(((Onerka:0.02829, Ogorbuscha:0.04473):0.01513,(Omykiss:0.02278,Oclarki:0.02603):0.01420):0.00480, (Otshawytsc:0.00500,Oketa:0.00427):0.02871):0.00963):0.03944, C_lavaretu:0.09908); (Ssalar:0.04738,(((Omykiss:0.02303,Oclarki:0.02558):0.01479, ((Ogorbuscha:0.04929,Onerka:0.02981):0.01347,(Otshawytsc:0.00454, Oketa:0.00469):0.02989):0.00413):0.00763,Omasou:0.03291):0.03470, C_lavaretu:0.10251); (((Oketa:0.00435,Otshawytsc:0.00591):0.02830,((Omasou:0.03957, (Omykiss:0.02261,Oclarki:0.02665):0.01460):0.00387, (Onerka:0.02864,Ogorbuscha:0.04717):0.01277):0.00669):0.03765, Ssalar:0.05094,C_lavaretu:0.10451); ((Omasou:0.03245,(((Omykiss:0.02312,Oclarki:0.02885):0.01412, (Ogorbuscha:0.04532,Onerka:0.02934):0.01328):0.00573, (Otshawytsc:0.00492,Oketa:0.00445):0.02820):0.00973):0.03989, Ssalar:0.04698,C_lavaretu:0.10134); (((Omykiss:0.02173,Oclarki:0.02647):0.01142,((Onerka:0.03011, Ogorbuscha:0.04546):0.01317,((Otshawytsc:0.00417, Oketa:0.00458):0.02683,Omasou:0.03915):0.00582):0.00400):0.04298, Ssalar:0.04537,C_lavaretu:0.09640); (Ssalar:0.04449,((((Otshawytsc:0.00505,Oketa:0.00525):0.02775, Omasou:0.03917):0.00684,(Ogorbuscha:0.04631,Onerka:0.03077):0.01193):0.00449, (Oclarki:0.02821,Omykiss:0.02270):0.01167):0.04288, C_lavaretu:0.09818); (Ssalar:0.04860,(Omasou:0.03679,((Onerka:0.02849, Ogorbuscha:0.04435):0.01337,((Omykiss:0.02336, Oclarki:0.02535):0.01430,(Otshawytsc:0.00540,Oketa:0.00455):0.03091):0.00409):0.00787):0.03759, C_lavaretu:0.10208); (((((Onerka:0.03149,Ogorbuscha:0.04389):0.01369, Omasou:0.03719):0.00494,(Omykiss:0.02325,Oclarki:0.02560):0.01416):0.00625, (Oketa:0.00442,Otshawytsc:0.00471):0.02498):0.04066, Ssalar:0.04512,C_lavaretu:0.10274); (Ssalar:0.04654,(((Omykiss:0.01983,Oclarki:0.02605):0.01525, (Omasou:0.04124,(Ogorbuscha:0.04070,Onerka:0.02885):0.01212):0.00426):0.00684, (Otshawytsc:0.00476,Oketa:0.00523):0.02523):0.03550, C_lavaretu:0.09941);
Now, we can take the majority rule consensus tree (again using the example above) using the program CONSENSE.
The command file:
treefile.txt y
where treefile.txt is the consolidated trees.
$ ./run_phylip.pl consense command.txt consensus.txt consensus.tree Done. Outfile saved as consensus.txt. Program output saved as 'consense.out'
and the consensus file looks like:
Consensus tree program, version 3.69 Species in order: 1. Ssalar 2. Omasou 3. Otshawytsc 4. Oketa 5. Ogorbuscha 6. Onerka 7. Omykiss 8. Oclarki 9. C lavaretu Sets included in the consensus tree Set (species in order) How many times out of 10.00 .*******. 10.00 ..**..... 10.00 ......**. 10.00 ....**... 10.00 ..******. 5.00 ....****. 3.00 Sets NOT included in consensus tree: Set (species in order) How many times out of 10.00 .*..****. 3.00 .*****... 2.00 .***..... 2.00 .*..**... 2.00 ..****... 1.00 .*....**. 1.00 ..**..**. 1.00 Extended majority rule consensus tree CONSENSUS TREE: the numbers on the branches indicate the number of times the partition of the species into the two sets which are separated by that branch occurred among the trees, out of 10.00 trees +-------Oclarki +--10.0-| | +-------Omykiss +--3.00-| | | +-------Onerka | +--10.0-| +--5.00-| +-------Ogorbuscha | | | | +-------Otshawytsc +--10.0-| +----------10.0-| | | +-------Oketa +-------| | | | +-------------------------------Omasou | | | +---------------------------------------C lavaretu | +-----------------------------------------------Ssalar remember: this is an unrooted tree!