Prototype Phase 1

The detailed instructions I got from Sheldon were to get a test set of amino acide guided multiple dna alignments, using NCBI taxonomy tree in Newick format as species tree and run treebest. Once gene trees are generated from treebest, work out how to make a reconciled tree with PrIMETV. Here is the progress I made so far and the problems I encountered.

1. Using gramene gene tree database, dumped 1000 amino acid guided multi-sequence dna alignments saved as .mfa files

2. Get NCBI taxonomy tree from Andrew, the species names are binomial names such as "Azorhizobium_caulinodans".

3. Run treebest with this ncbi-taxonomy tree and a randomly chosen mfa file 1028.mfa

treebest best -f ../data/ncbi-taxonomy.nwk 1028e1.binomial.mfa
ERROR: syntax error in line 0, near (Rickettsia_sp.Digas,0)
Segmentation fault

     It seems treebest have difficulties interprete the special character '&' in the ncbi taxonomy file

perl -nle 'while (/.{100}Rickettsia_sp\._Digas.{100}/gc){ print "$&\n";}' ../data/ncbi-taxonomy.nwk
_a_bunchy_top_disease_rickettsia,Rickettsia_sp.La_Copita,Rickettsia_sp.MOAa,Rickettsia_sp._WB-8-2,Rickettsia_sp._Digas&__Belikov_1999,Rickettsia_sp.midichlorii,male-killing_Rickettsia_from_Adalia_decempunctata,Ricke_

     According to treebest documentation, in mfa file, species are recognized from the sequence names with underline `_' as the separator, here is the sequence names of tested mfa file

grep ">" 1028e1.binomial.mfa
>1028-123133_Populus_trichocarpa 3694/1-612
>1028-217511_Arabidopsis_thaliana 3702/1-1704
>1028-35170_Drosophila_melanogaster 7227/1-8007
>1028-57382_Drosophila_melanogaster 7227/1-2439
>1028-127008_Vitis_vinifera 29760/1-1227
>1028-398505_Oryza_sativa 39947/1-1533
>1028-27646_Saccharomyces_cerevisiae 4932/1-1317
>1028-303129_Homo_sapiens 9606/1-2511
>1028-210503_Arabidopsis_lyrata 59689/1-1704
>1028-25635_Sorghum_bicolor 4558/1-1548
>1028-123127_Populus_trichocarpa 3694/1-231
>1028-305701_Homo_sapiens 9606/1-456
>1028-57790_Caenorhabditis_elegans 6239/1-2301
>1028-140663_Caenorhabditis_elegans 6239/1-6798
>1028-310933_Homo_sapiens 9606/1-453
>1028-191699_Brachypodium_distachyon 15368/1-1521
>1028-328764_Zea_mays 4577/1-1545
>1028-194478_Homo_sapiens 9606/1-2496
>1028-116416_Ciona_intestinalis 7719/1-2415
>1028-211084_Ciona_intestinalis 7719/1-2220
>1028-343613_Populus_trichocarpa 3694/1-1350
>1028-126979_Vitis_vinifera 29760/1-978
>1028-311355_Homo_sapiens 9606/1-9138
>1028-475528_Oryza_indica 39946/1-1524
>1028-211056_Ciona_intestinalis 7719/1-2151
>1028-126996_Vitis_vinifera 29760/1-567

4. more treebest testings

       4.1. succeeded with gramene species tree gramene_taxid.nwk        

        (((((7719*,9606*)7711,7227*)33316,6239*)33213,4932*)33154,(((((39947*,39946*)4530,4533,65489,4535,4538,4537,4529,63629,4536)4527,15368*)359160,(4577*,4558*)147429)4479,(((59689*,3702*)3701,3694*)71275,29760*,4081*)91827)3398)2759; 

treebest best -f gramene_taxid.nwk 1028.mfa > 1028.nhx
Large distance encountered between 1028-57382_7227 and 1028-123133_3694  sequences         
Large distance encountered between 1028-127008_29760 and  1028-123133_3694 sequences         
Large distance encountered between 1028-27646_4932 and  1028-123133_3694 sequences         
Large distance encountered between 1028-303129_9606 and  1028-123133_3694 sequences         
Large distance encountered between 1028-123127_3694 and  1028-123133_3694 sequences         
Large distance encountered between 1028-305701_9606 and  1028-123133_3694 sequences2

         ...........    

        resulting 1028.nhx new hampshire extended tree

less 1028.nhx
(((1028-211056_7719:3.69269[&&NHX:E=$-9606-7227-6239:S=7719],
(((1028-211084_7719:0.596758[&&NHX:S=7719],
((1028-305701_9606:0.057505[&&NHX:S=9606],
1028-310933_9606:0.008573[&&NHX:S=9606]
):0.049416[&&NHX:D=Y:SIS=100:S=9606:B=100],
1028-311355_9606:0.025949[&&NHX:S=9606]
):0.388599[&&NHX:D=Y:SIS=100:S=9606:B=100]
):0.151399[&&NHX:D=N:S=7711:B=9],
1028-35170_7227:0.317028[&&NHX:S=7227]
):0.058103[&&NHX:D=N:S=33316:B=9],
1028-140663_6239:1.08681[&&NHX:S=6239]
):0.557148[&&NHX:D=N:S=33213:B=95]
):0[&&NHX:D=Y:SIS=25:E=$-4932:S=33213:B=0],
1028-123133_3694:2.74796[&&NHX:E=$-3701-29760-4081-4479:S=3694]
):0[&&NHX:D=N:S=2759:B=0],
(((((1028-116416_7719:0.320463[&&NHX:S=7719],
(1028-194478_9606:0.161617[&&NHX:S=9606],
1028-303129_9606:0.197453[&&NHX:S=9606]
):0.072651[&&NHX:D=Y:SIS=100:S=9606:B=96]
):0.035653[&&NHX:D=N:S=7711:B=95],
1028-57382_7227:0.415493[&&NHX:S=7227]
):0.148275[&&NHX:D=N:S=33316:B=94],
1028-57790_6239:0.69674[&&NHX:S=6239]
):0.194249[&&NHX:D=N:S=33213:B=82],
1028-27646_4932:0.499687[&&NHX:S=4932]
):0.136025[&&NHX:D=N:S=33154:B=32],
((((1028-398505_39947:0.000478[&&NHX:S=39947],
1028-475528_39946:0.044926[&&NHX:S=39946]
):0.057235[&&NHX:D=N:S=4530:B=100],
1028-191699_15368:0.076968[&&NHX:S=15368]
):0.017095[&&NHX:D=N:S=359160:B=87],
(1028-328764_4577:0.031159[&&NHX:S=4577],
1028-25635_4558:0.023845[&&NHX:S=4558]
):0.045985[&&NHX:D=N:S=147429:B=100]
):0.119085[&&NHX:D=N:S=4479:B=87],
(((1028-210503_59689:0.012642[&&NHX:S=59689],
1028-217511_3702:0.016721[&&NHX:S=3702]
):0.158796[&&NHX:D=N:S=3701:B=100],
(1028-123127_3694:0.0348[&&NHX:S=3694],
1028-343613_3694:0.027917[&&NHX:S=3694]
):0.050347[&&NHX:D=Y:SIS=100:S=3694:B=76]
):0.032792[&&NHX:D=N:S=71275:B=27],
((1028-126979_29760:0.189976[&&NHX:S=29760],
1028-126996_29760:0.357541[&&NHX:S=29760]
):0.05758[&&NHX:D=Y:SIS=100:S=29760:B=68],
1028-127008_29760:0.267072[&&NHX:S=29760]
):0.015254[&&NHX:D=Y:SIS=100:S=29760:B=0]
):0.086747[&&NHX:D=N:E=$-4081:S=91827:B=0]
):0.160108[&&NHX:D=N:S=3398:B=1]
):0.702063[&&NHX:D=N:S=2759:B=21]
)[&&NHX:D=Y:SIS=36:S=2759:B=0];

      4.2. failed with binomial named species tree gramene_binomial_taxname.nwk:

(((((Ciona_intestinalis,Homo_sapiens)7711,Drosophila_melanogaster)33316,Caenorhabditis_elegans)33213,Saccharomyces_cerevisiae)33154,(((((Oryza_sativa,Oryza_indica)4530,Oryza_brachyantha,Oryza_barthii,Oryza_officinalis,Oryza_glaberrima,Oryza_punctata,Oryza_rufipogon,Oryza_minuta,Oryza_nivara)4527,Brachypodium_distachyon)359160,(Zea_mays,Sorghum_bicolor)147429)4479,(((Arabidopsis_lyrata,Arabidopsis_thaliana)3701,Populus_trichocarpa)71275,Vitis_vinifera,Solanum_lycopersicum)91827)3398)2759;

treebest best -f gramene_binomial_taxname.nwk 1028e1.binomial.mfa > 1028e1.gramene.binomial.nhx
......
Large distance encountered between 1028-211056_Ciona_intestinalis and 1028-126979_Vitis_vinifera sequences
Large distance encountered between 1028-211056_Ciona_intestinalis and 1028-311355_Homo_sapiens sequences
Large distance encountered between 1028-211056_Ciona_intestinalis and 1028-475528_Oryza_indica sequences
Large distance encountered between 1028-126996_Vitis_vinifera and 1028-211056_Ciona_intestinalis sequences
Segmentation fault

ls -ltr 1028e1.gramene.binomial.nhx
-rw-rw-r-- 1 weix iplant 0 Apr  1 14:43 1028e1.gramene.binomial.nhx

     4.3 succeeded with Swiss-Prot encoded gramene species tree, where the length of each species name is no longer than 5 characters, gramene_swisspro_taxname.nwk:

(((((Cinte,Hsapi)7711,Dmela)33316,Celeg)33213,Scere)33154,(((((Osati,Oindi)4530,Obrac,Obart,Ooffi,Oglab,Opunc,Orufi,Ominu,Oniva)4527,Bdist)359160,(Zmays,Sbico)147429)4479,(((Alyra,Athal)3701,Ptric)71275,Vvini,Slyco)91827)3398)2759;

treebest best -f gramene_swisspro_taxname.nwk 1028e1.swisspro.mfa > 1028e1.gramene.swissprot.nhx
Large distance encountered between 1028-57382_Dmela and 1028-123133_Ptric sequences
Large distance encountered between 1028-127008_Vvini and 1028-123133_Ptric sequences
Large distance encountered between 1028-27646_Scere and 1028-123133_Ptric sequences
Large distance encountered between 1028-303129_Hsapi and 1028-123133_Ptric sequences
Large distance encountered between 1028-123127_Ptric and 1028-123133_Ptric sequences
Large distance encountered between 1028-305701_Hsapi and 1028-123133_Ptric sequences
....
 cat  1028e1.gramene.swissprot.nhx
(((((1028-211084_Cinte:0.610544[&&NHX:S=Cinte],
((1028-305701_Hsapi:0.057957[&&NHX:S=Hsapi],
1028-310933_Hsapi:0.008621[&&NHX:S=Hsapi]
):0.049534[&&NHX:D=Y:SIS=100:S=Hsapi:B=98],
1028-311355_Hsapi:0.026399[&&NHX:S=Hsapi]
):0.389161[&&NHX:D=Y:SIS=100:S=Hsapi:B=98]
):0.154374[&&NHX:D=N:S=7711:B=11],
1028-35170_Dmela:0.319704[&&NHX:S=Dmela]
):0.048093[&&NHX:D=N:S=33316:B=11],
1028-140663_Celeg:1.12291[&&NHX:S=Celeg]
):0.573014[&&NHX:D=N:S=33213:B=93],
(((((1028-116416_Cinte:0.319152[&&NHX:S=Cinte],
(1028-194478_Hsapi:0.162389[&&NHX:S=Hsapi],
1028-303129_Hsapi:0.198621[&&NHX:S=Hsapi]
):0.076384[&&NHX:D=Y:SIS=100:S=Hsapi:B=96]
):0.038966[&&NHX:D=N:S=7711:B=95],
1028-57382_Dmela:0.417076[&&NHX:S=Dmela]
):0.151526[&&NHX:D=N:S=33316:B=95],
1028-57790_Celeg:0.701653[&&NHX:S=Celeg]
):0.20182[&&NHX:D=N:S=33213:B=78],
1028-27646_Scere:0.497212[&&NHX:S=Scere]
):0.176098[&&NHX:D=N:S=33154:B=32],
((((((1028-398505_Osati:0.000465[&&NHX:S=Osati],
1028-475528_Oindi:0.045167[&&NHX:S=Oindi]
):0.058351[&&NHX:D=N:S=4530:B=100],
1028-191699_Bdist:0.076116[&&NHX:S=Bdist]
):0.016896[&&NHX:D=N:S=359160:B=88],
(1028-328764_Zmays:0.030863[&&NHX:S=Zmays],
1028-25635_Sbico:0.024308[&&NHX:S=Sbico]
):0.046267[&&NHX:D=N:S=147429:B=100]
):0.220042[&&NHX:D=N:S=4479:B=88],
(1028-343613_Ptric:0.060576[&&NHX:S=Ptric],
(1028-126979_Vvini:0.212332[&&NHX:S=Vvini],
1028-126996_Vvini:0.355476[&&NHX:S=Vvini]
):0.073755[&&NHX:D=Y:SIS=100:S=Vvini:B=59]
):0.028647[&&NHX:D=N:S=91827:B=1]
):0.01255[&&NHX:D=N:S=3398:B=0],
(1028-210503_Alyra:0.012498[&&NHX:S=Alyra],
1028-217511_Athal:0.016996[&&NHX:S=Athal]
):0.154389[&&NHX:D=N:S=3701:B=100]
):0.000276[&&NHX:D=Y:SIS=0:DD=Y:S=3398:B=0],
(1028-123127_Ptric:0.024917[&&NHX:S=Ptric],
1028-127008_Vvini:0.292555[&&NHX:S=Vvini]
):0.019968[&&NHX:D=N:S=91827:B=23]
):0.155583[&&NHX:D=Y:SIS=22:S=3398:B=1]
):0.698507[&&NHX:D=N:S=2759:B=14]
):0.891635[&&NHX:D=Y:SIS=29:S=2759:B=85],
(1028-211056_Cinte:2.8126[&&NHX:S=Cinte],
1028-123133_Ptric:1.89922[&&NHX:S=Ptric]
):0[&&NHX:D=N:S=2759:B=88]
)[&&NHX:D=Y:SIS=14:S=2759:B=0];

5. Summarizing the previous treebest testings, it looks like treebest is very strict with the format of species name, which should be similar to Swiss-prot name with no more than 5 char. This is hard for ncbi-taxonomy since it has tens of thousands of species. To encode all those species name into 5 char space, one solution I could think of is using hexidecimal encoded taxonomy Ids.  In the mean time, I learned Stephen Smith has used NCBI taonomy tio generate species tree before (using biosql) and is developing a new method to generate newick tree from ncbi taxonomy file. And he has promised to tell me more later.

6. Moving on to PrIMETV. In summary, I have installed it on a local sandbox, couldn't succeed in running reconcile on command line even with really simple data (copied from PrIMETV manual). 

     Installation is a little tricky, first plotutils 2.4.1 is too outdated to compile on a 64-bit machine with a 64-bit compiler. The Solution is copying over both config.guess and config.sub from plotutils-2.6 to plotutils-2.4.1 before you do     ./configure --enable-libplotter. The 2nd issue is you need to do type conversion in primetv.cc,  change line 354 from  DrawTree_time dt(sv, gv, format, outfile);  to  DrawTree_time dt(sv, gv,(char *)format, outfile);

 reconcile -h
Usage: reconcile <gene tree> <species tree> [<gene-species map>]
This program outputs the reconciliation that minimizethe number of duplications


reconcile G2mod.nhx S2mod.nw G2S2_species_mapping
Error:
    TreeIO::extendBeepTree(...):
    No branch length info found either in 'BL' or 'NW'

 cat G2mod.nhx S2mod.nw G2S2_species_mapping
((Q99L54_MOUSE,Q9BVK4_HUMAN),RLA0_YEAST);
((MOUSE:0.0110,HUMAN:0.0110):0.1466,YEAST:0.1576);
Q99L54_MOUSE    MOUSE
Q9BVK4_HUMAN    HUMAN
RLA0_YEAST      YEAST


I have emailed the author Lars Arvestad (arve@csc.kth.se) asking for some example data files that worked on command line execution of reconcile to diagnose the problem (data issue or software issue), but haven't heard back from him.