Progress_report

iPlant Tree of Life (iPToL) Grand Challenge Project Progress Report

In May 2009, the iPToL Grand Challenge Project kickoff meeting was held at NESCent to establish specific areas of focus and to develop a high-level implementation plan for the iPToL project. At this meeting, it was decided to divide the grand challenge into six focal areas, each corresponding to a specific working group. Leads and co-leads for each working group were recommended by consensus, and working group membership was drafted from both meeting participants and non-participating community members. During the kickoff meeting, support was expressed for two other possible collaborations: upgrading the Angiosperm Phylogeny Web Site (APWEB2) and the Botanical Information and Ecology Network (BIEN). By consensus, these were determined not to be within the scope of the core iPToL budget, but participants agreed that the efforts should be supported. In the fall of 2009, new collaborations were established with these groups, with additional funding added to the iPToL budget. The APWEB2 project is regarded as an Education, Outreach and Training (EOT) extension of the iPToL project. The BIEN collaboration has now become part of iPToL; it is meant as an expanded solution to the pervasive "taxonomic intelligence" problem (described below) and will also help to integrate ecological trait data with iPToL’s discovery environments.

Working Groups

Big Trees: Scaling up Phylogenetic Inference

Phylogenetic analysis at the level of several hundred thousand species represents a new scalability challenge that iPlant is addressing on two parallel tracks. The general approach is to optimize existing methods (Maximum Likelihood with RAxML and Neighbor-Joining with NINJA/WINDJAMMER). The current testing data set is a matrix of eight genes for 116K species provided by Stephen Smith. The need to use very large data matrices and to measure uncertainty with bootstrap replicate analysis makes scaling up to the ultimate goal of building phylogenetic trees for up to 500,000 species a formidable challenge, one that requires high performance computing (HPC) methods to perform and update the analysis in tractable amounts of time. Data for the full 500,000 green plants are not yet available; assembling them is the mission of the data assembly working group, described below. When the data become available, we wish to have the infrastructure already in place to cope with them. One area of concern in the big tree group was that outward communication has been limited. We have now established regular meetings between the TACC members of the big tree group and the external collaborators to improve communication and ensure user requirements are being met.

RAxML

RAxML was developed by Alexandros Stamatakis. John Cazes and B. D. Kim, members of the iPlant engagement team at TACC, have been working to scale up the application so that it can leverage the substantial TACC high performance computing resources.

The first major achievement of this work is the addition of check-pointing, which allows long-running processing jobs to be stopped and restarted without loss of data or compute time. Work is ongoing to improve the parallel implementation and to decrease overall run-time for large data sets. The version of RAxML currently used at TACC is the Pthreads (POSIX Threads) version. For optimal performance on TACC’s HPC platform, an MPI (Message Passing Interface) version must be improved. The approach being taken is to break RAxML down into components in order to identify and test the parts that require rewriting for MPI, rather than the less efficient approach of refactoring the entire program. In order to expedite this process, the RAxML group has added Frank Willmore, who is experienced in Pthreads.
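The general idea behind check-pointing can be sketched in a few lines. This is only an illustration of the stop-and-restart pattern, not RAxML's implementation; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format and the toy "job" are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interrupted write never corrupts the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path, default):
    """Return the saved state if a checkpoint exists, else the default."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return default


def run(path, total_steps, stop_after=None):
    """Toy long-running job: sums the step indices, checkpointing every step.

    stop_after simulates the job being stopped after that many steps.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    done = 0
    while state["step"] < total_steps:
        if stop_after is not None and done >= stop_after:
            return state  # stopped early; progress is already on disk
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
        done += 1
    return state
```

Calling `run(path, 10, stop_after=4)` and then `run(path, 10)` resumes from the fifth step rather than starting over, which is the behaviour that matters for jobs that run for days.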

The group is implementing MPI in the latest version of RAxML using version 7.04 (which supported MPI) as a guide. This implementation is being performed in incremental steps.

The first test case, A33, runs under both RAxML v7.04 and RAxML v7.2.7. By following the execution of A33 through the older code and the current code, the group can determine which of RAxML's many functions are used for this test case. MPI versions of these functions are being written to create a prototype MPI code that will run A33. The results of this prototype will then be compared with those of the current Pthreads code.

Regular meetings will begin in late October with developers John Cazes, B. D. Kim, and Frank Willmore, scientific collaborator Alexandros Stamatakis, product manager Michael Gonzales, Sheldon McKay and other interested parties.

NINJA/WindJammer

Neighbor-joining is a hierarchical clustering method for inferring phylogenies. NINJA, developed by Travis Wheeler, is a new tool that produces correct neighbor-joining trees much faster than the canonical algorithm. It is able to scale to inputs of the size needed for assembling the “tree of life” for all green plants, but would require months and a very large amount of disk storage space to do so on a single computer.
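For reference, the canonical neighbor-joining algorithm that NINJA accelerates can be sketched as follows. This is a minimal illustration, not NINJA's or WINDJAMMER's code; the function name `neighbor_joining` and the nested-tuple tree representation are assumptions made for the example.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Canonical O(n^3) neighbor-joining.

    names: list of taxon labels
    dist:  symmetric matrix (list of lists) of pairwise distances
    Returns (subtree_a, subtree_b, length_between_them); every internal
    subtree is a pair of (child, branch_length) tuples.
    """
    # Active clusters are keyed by integer id; each maps to its subtree so far.
    trees = {i: name for i, name in enumerate(names)}
    d = {i: {j: dist[i][j] for j in range(len(names)) if j != i}
         for i in range(len(names))}
    next_id = len(names)

    while len(trees) > 2:
        n = len(trees)
        # Row sums used by the Q criterion.
        r = {i: sum(d[i].values()) for i in trees}
        # Pick the pair minimising Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(combinations(sorted(trees), 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        # Branch lengths from the new internal node to i and to j.
        li = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        # Distances from the new node to every remaining cluster.
        new = {k: 0.5 * (d[i][k] + d[j][k] - d[i][j])
               for k in trees if k not in (i, j)}
        # Replace i and j by the merged cluster.
        subtree = ((trees[i], li), (trees[j], lj))
        for k in (i, j):
            del trees[k], d[k]
        for k in d:
            del d[k][i], d[k][j]
            d[k][next_id] = new[k]
        d[next_id] = new
        trees[next_id] = subtree
        next_id += 1

    # Join the last two clusters with the remaining distance.
    a, b = sorted(trees)
    return (trees[a], trees[b], d[a][b])
```

Each iteration scans all remaining pairs, so the work grows cubically with the number of taxa and the full distance matrix must stay accessible throughout. That is the cost that makes inputs of several hundred thousand taxa impractical without the kind of optimization described here.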

The goal of this collaboration is to build an MPI implementation of the core algorithm from NINJA, to allow biologists to analyze very large data sets in a matter of hours instead of months. The end product is a new implementation of this clustering method, code named WINDJAMMER and entirely re-written in C. The original version was in Java.

Progress on WINDJAMMER is promising. The original version of NINJA can run a data set of 218,348 species in approximately six days. In recent benchmarking, the MPI version of WINDJAMMER ran the same data set in approximately eight hours. This is a memory-intensive computation that requires formidable resources (over 2000 CPUs in this case) because all data must be held in RAM. TACC is well equipped to handle such cases.

More recent single-CPU benchmarking shows a modest speedup for WINDJAMMER over NINJA when only internal memory is used (WINDJAMMER does not use external memory). The test cases were limited to 20K taxa by the available RAM. Because it can use disk space to externalize memory, NINJA is more suitable for single-CPU use on larger data sets.

However, when multi-processor WINDJAMMER is benchmarked against NINJA (using externalized memory on a single CPU), WINDJAMMER is 23 times faster on the 53K-taxa case and 32 times faster on the 218K-taxa case. The difference is that WINDJAMMER runs strictly in RAM and in parallel, whereas NINJA runs on one processor and writes to disk.

Work is ongoing to increase efficiency and to add pre-processing of sequences into distance matrices, a capability that the original NINJA lacks and one of the key requirements of the collaboration.

Bi-weekly meetings have been established for Robert McLay (developer), Travis Wheeler (scientific collaborator), Michael Gonzales (product manager), Sheldon McKay and other interested parties. Travis and Robert are working together to develop an MPI approach to distance calculation for large sequence data sets.

An extended abstract describing WINDJAMMER and its performance has been submitted to the ICCABS meeting in Orlando, Florida, Feb. 3-5, 2011.

Update: Distance Matrix Calculation

Using the 218K case and the MPI implementation of Windjammer, it is now possible to read in the aligned sequence data (in FASTA format, 270 MB) and generate the distance matrix in parallel in 2 minutes (120.7 seconds). Writing out the generated matrix takes about 12 minutes, versus 15 minutes to read the matrix back in for tree building. Windjammer takes advantage of a Lustre file system for fast writing of the large matrix file to disk.
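The report does not describe how Windjammer divides the pairwise computation among processes. One common scheme, shown here purely as an illustrative assumption, is to hand out rows of the upper triangle cyclically, so that each process gets a mix of long and short rows. The function names are hypothetical; in an MPI program, `rank` and `size` would come from the communicator.

```python
def rows_for_rank(n_seqs, rank, size):
    """Rows of the upper triangle handled by one process (cyclic assignment)."""
    return range(rank, n_seqs, size)


def pairs_for_rank(n_seqs, rank, size):
    """Yield every (i, j) sequence pair, i < j, that this process computes."""
    for i in rows_for_rank(n_seqs, rank, size):
        for j in range(i + 1, n_seqs):
            yield i, j
```

Every pair is computed exactly once across all ranks, and the load stays roughly even when there are many more rows than processes.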

Windjammer can now compute the distance matrix directly from both protein and DNA sequence alignments. Previously, it took 2-3 days to generate the distance matrix for the 218K case with FASTTREE (http://www.microbesonline.org/fasttree/). The generated distance matrix required 400GB of disk storage.

Windjammer supports three pairwise distance calculations for DNA sequences: a basic method, and the Jukes-Cantor and Kimura two-parameter methods, which correct for multiple substitutions. Windjammer can also calculate protein distance matrices using the BLOSUM45 matrix.
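The two named corrections have standard textbook formulas, sketched below. This is not Windjammer's code. It assumes that the "basic" method is the uncorrected proportion of differing sites (p-distance), and that gaps and ambiguity codes are simply skipped; the function names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def _site_counts(a, b):
    """Count compared sites, transitions and transversions in two aligned sequences."""
    n = ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: skip gaps and ambiguity codes
        n += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    return n, ts, tv


def p_distance(a, b):
    """Uncorrected proportion of differing sites."""
    n, ts, tv = _site_counts(a, b)
    return (ts + tv) / n


def jukes_cantor(a, b):
    """Jukes-Cantor distance: d = -3/4 ln(1 - 4p/3)."""
    p = p_distance(a, b)
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(a, b):
    """Kimura two-parameter distance: d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)),
    with P the transition and Q the transversion proportion."""
    n, ts, tv = _site_counts(a, b)
    P, Q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
```

These functions raise an error when no sites can be compared or when the sequences are so divergent that the logarithm is undefined; a production implementation would need to decide how to report such pairs.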

Tree Reconciliation

The primary goal of the Tree Reconciliation (TR) working group, led by Todd Vision at NESCent, is to develop infrastructure that facilitates the reconciliation of gene trees and species trees and to make these tools and data available to plant scientists. This working group will also publish reconciliations generated as part of the development process.

Ongoing development in collaboration with the Tree Reconciliation working group and the Thousand Plant Transcriptome project (oneKP) is focused on using the existing phylogenetic tree to address the evolution of gene families in “host” species. Work is underway on an iPlant incubator project, described below, to provide a gene-species tree reconciliation service that will reconcile gene families being generated by the oneKP project with the green plant species phylogeny. The oneKP project is also a potential source of input data for the data assembly and integration working group's efforts to feed into the "big tree" analysis.

The tree reconciliation service prototype was transitioned to an incubator project in order to accelerate development and bring in team members with appropriate, specialized skill sets to work on different components concurrently. The tree reconciliation service will provide an interactive web portal in the iPlant discovery environment, through which scientists can explore gene clusters (families), gene trees and reconciled gene trees through a variety of entry points such as gene name searches, a BLAST interface and gene ontology (GO) term-based services. The first version will contain pre-computed reconciliations for over 2500 gene families in six exemplar species (poplar, grape, cucumber, papaya, soybean and Arabidopsis thaliana). Subsequent releases will include more species and the ability to perform reconciliations "on the fly".

Significant progress has been made with an analytical pipeline that performs multiple sequence alignments for the initial set of 2,541 gene clusters for six species: soybean, papaya, cucumber, poplar, grape and Arabidopsis thaliana. The pipeline then assembles gene trees and reconciles them with the known species tree.

The sequence alignments are performed using MUSCLE from the European Bioinformatics Institute. Then, species-tree-guided gene trees are built with TreeBeST (Figure 1). Finally, tree reconciliation and "fat tree" rendering are performed using PrIMETV (Figure 2). Sheldon McKay is also working on an HPC pipeline for species-tree-guided Bayesian analysis of gene trees using primeGSR.


Figure 1 A gene tree produced by TreeBeST using one of the oneKP gene clusters, rendered in PhyloWidget. Red nodes indicate duplication events; blue nodes indicate speciation events.


Figure 2 The gene tree reconciled with the species tree using PrIMETV
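The distinction between duplication and speciation nodes can be illustrated with the classic least-common-ancestor (LCA) mapping. This sketch is not how TreeBeST or PrIMETV work internally; the tree encoding (nested tuples with string leaves), the function names and the toy species labels are assumptions made for the example.

```python
def clades(tree):
    """Return (leaf set of this node, leaf sets of every node at or below it)."""
    if isinstance(tree, str):
        s = frozenset([tree])
        return s, [s]
    leaves, all_sets = frozenset(), []
    for child in tree:
        child_leaves, child_sets = clades(child)
        leaves |= child_leaves
        all_sets += child_sets
    all_sets.append(leaves)
    return leaves, all_sets


def reconcile(gene_tree, species_tree, gene_to_species):
    """Label each internal gene-tree node 'duplication' or 'speciation'.

    Returns a post-order list of (sorted species clade the node maps to, event).
    """
    _, species_sets = clades(species_tree)

    def lca(species):
        # Smallest species-tree clade containing all the given species.
        return min((c for c in species_sets if species <= c), key=len)

    events = []

    def visit(node):
        if isinstance(node, str):
            return lca(frozenset([gene_to_species[node]]))
        child_maps = [visit(child) for child in node]
        mapped = lca(frozenset().union(*child_maps))
        # A node mapping to the same clade as one of its children is a duplication.
        event = "duplication" if mapped in child_maps else "speciation"
        events.append((sorted(mapped), event))
        return mapped

    visit(gene_tree)
    return events


# Toy example: species A and B are sisters; the gene duplicated in their ancestor.
species_tree = (("A", "B"), "C")
gene_tree = ((("a1", "b1"), ("a2", "b2")), "c1")
mapping = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
events = reconcile(gene_tree, species_tree, mapping)
```

In the toy example, the node joining the two (A, B) gene copies maps to the same species clade as its children and is therefore labelled a duplication, while the remaining internal nodes are speciations.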

The bioinformatics pipeline for gene-species tree reconciliation is complete, and the database has been populated with the reconciled trees and the host-tree relationships. Jamie Estill has written a BioPerl TreeIO module for parsing Newick trees extended with PRIMETV annotations and has contributed it back to the BioPerl repository. We are using an extended version of the Ensembl/compara database schema and associated Perl API. Sheldon McKay and Jamie Estill have been working on the bioinformatics and data modeling, and members of the core team are working on the user interface and data interface layer.

Species-guided gene trees and tree reconciliations are pre-computed for the first release, but later iterations will include larger numbers of species and provide the ability to do reconciliations on the fly. Jamie Estill and Dennis Roberts are working as a pair on the SQL queries necessary for the database searches and the associated changes to the Perl API. A developer preview release has been completed with the following features:

  • Database containing reconciliations for over 2500 gene families in six exemplar species (poplar, grape, cucumber, papaya, soybean and Arabidopsis thaliana).

  • GUI with the ability to:

    • Search by Gene Identifier or GO term

    • Perform BLAST searches

    • Visualize reconciled trees

    • Retrieve and visualize speciation and gene duplication events

    • Provide overall statistics and links to alignment and sequence files

A proof of concept was released on March 11th as part of the 0.3 release of the Discovery environment.

Preliminary TR use case

A preliminary TR use case is available on the wiki at: https://pods.iplantcollaborative.org/wiki/display/coresw/Use+Case+Tree+Reconciliation+Version+1

Phase 2 and the Marquis publication

The working group has started discussing the scope for Phase 2. Cooperation with the tree visualization group has been intensified with the goal of adding interactivity. The initial scoping for this project is available on the wiki at the following location:

https://pods.iplantcollaborative.org/wiki/display/coresw/Discussion of scope for TR version 2.0

This effort involves the coordination of work being done by developers at TACC (Adam Kubach) and the University of Arizona (Andrew Muir and Dennis Roberts) to provide interactions between a gene tree and a species tree. Additional work is being done by James Estill (University of Georgia) to populate the database with the outputs of the pipeline generated by Sheldon McKay for the first project. The goal of this project is to provide a unique view of a gene and species tree reconciliation that will allow for discoveries as they relate to the 1KP datasets.

In addition to the interactive trees being provided with this project, users will be able to search from the species tree for gene families of interest by selecting points of reference from the species tree. The original searches provided will continue to be available as an "advanced search" option.

Development of TR Phase 2.0 reached code completion in the 3rd week of February and will undergo UAT during the 4th week. Delivery to the working group is planned for February 28th. The features present in this release include:

  • Display of species trees and gene trees side-to-side, using the Tree Visualizer developed by the Tree Viz Group.

  • Interactive mapping of duplication and speciation events between gene and species tree and vice versa

  • Markups for speciation and duplication events on the gene tree nodes and of duplication events on the species tree branches.

  • Ability to add additional markups

  • Contextual menus

  • Advanced search functionalities, including:

    • BLAST

    • GO terms and accessions

    • Gene ID

  • GO tag clouds for gene families

  • Retrieval of underlying data (sequences and reconciliations)

Work continues on the Phase 2.0 version, which is in an advanced state of development. The working prototype has been made available to the working group and their feedback is being integrated. Concurrently, the process of releasing the source code for the TR application is being completed, and it is estimated that the front end will be available through GitHub by the end of the third week of March.

The tree reconciliation incubator project team is:

  • User interface - Andrew Muir, Sriram Srinivasan, Dennis Roberts

  • Perl database API extensions - Dennis Roberts

  • Infrastructure - Dennis Roberts and to be determined

  • Data modelling and schema - Jamie Estill, Sheldon McKay

  • Pipeline development/tools - Sheldon McKay

  • Requirements analysis - Andrew Lenards, Nicole Hopkins, Eric Lyons

  • User acceptance test plan - Kathleen Kennedy, Sheldon McKay, others

  • QA of product - Jerry Schneider, Pavithra Ravi, Bansri Poduval

  • Documentation and website content - Matthew Helmke/Sheldon McKay, others?

  • Coordination - Andrew Lenards/Nicole Hopkins/Sonya Lowry/Naim Matasci