NINJA

NINJA

Accelerated engagement to deploy NINJA for Neighbor-Joining trees with large data sets.
The contact for this project is Travis Wheeler

NINJA

  • Web site...

  • Implementation of Neighbor-Joining algorithm.

  • Written in Java, no major external dependencies.

  • Can do a 220K species tree in six days on one CPU, 2Gbytes RAM

  • Test case was 16s rRNA gene sequence data set, the same one as FASTTREE

  • NINJA offers the full Neighbor Joining method but is optimized for large data sets.

    • Employs filtering steps to minimize the number of pair-wise comparisons required for each iteration.

    • Relies on file I/O for externalization (ie query-optimized serialization) of data structures to free up RAM.

  • Uses external application (FASTTREE) to pre-process sequence data into a distance matrix. The file I/O involved is non-trivial.

Requirements

  • Completion of the software by adding sequence pre-processing (distance matrix calculation)

  • Benchmarking/testing on large data sets

  • What is really wanted is to port the algorithm to C or C++ and make parallelizable

Notes

  • Travis thinks the code is written in such a way that it could be easily ported to C. There are no serious external dependencies.

  • Is reliance on disk I/O to free up memory is a potential problem for large-scale parallel computing?

    • (edit from Travis: I don't believe this will be an impediment to parallelization, as the data structures are well suited to being handled in a distributed fashion. The core difficulty will be in setting those structures up in the first place ... which I think won't be terribly challenging)

  • Calculation of pair-wise distance matrix could be parallelized without too much pain, perl wrapper (for example) for matrix assembly would be the most challenging part, but not that hard.

    • (edit from Travis: I'd propose that distance computation should be done internal to NINJA, to avoid the I/O overhead of writing the distance matrix out to disk, then reading it back in for tree construction. To give a sense of scale, for 200,000 sequences, that distance matrix stored in an ascii file will require a file larger than 200GB)

  • Travis Wheeler is available to convey domain knowledge and design input but does not have time to be directly involved in coding

    • (edit from Travis:  I'm able to do some coding, just not most of it)

  • Cubic growth in run-time predicts a doubling of the data set (to 440K) would require approx ~~40 days on one CPU

    • That would be challenging for parameter tweaking

  • No explicit policy for missing data

  • Would be best to use the same sequence data as RaXML.

Draft plan

Phase 1

1) port NINJA from Java to C/C++

  • probably C++ as the binheaps used in the Java version are also in STL in C++

    2) import distance matrix calculation functionality

    3) test/benchmark C++ vs. Java implementation

Phase 2

scale-up and parallelization

note: Travis has posted a page with ideas/suggestions for the details of parallelization.