Accelerated engagement to deploy NINJA for Neighbor-Joining trees with large data sets.
The contact for this project is Travis Wheeler

NINJA

Reference...
Web site...
Implementation of Neighbor-Joining algorithm.
Written in Java, no major external dependencies.
Can do a 220K species tree in six days on one CPU, 2Gbytes RAM
Test case was 16s rRNA gene sequence data set, the same one as FASTTREE
NINJA offers the full Neighbor Joining method but is optimized for large data sets.
- Employs filtering steps to minimize the number of pair-wise comparisons required for each iteration.
- Relies on file I/O for externalization (ie query-optimized serialization) of data structures to free up RAM.
Uses external application (FASTTREE) to pre-process sequence data into a distance matrix. The file I/O involved is non-trivial.

Requirements

Completion of the software by adding sequence pre-processing (distance matrix calculation)
Benchmarking/testing on large data sets
What is really wanted is to port the algorithm to C or C++ and make parallelizable

Notes

Travis thinks the code is written in such a way that it could be easily ported to C. There are no serious external dependencies.
Is reliance on disk I/O to free up memory is a potential problem for large-scale parallel computing?
- (edit from Travis: I don't believe this will be an impediment to parallelization, as the data structures are well suited to being handled in a distributed fashion. The core difficulty will be in setting those structures up in the first place ... which I think won't be terribly challenging)
Calculation of pair-wise distance matrix could be parallelized without too much pain, perl wrapper (for example) for matrix assembly would be the most challenging part, but not that hard.
- (edit from Travis: I'd propose that distance computation should be done internal to NINJA, to avoid the I/O overhead of writing the distance matrix out to disk, then reading it back in for tree construction. To give a sense of scale, for 200,000 sequences, that distance matrix stored in an ascii file will require a file larger than 200GB)
Travis Wheeler is available to convey domain knowledge and design input but does not have time to be directly involved in coding
- (edit from Travis: I'm able to do some coding, just not most of it)
Cubic growth in run-time predicts a doubling of the data set (to 440K) would require approx ~~40 days on one CPU
- That would be challenging for parameter tweaking
No explicit policy for missing data
Would be best to use the same sequence data as RaXML.

Draft plan

Phase 1

1) port NINJA from Java to C/C++

probably C++ as the binheaps used in the Java version are also in STL in C++

See the attached NINJA_Java_Code for source code and notes
2) import distance matrix calculation functionality

A possible place to obtain code for this is FastTree
Probably better is fastdist (http://www.nada.kth.se/~isaac/publications/fastdist/fastdist.html)
3) test/benchmark C++ vs. Java implementation

Phase 2

scale-up and parallelization

note: Travis has posted a page with ideas/suggestions for the details of parallelization.