


Data intake pipeline

Update on PHLAWD


  1. Establish a robust data pipeline that generates multiple sequence alignments to feed into big species tree generation
  2. Engage iPlant faculty (Nirav Merchant, Sudha Ram, Eric Lyons) for domain expertise in data infrastructure, metadata management, and scientific workflows.


  • Robust data upload capability (iRODS; iPlant-wide requirement)
  • Metadata management
  • Robust data storage and retrieval (iPlant-wide requirement)
  • Input data validation
  • Multiple sequence alignment generation
    • PHLAWD
    • MORE (?)
    • MUSCLE, etc.
  • Sequence database(s)
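
The input data validation step listed above could look something like the following. This is only a minimal sketch, assuming uploads arrive as FASTA-formatted nucleotide sequences; the function name `validate_fasta` and the specific checks are hypothetical illustrations, not an agreed design.

```python
# Hypothetical sketch: basic checks on FASTA nucleotide input before alignment.
VALID_DNA = set("ACGTRYSWKMBDHVN-")  # IUPAC nucleotide codes plus the gap character


def validate_fasta(text):
    """Return (records, errors) for FASTA-formatted text."""
    records = {}
    errors = []
    name = None  # name of the record currently being read, if any
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token after ">".
            name = line[1:].split()[0] if line[1:].strip() else ""
            if not name:
                errors.append(f"line {lineno}: empty sequence name")
                name = None
            elif name in records:
                errors.append(f"line {lineno}: duplicate name {name!r}")
                name = None
            else:
                records[name] = ""
        elif name is None:
            errors.append(f"line {lineno}: sequence data without a valid header")
        else:
            bad = set(line.upper()) - VALID_DNA
            if bad:
                errors.append(f"line {lineno}: invalid characters {sorted(bad)}")
            records[name] += line.upper()
    for rec_name, seq in records.items():
        if not seq:
            errors.append(f"record {rec_name!r}: no sequence data")
    return records, errors
```

A caller would reject or flag an upload whenever the returned error list is non-empty, before passing the records on to an alignment tool.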


  • How will data assembly feed into big trees?
  • How much overlap with 1kp?


Action Item: Sheldon will establish contact with Gordon Burleigh, and have him work with John Cazes and Eric Lyons

  • Sharon Wei is on maternity leave; John Cazes will assume her responsibilities in her absence
  • The 1kp project should be a subset of Data Assembly and not independent of it
  • Need a streamlined mechanism to get data into huge matrices to build huge trees, i.e., to compile and analyze data to create a tree
  • Need to come up with a good data model. There is huge overlap; 1kp and other data storage need to be brought into the discussion.
    All data should be stored in a similar format, with tools developed around that format.
  • Not appropriate to have only one alignment tool available; any set of alternatives that can be included would be good
  • RAxML and Big Tree building will communicate with the DA group by reporting their activities at the DA meetings
  • Establish more consistent communication with Pam and Doug Soltis
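
As an illustration of what "getting data into huge matrices" can mean in practice, the sketch below concatenates per-gene alignments into a single taxon-by-site matrix, padding taxa that lack a gene with gap characters. It is a hedged example of one common approach, not the mechanism the group has chosen; the function name `build_supermatrix` and the input layout are assumptions made for illustration.

```python
# Hypothetical sketch: concatenate per-gene alignments into one supermatrix.
def build_supermatrix(alignments, missing="-"):
    """Concatenate per-gene alignments into one taxon-by-site matrix.

    alignments: dict mapping gene name -> {taxon: aligned sequence}.
    Returns (matrix, partitions), where matrix maps taxon -> concatenated
    sequence and partitions lists (gene, first_site, last_site), 1-based.
    """
    taxa = sorted({t for aln in alignments.values() for t in aln})
    matrix = {t: [] for t in taxa}
    partitions = []
    start = 1
    for gene in sorted(alignments):
        aln = alignments[gene]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to equal length")
        width = lengths.pop()
        for t in taxa:
            # Taxa without this gene get a block of missing-data characters.
            matrix[t].append(aln.get(t, missing * width))
        partitions.append((gene, start, start + width - 1))
        start += width
    return {t: "".join(parts) for t, parts in matrix.items()}, partitions
```

The partition list records which columns came from which gene, which is the kind of information a tree-building tool typically needs alongside the concatenated matrix.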