Assembly_20101029

Missouri Brassica analyses on iPlant TeraGrid/iG2P collaboration
October 29, 2010; 12:30 to 1:30 pm EDT

Present: Matt Vaughn, Eric Lyons, Chris Pires, Pat Edger, Karla Gendler, Steve Goff

Action Items:

  • Pat Edger will try to log onto Ranger and load data over the coming week, and will start analyses in a few weeks (or Matt Vaughn will help him get on)
  • Scott and Pat will work with Matt to install Velvet, ABySS, and other tools on Blacklight
  • Matt Vaughn will try out Discovery's memory virtualization in late November/early December, once we have kicked the tires on Ranger for a while and have some datasets for him to use
  • Pires will visit Arizona right after Thanksgiving (on campus Nov 29-30) and will meet with Gendler, Lyons, Goff, and others then

Notes/Agenda:

  • Project Objective
    • Genome assembly of Brassica oleracea, B. rapa, and related species
    • Assemble one at a time
    • Urgent priorities are de novo transcriptomes and genomes
  • Resources
    • Ranger (TACC)
      • 62k cores, robust storage, nodes interconnected for MPI jobs
      • has one 256 GB node and lots of small nodes
      • much better tuned for running ABySS
      • under the queue system, a job that starts swapping will get killed
      • Ranger has had some downtime due to power outages
      • assembly packages will be available by the middle of next week
      • instead of extending PATH, etc., there’s a command called “module” that packages tools up and pushes them into your environment
      • can stage all of your data right now
      • the 350 GB storage quota is somewhat flexible
      • Ranger has 3 file systems: home (relatively small), work (keep data there), scratch (do all intermediate work/analysis here)
      • Pat should try to log in to Ranger to see if his account is all set up
        • for module support, Matt will be in contact when it is ready
        • can submit test jobs under the iPlant project name
      • Matt has used Oases, but it is a driver for Velvet
    • Blacklight (Pittsburgh Supercomputing Center)
      • 4 TB of shared memory
      • the understanding is that we can get access to 4 TB of memory (not the full 32 TB, as that creates architecture issues)
      • early-user support, as Blacklight is not ready for production
        • this means it can crash, lose data, and be buggy
        • it can be used for exploration, as there are far fewer rules than on production systems
        • can run into other users' jobs, as the scheduling system is not as locked down
        • still a research-grade piece of equipment
      • Major rule
        • if we have an issue or complaint, we need to keep it within the working group
          • consider it confidential; they just don’t want issues going outside the group
      • the account on Blacklight was approved as of this morning (Scott is the best person to contact; he is the new bioinformatics core director)
      • Matt needs to get hold of all the documentation to pass on
      • none of the applications we need are pre-built on Blacklight; we will need to build and install them ourselves (which is OK with Matt Vaughn if the group builds them themselves)
    • DASH (UCSD)
      • we do not currently have an allocation, but one may come through
      • DASH uses ScaleMP, the “virtualized” version rather than “real RAM”; real RAM is more likely what we need, so starting on Blacklight is good
    • NAUTILUS (Oak Ridge)
      • same architecture as Blacklight but barely functional
      • Scale-MPI
    • DISCOVERY (TACC)
      • research cluster that has memory virtualization
      • has about 300 GB of memory
      • could move to another set of nodes
      • once a dataset is staged onto Ranger, Matt would like to test it on Discovery
        • ideally one dataset that requires more than 250 GB but less than 512 GB, to see what the performance is like
      • the virtualization will allow scaling across nodes
  • work with Matt
  • very interested in when this can be provided as a commodity service later in the life of the project
  • have been working with JCVI, and their tool seems to be working well
    • can make all raw data public and could have grand challenges around assembly and annotation
    • can’t get raw data from BGI
    • in the process of writing up a paper on what gives the best transcriptome
    • the rate-limiting step is definitely memory (a rough back-of-envelope memory sketch is appended at the end of these notes)
  • every large allocation includes genomics/assembly, RAxML, and Windjammer/NINJA
  • have been barcoding with the aim of seeing what a $500 or $1k genome looks like; one person is using reference reads and creating an annotation pipeline
  • would apply to TeraGrid for a non-startup allocation once usage exceeds 250k hours; applications are competitive, and this is not cost recovery
  • aim to start playing around with data around November 20
  • Long term goal: transcriptome/genome assembly for dummies
    • high-performance discovery environments, or always command line?
    • if the use case is for an average grad student to say "I have 1/2 TB of data," then it needs to be user friendly
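
Rough memory sketch (referenced above): a minimal Python back-of-envelope estimate of assembler peak RAM, assuming memory is dominated by a hash table over distinct k-mers and that sequencing errors/heterozygosity inflate the distinct k-mer count by a constant factor. The genome size, inflation factor, and bytes-per-entry values below are illustrative assumptions, not measurements from our data or from Velvet/ABySS themselves.

    # Back-of-envelope peak-RAM estimate for a de Bruijn graph assembler (e.g. Velvet).
    # A sketch under stated assumptions, not the assembler's actual memory model.

    def estimate_ram_gb(genome_size_bp, kmer_inflation=8.0, bytes_per_kmer=60):
        """Crude peak-RAM estimate in GB.

        genome_size_bp -- haploid genome (or transcriptome) size in bases
        kmer_inflation -- factor by which errors/heterozygosity multiply the
                          number of distinct k-mers (assumption)
        bytes_per_kmer -- per-entry cost including hash/graph overhead (assumption)
        """
        distinct_kmers = genome_size_bp * kmer_inflation
        return distinct_kmers * bytes_per_kmer / 1e9

    if __name__ == "__main__":
        # Placeholder Brassica-scale genome of ~600 Mb.
        print(f"~{estimate_ram_gb(600e6):.0f} GB peak RAM estimated")

With these placeholder numbers a ~600 Mb genome lands near ~290 GB, i.e. above the 256 GB Ranger node but within the 250-512 GB window targeted for the Discovery test and well under Blacklight's 4 TB.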