Assembly_20101029
Missouri Brassica analyses on iPlant TeraGrid/iG2P collaboration
October 29, 2010; 12:30 to 1:30 pm EDT
Present: Matt Vaughn, Eric Lyons, Chris Pires, Pat Edger, Karla Gendler, Steve Goff
Action Items:
- Pat Edger will try to log onto Ranger and load data over the coming week, and start analyses in a few weeks (or Matt will help him get on)
- Scott and Pat will work with Matt to load Velvet, ABySS, and other tools onto Blacklight
- Matt Vaughn will try to use Discovery's virtual memory in late November/early December, once we have kicked the tires on Ranger for a while and have some datasets for Matt to use
- Pires will visit Arizona right after Thanksgiving (on campus Nov 29-30) and will meet with Gendler, Lyons, Goff, and others then
Notes/Agenda:
- Project Objective
- Genome assembly of Brassica oleracea, B. rapa, and related species
- Assemble one at a time
- Urgent priorities are de novo transcriptomes and genomes
- Resources
- Ranger (TACC)
- ~62k cores, robust storage, nodes interconnected for MPI jobs
- has one 256GB node and lots of small nodes
- much better tuned for running ABySS
- under the queue system, a job that starts swapping will be killed
- Ranger has had some downtime due to power outages
- assembly packages will be available mid next week
- instead of extending your PATH, etc., there is a command called "module" that packages tools together and pushes them into your environment (see the example at the end of this Ranger list)
- can stage all of your data right now
- the 350GB storage quota is somewhat flexible
- Ranger has 3 file systems: home (relatively small), work (keep data here), scratch (do all intermediate work/analysis here)
- Pat should try to log in to Ranger to see if his account is all set up
- for module support, Matt will be in contact when it is ready
- test jobs can be submitted under the iPlant project name, as in the example below
- Matt has used Oases, but it is essentially a driver for Velvet
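- Example: a minimal sketch of a first Ranger session, assuming a hypothetical module name, the $WORK/$SCRATCH environment variables, and placeholder job-script details (queue, project name, file paths) — none of these were confirmed in the meeting:

      # interactive: load packaged tools and stage data
      module avail                      # list tools packaged on Ranger
      module load velvet                # hypothetical module name
      cp reads.fastq $WORK/brassica/    # work: keep data; scratch: intermediates

      # minimal SGE test job; submit with: qsub test_assembly.sge
      cat > test_assembly.sge <<'EOF'
      #$ -N brassica-test
      #$ -A iPlant                      # placeholder iPlant project/allocation name
      #$ -q normal
      #$ -pe 16way 16                   # one 16-core Ranger node
      #$ -l h_rt=12:00:00
      cd $SCRATCH
      velveth asm 31 -fastq -short $WORK/brassica/reads.fastq
      velvetg asm -read_trkg yes        # read tracking is required by Oases
      oases asm                         # Oases post-processes the Velvet directory
      EOF
      qsub test_assembly.sge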
- Blacklight (Pittsburgh Supercomputing Center)
- 4TB of shared memory
- the understanding is that we can get access to 4TB of memory (not the full 32TB, as this creates architecture issues)
- early-user support, as Blacklight is not yet ready for production
- this means it can crash, lose data, and be buggy
- it can be used for exploration, as there are far fewer rules than on a production system
- jobs can run into other users, as the scheduling system is not as locked down
- it is still a research-grade piece of equipment
- Major rule
- if you have an issue or complaint, keep it within the working group
- consider it confidential; they just don't want problems being discussed outside
- the account on Blacklight was approved as of this morning (Scott is the best person to contact; he is the new bioinformatics core director)
- Matt needs to get hold of all the documentation to pass on
- none of the applications we need will be pre-built on Blacklight; we will need to build and install them ourselves (Matt Vaughn is fine with the group building them itself); a rough build sketch follows
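- Example: since nothing we need is pre-built, a rough sketch of building the assemblers from source; the version numbers, install prefix, and MPI path are assumptions:

      # Velvet: compile-time limits are set through make variables
      tar xzf velvet_1.0.18.tgz && cd velvet_1.0.18
      make 'MAXKMERLENGTH=63' 'OPENMP=1'   # allow larger k-mers, enable threading
      cp velveth velvetg $HOME/local/bin/

      # ABySS: standard autoconf build; MPI support is picked up at configure time
      tar xzf abyss-1.2.5.tar.gz && cd abyss-1.2.5
      ./configure --prefix=$HOME/local --with-mpi=/usr/lib/openmpi
      make && make install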
- DASH (UCSD)
- we do not currently have an allocation, but one may come through
- DASH uses ScaleMP, which is "virtualized" shared memory rather than "real RAM"; real RAM is more likely what we need, so starting on Blacklight is good
- NAUTILUS (Oak Ridge)
- same architecture as Blacklight but barely functional
- ScaleMP
- DISCOVERY (TACC)
- a research cluster with memory virtualization
- has about 300GB of memory
- could move to another set of nodes
- the hope is that once the datasets are staged onto Ranger, Matt can test on Discovery
- use one dataset that should require more than 250GB but less than 512GB, to see what the performance is like (peak memory can be measured as in the example below)
- will allow scaling across nodes
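- Example: one way to see where a dataset lands in that 250-512GB window is to run the assembler under GNU time and record peak resident memory; the file names here are placeholders:

      # -v reports "Maximum resident set size" (in KB) along with wall time
      /usr/bin/time -v velveth asm 31 -fastq -short reads.fastq 2> velveth.time
      grep 'Maximum resident set size' velveth.time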
- Ranger (TACC)
- work with Matt
- very interested in when this can be provided as a commodity service later in the life of the project
- have been working with JCVI and their tool seems to be working well
- can make all raw data public and could have grand challenges around assembly and annotation
- can’t get raw data from BGI
- in the process of writing up a paper on what gives the best transcriptome
- the rate-limiting step is definitely memory
- every large allocation has genomics/assembly, RAxML, and Windjammer/NINJA
- have been barcoding with the aim of seeing what a $500 or $1k genome looks like; one person is using reference reads and creating an annotation pipeline
- would apply to TeraGrid for a non-startup allocation once we exceed 250k hours; applications are competitive, and this is not cost recovery
- aim to start playing around with the data around November 20
- Long-term goal: transcriptome/genome assembly for dummies
- high-performance discovery environments, or always command line?
- if the use case is for the average grad student to say "I have 1/2 TB of data", then it needs to be user-friendly