Assembly_20101029
Missouri Brassica analyses on iPlant TeraGrid/iG2P collaboration
October 29, 2010; 12:30 to 1:30 pm EDT
Present: Matt Vaughn, Eric Lyons, Chris Pires, Pat Edger, Karla Gendler, Steve Goff
Action Items:
- Pat Edger will try to log onto Ranger and load data over the coming week, and start analyses in a few weeks (or Matt will help him get on)
- Scott and Pat will work with Matt to load Velvet, ABySS, and other tools onto Blacklight
- Matt Vaughn will try to use Discovery's virtual memory in late November/early December, once we have kicked the tires on Ranger for a while and have some datasets for Matt to use
- Pires will visit Arizona right after Thanksgiving (on campus Nov 29-30) and will meet with Gendler, Lyons, Goff, and others then
Notes/Agenda:
- Project Objective
- Genome assembly of Brassica oleracea, B. rapa, and related species
- Assemble one at a time
- Urgent priorities are de novo transcriptomes and genomes
- Resources
- Ranger (TACC)
- ~62k cores, robust storage, nodes interconnected for MPI jobs
- has one 256GB node and lots of small nodes
- much better tuned for running ABySS
- under the queue system, a job that starts swapping will be killed
- Ranger has had some downtime due to power outages
- assembly packages will be available mid next week
- instead of extending your PATH, etc., there is a command called "module" that packages tools together and pushes them into your environment (see the example at the end of this Ranger list)
- can stage all of your data right now
- the 350GB storage quota is somewhat flexible
- Ranger has 3 file systems: home (relatively small), work (keep data here), scratch (do all intermediate work/analysis here)
- Pat should try to log in to Ranger to see if his account is all set up
- for module support, Matt will be in contact when it is ready
- test jobs can be submitted under the iPlant project name, as in the example below
- Matt has used Oases, but it is essentially a driver for Velvet
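- Example: a minimal sketch of a first Ranger session, assuming a hypothetical module name, the $WORK/$SCRATCH environment variables, and placeholder job-script details (queue, project name, file paths) — none of these were confirmed in the meeting:

      # interactive: load packaged tools and stage data
      module avail                      # list tools packaged on Ranger
      module load velvet                # hypothetical module name
      cp reads.fastq $WORK/brassica/    # work: keep data; scratch: intermediates

      # minimal SGE test job; submit with: qsub test_assembly.sge
      cat > test_assembly.sge <<'EOF'
      #$ -N brassica-test
      #$ -A iPlant                      # placeholder iPlant project/allocation name
      #$ -q normal
      #$ -pe 16way 16                   # one 16-core Ranger node
      #$ -l h_rt=12:00:00
      cd $SCRATCH
      velveth asm 31 -fastq -short $WORK/brassica/reads.fastq
      velvetg asm -read_trkg yes        # read tracking is required by Oases
      oases asm                         # Oases post-processes the Velvet directory
      EOF
      qsub test_assembly.sge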
- Blacklight (Pittsburgh Supercomputing Center)
- 4TB of shared memory
- the understanding is that we can get access to 4TB of memory (not the full 32TB, as this creates architecture issues)
- early-user support, as Blacklight is not yet ready for production
- this means it can crash, lose data, and be buggy
- it can be used for exploration, as there are far fewer rules than on a production system
- jobs can run into other users, as the scheduling system is not as locked down
- it is still a research-grade piece of equipment
- Major rule
- if you have an issue or complaint, keep it within the working group
- consider it confidential; they just don't want problems being discussed outside
- the account on Blacklight was approved as of this morning (Scott is the best person to contact; he is the new bioinformatics core director)
- Matt needs to get hold of all the documentation to pass on
- none of the applications we need will be pre-built on Blacklight; we will need to build and install them ourselves (Matt Vaughn is fine with the group building them itself); a rough build sketch follows
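- Example: since nothing we need is pre-built, a rough sketch of building the assemblers from source; the version numbers, install prefix, and MPI path are assumptions:

      # Velvet: compile-time limits are set through make variables
      tar xzf velvet_1.0.18.tgz && cd velvet_1.0.18
      make 'MAXKMERLENGTH=63' 'OPENMP=1'   # allow larger k-mers, enable threading
      cp velveth velvetg $HOME/local/bin/

      # ABySS: standard autoconf build; MPI support is picked up at configure time
      tar xzf abyss-1.2.5.tar.gz && cd abyss-1.2.5
      ./configure --prefix=$HOME/local --with-mpi=/usr/lib/openmpi
      make && make install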
- DASH (UCSD)
- we do not currently have an allocation, but one may come through
- DASH uses ScaleMP, which is "virtualized" shared memory rather than "real RAM"; real RAM is more likely what we need, so starting on Blacklight is good
- NAUTILUS (Oak Ridge)
- same architecture as Blacklight but barely functional
- ScaleMP
- DISCOVERY (TACC)
- a research cluster with memory virtualization
- has about 300GB of memory
- could move to another set of nodes
- the hope is that once the datasets are staged onto Ranger, Matt can test on Discovery
- use one dataset that should require more than 250GB but less than 512GB, to see what the performance is like (peak memory can be measured as in the example below)
- will allow scaling across nodes
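- Example: one way to see where a dataset lands in that 250-512GB window is to run the assembler under GNU time and record peak resident memory; the file names here are placeholders:

      # -v reports "Maximum resident set size" (in KB) along with wall time
      /usr/bin/time -v velveth asm 31 -fastq -short reads.fastq 2> velveth.time
      grep 'Maximum resident set size' velveth.time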
- Ranger (TACC)
- work with Matt
- very interested in when this can be provided as a commodity service later in the life of the project
- have been working with JCVI and their tool seems to be working well
- can make all raw data public and could have grand challenges around assembly and annotation
- can’t get raw data from BGI
- in the process of writing up a paper on what gives the best transcriptome
- the rate-limiting step is definitely memory
- every large allocation has genomics/assembly, RAxML, and Windjammer/NINJA
- have been barcoding with the aim of seeing what a $500 or $1k genome looks like; one person is using reference reads and creating an annotation pipeline
- would apply to TeraGrid for a non-startup allocation once we exceed 250k hours; applications are competitive, and this is not cost recovery
- aim to start playing around with the data around November 20
- Long-term goal: transcriptome/genome assembly for dummies
- high-performance discovery environments, or always command line?
- if the use case is for the average grad student to say "I have 1/2 TB of data", then it needs to be user-friendly