GATC02

Meeting 6/15/2011

Phase scope: create a Transcript Assembly pipeline on a timeline for ASPB

Attendees: Josh, Shiran, follow-up with Matt

Rationale

Transcript assembly gives the biggest bang for the buck in terms of defining gene-space. On a practical level it is also less computationally intensive than genome assembly, requiring less memory because of the reduced repeat content.

Bare-bones requirements for implementation of a workable demo.

Josh: Abyss is already running on Ranger and Shiran has experience using it on transcript assembly, with reasonably good results. So we would proceed with this. In the background we have a talented URP student who will be comparing Abyss with Inchworm and Allpaths on the Amazon cloud under Shiran's supervision, which can feed into longer-term objectives.

Matt: OK - in 7-10 days, we should be able to wrap and run Abyss using the Foundation API. I am more than happy to expand to other assemblers going forward! I'll ask Roger to make sure that we know how to run the entire abyss-pe workflow on a large-memory machine (256 GB to start, though we can move to 1 TB if need be).

Josh: For initial evaluation of FASTQ reads we propose using FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/) or another open-source tool. This package is attractive because it provides summary graphs and tables. It has been demo'd on Illumina, 454, and PacBio, so I have no reason to think that SOLiD wouldn't also work. It requires a Java Runtime Environment. Is this feasible?

Matt: Java is installed on Ranger - load module java/1.6.0

Matt: I agree 100% about using FastQC up front - so much so that I built a version of it into the DE that renders the HTML+images report into a PDF so the DE can display it. We can definitely build a Foundation API version.

Josh: Read filtering - decGPU. Shiran--What do you currently do?

Matt: I was thinking about doing read correction using decGPU on longhorn, our 512 GPU cluster. But I don't know if we have time to get it running and debugged before ASPB.

Josh: The main Abyss parameter that a naive user needs to consider is Kmer length. Abyss outputs a Kmer frequency table but apparently only as part of the assembly output as opposed to a pre-analysis. If we want a Kmer index analysis prior to running Abyss, for the purpose of parameterization, a tool such as Tallymer could be used (a standard hash script is likely to run out of memory, whereas Tallymer uses suffix arrays).

Matt: My thoughts exactly - we have 32 GB/node RAM on Ranger, and 24 GB/node on Lonestar to work with. Or we could do Tallymer analysis on Longhorn and use the 144 GB nodes. Anyway, we are on the same page. It's my understanding that one needs a kmer analysis before doing read correction as well, so this kills a couple of birds with one stone.
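
Note (illustrative sketch, not from the meeting): the Kmer frequency table discussed above can be shown on a toy scale with a plain in-memory hash. This is the naive approach Josh warns will run out of memory on real read sets, so it is only meant to show what the table contains, not to stand in for Tallymer. The function names and the example reads are made up for illustration.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across the reads using an in-memory hash (Counter)."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers that contain an ambiguous base call.
            if "N" not in kmer:
                counts[kmer] += 1
    return counts

def frequency_histogram(counts):
    """Map 'number of occurrences' -> 'number of distinct k-mers seen that often'."""
    return Counter(counts.values())

# Toy reads, invented for this example only.
reads = ["ACGTACGTAC", "CGTACGTACG", "ACGTNACGTA"]
counts = kmer_counts(reads, 4)
hist = frequency_histogram(counts)

for occurrences in sorted(hist):
    print(occurrences, hist[occurrences])
```

Memory grows with the number of distinct Kmers, which is why a suffix-array tool is preferred for full data sets.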

Josh: After defining a baseline Kmer the user would then pick 2 values below and 2 above it at increments of 5, such as 40, 45, 50, 55, 60 for a baseline of 50. We would need an R script or similar for generating a graph of index curves at a given Kmer. Christos is on board for developing R scripts.

Matt: Yes, and as long as he parameterizes them the same way he does for the DE, this will work well.
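
Note (illustrative sketch, not from the meeting): the Kmer sweep Josh describes, a baseline plus 2 values below and 2 above at increments of 5, is simple to generate. The function name and the defaults are assumptions chosen to match the example in the notes; the read file names and run-name prefix in the command strings are placeholders, and the strings are only printed, not executed.

```python
def kmer_sweep(baseline, step=5, below=2, above=2):
    """Return the baseline k plus `below` smaller and `above` larger values."""
    ks = [baseline + step * i for i in range(-below, above + 1)]
    # Drop non-positive values that can appear for very small baselines.
    return [k for k in ks if k > 0]

def sweep_commands(baseline, reads="reads_1.fq reads_2.fq", prefix="demo"):
    """Build one abyss-pe command line per k (file names are placeholders)."""
    return [
        f"abyss-pe k={k} name={prefix}_k{k} in='{reads}'"
        for k in kmer_sweep(baseline)
    ]

for cmd in sweep_commands(50):
    print(cmd)
```

Keeping the sweep in one function means the DE and the Foundation API wrapper could share the same parameterization.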

Josh: I would imagine that sequencing strategy would also come into play in terms of Abyss methods. E.g. stranded vs non-stranded, paired-end. Shiran--please fill in.

For post-analysis we can use Abyss Explorer which makes graphical distribution plots (Java).

Matt: Is this an interactive application? If so, the way to use it on outputs from the DE is to fire up a VM in Atmosphere. We could have Abyss explorer installed on the Bioinformatics appliance for this purpose. If it generates static images, we can call it as a shell application and generate reports.

Josh: Annotation: Align to CDS or protein sequences of a well-annotated, appropriate reference species. Blastn/Blastx is an option, as well as BLAT. We already have parsing scripts for either of these options, and scripts for defining coverage. In addition, we have scripts to define the longest ORF and provide its translation. We could also show a 6-frame translation underneath the cDNA sequence with annotation of start and stop codons.

Matt: Showing in the August version of the DE may be hard - we only have the ability to show static
Priority is getting the Kmer sweep + post-assembly annotation working for ASPB - these are what I'd say are virtually impossible for people to easily do on their own.
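
Note (illustrative sketch, not from the meeting): the longest-ORF and 6-frame translation steps mentioned under Annotation can be sketched as below. This is not the group's existing script. It assumes the standard genetic code and defines an ORF as the stretch from the first Met to the next stop codon in a frame, so ORFs with no stop (common in partial transcripts) are ignored. All function names and the example sequence are made up.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Reverse-complement a DNA string; non-ACGT characters pass through."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frames(seq):
    """Return translations keyed by frame label: +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames

def longest_orf(seq):
    """Return (frame, protein) for the longest Met-to-stop ORF, stop excluded."""
    best = ("", "")
    for frame, protein in six_frames(seq).items():
        # The last piece after splitting is not terminated by a stop codon.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (frame, segment[start:])
    return best

# Toy cDNA, invented for this example only.
cdna = "CCATGGCTAAATGACC"
for frame, protein in six_frames(cdna).items():
    print(frame, protein)
print(longest_orf(cdna))
```

In the printed frames, `M` marks candidate start codons and `*` marks stops, which is the information a static report would need to annotate under the cDNA sequence.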