e-mail

Begin forwarded message:

From: "Stacey D. Smith" <sdsmith@unl.edu>
Date: July 15, 2011 12:10:45 PM EDT
To: Josh Stein <steinj@cshl.edu>
Cc: "Lyons, Eric H - (ericlyons)" <ericlyons@email.arizona.edu>, Eric Lyons <elyons@iplantcollaborative.org>, Matthew Vaughn <vaughn@tacc.utexas.edu>, Stephen Goff <sgoff@iplantcollaborative.org>, "Barthelson, Roger A - (rogerab)" <rogerab@email.arizona.edu>, "Merchant, Nirav C - (nirav)" <nirav@email.arizona.edu>
Subject: Re: tools for gene annotation

Hi Josh,
Yes, it would be great if you could provide us with updates on your project. It sounds very relevant to what we are trying to do (and I am sure there are many users like us, although probably aiming to compare their transcriptomes to references other than tomato). As you allude to, we had originally hoped to try out different assembly methods and carry the uncertainty in the assembly down through the annotation (thinking that the annotation would help us to judge the quality of the assembly). However, our in-house assembly did not come out as nicely as the Newbler assembly that the 454 sequencing facility made for us, so we are sticking with that one for now.
Cheers,
Stacey

On Fri, Jul 15, 2011 at 10:51 AM, Josh Stein <steinj@cshl.edu> wrote:
Hi Stacey,

I am working on a project within iPlant to enable the kind of analysis you described. The idea is to use a translated blast (BLASTX) to align transcriptome assemblies to protein databases, and to use the output both for the purpose of annotation and to evaluate assembly quality so that users can compare different assembly methods. We hope to have a working prototype this summer and would like to engage potential users to test and provide feedback. At this point the scope is limited to providing annotation associated with the blast database (i.e. the fasta description line) rather than consulting auxiliary data such as GO, but it's a start. As for the blast database, we are currently using the plant and plastid divisions of RefSeq since it is non-redundant, well-curated, and has good representation of plant taxa and complete genomes. However, other databases, such as those for tomato and potato, could be made available.
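[Editor's sketch] One way the BLASTX output could feed an assembly-quality comparison is to ask how much of each matched protein is covered by the best hit for a transcript. The Python below is only an illustration under stated assumptions, not the iPlant prototype: it assumes the search was run with a custom tabular format containing qseqid, sseqid, sstart, send, slen and bitscore, and the 0.9 "near full length" cutoff and all function names are hypothetical.

```python
def subject_coverage(rows):
    """Map each transcript to the protein coverage of its best-scoring hit.

    rows: iterable of (qseqid, sseqid, sstart, send, slen, bitscore),
    an assumed column layout; coordinates are in protein residues.
    """
    best = {}
    for query, _subject, sstart, send, slen, bitscore in rows:
        coverage = (abs(send - sstart) + 1) / slen
        # Keep only the hit with the highest bit score for each transcript.
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, coverage)
    return {query: cov for query, (_bits, cov) in best.items()}


def summarize(coverages, n_transcripts, full_length=0.9):
    """Summarize one assembly; full_length is an arbitrary illustrative cutoff."""
    near_full = sum(1 for cov in coverages.values() if cov >= full_length)
    return {
        "fraction_with_hit": len(coverages) / n_transcripts,
        "fraction_near_full_length": near_full / n_transcripts,
    }
```

Running the same summary on two assemblies of the same reads would give numbers that can be compared side by side, which is the kind of comparison described above.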

Perhaps I can contact you in the future as this project matures?

Best regards,
Josh

Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj@cshl.edu http://ware.cshl.org/

On Jul 14, 2011, at 5:57 PM, Barthelson, Roger A - (rogerab) wrote:

Hi Stacey-

Your email was referred to me and since I have dealt with similar questions, I thought I might see if I can help. The website you mentioned in your email has a lot of good information. It looks like a good place to start. For the easiest job of identifying your transcripts and determining what sorts of genes they represent, the unigene sequences in the ftp section of the website are great: ftp://ftp.solgenomics.net/unigene_builds/. They are generous about the information they include in the .seq files, which are fasta formatted, although the gene name lines are huge -- way beyond the standard. That is good if you use a program that preserves them. It’s been a while since I’ve used blast, but I think it may not be too sensitive to fasta name length. You would have to install blast locally. If you don’t have it, it is available here: ftp://ftp.ncbi.nih.gov/blast/executables/blast+/LATEST/. You will need to create an indexed database from the unigene sequences for blast. You can also use blat, which I often use. I know it will tolerate long fasta name lines and preserve them in the output, along with a lot of other useful data. On the other hand, blast scores are pretty standard and recognized. For blat you could use the -fastMap setting if you don’t want to worry about splitting the transcripts for matches. In any case blat is available here: http://users.soe.ucsc.edu/~kent/.
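[Editor's sketch] For a local BLAST+ install, the two steps described above (index the unigene sequences, then search the transcripts against them) might look roughly like the following. The Python only builds the command lines; the file names and database name are hypothetical, blastn is an assumed choice of program, and the e-value cutoff is an arbitrary example. The "stitle" column asks BLAST+ to report the full fasta description line of each hit.

```python
import shlex


def makeblastdb_cmd(fasta_path, db_name):
    """Command that indexes a nucleotide fasta file as a BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_name]


def blastn_cmd(query_path, db_name, out_path, evalue="1e-10"):
    """Command that searches the transcripts against the indexed database.

    The custom tabular format ends with 'stitle' so that the long unigene
    description lines are carried into the output.
    """
    outfmt = "6 qseqid sseqid pident length evalue bitscore stitle"
    return ["blastn", "-query", query_path, "-db", db_name,
            "-evalue", evalue, "-outfmt", outfmt, "-out", out_path]


if __name__ == "__main__":
    # Hypothetical file names; substitute the downloaded .seq file and your assembly.
    commands = [
        makeblastdb_cmd("tomato_unigene.seq", "tomato_unigene"),
        blastn_cmd("iochroma_transcripts.fasta", "tomato_unigene", "hits.tsv"),
    ]
    for cmd in commands:
        # Print for inspection; subprocess.run(cmd, check=True) would execute it.
        print(shlex.join(cmd))
```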

The unigene fasta name lines for tomato may include enough GO annotation information for your purpose, but if you want to do it more formally, the solgenomics.net ftp section has annotation files (manual annotations) for tomato (ftp://ftp.solgenomics.net/manual_annotations/). This is just a tabular file, so you would need to use a script such as you find here: http://sysbio.harvard.edu/csb/resources/computational/scriptome/UNIX/Tools/Choose.html. Or you could probably use VLOOKUP in Excel, depending on what you prefer.
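[Editor's sketch] The lookup step can also be done in a few lines of Python instead of a spreadsheet. This is a minimal sketch under assumptions: the hits are in the default 12-column BLAST tabular format (bit score in the last column), and the annotation table is tab-delimited with the gene identifier in the first column and the description in the second. The real column layout of the solgenomics.net files should be checked, and the function names are hypothetical.

```python
import csv


def top_hits(blast_tsv):
    """Return {query: (subject, bitscore)}, keeping the highest bit score per query."""
    best = {}
    for row in csv.reader(blast_tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def load_annotations(table, id_col=0, desc_col=1):
    """Read a tab-delimited annotation table into {gene_id: description}."""
    annotations = {}
    for row in csv.reader(table, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) > max(id_col, desc_col):
            annotations[row[id_col]] = row[desc_col]
    return annotations


def annotate(best, annotations, missing="NA"):
    """Attach the description of the top hit to each query transcript."""
    return {query: (subject, annotations.get(subject, missing))
            for query, (subject, _bits) in best.items()}
```

Both readers accept any open text file, for example `top_hits(open("hits.tsv"))`, where the file name is again hypothetical.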

This should get you started, but if you want to do it with more finesse, you may try AutoFACT that Nirav describes below. It looks very useful for what you are doing, but it would require more than a blast install. However you do it, you may find you want to compare results from blasts against multiple species. If you try this, there are unigene sequences for multiple related species, and I think there may be a combined sequence, too. You could use all of these in a large blastdb or set them up in individual databases and use AutoFACT to run the comparison. I haven’t used this myself, so I can’t guarantee how well it actually performs!
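[Editor's sketch] If the individual-database route is taken, the per-species results can be compared with a short script. The Python below is only an illustration and is not how AutoFACT works: it assumes the hits from each species database have already been reduced to (query, subject, bitscore) tuples, and the species labels and function names are hypothetical.

```python
def best_per_species(hits_by_species):
    """Build {query: {species: (subject, bitscore)}} with the best hit per species.

    hits_by_species: {species: iterable of (query, subject, bitscore)}
    """
    table = {}
    for species, hits in hits_by_species.items():
        for query, subject, bitscore in hits:
            per_species = table.setdefault(query, {})
            current = per_species.get(species)
            if current is None or bitscore > current[1]:
                per_species[species] = (subject, bitscore)
    return table


def best_species(table):
    """For each query, name the species whose best hit has the highest bit score."""
    return {query: max(per_species, key=lambda sp: per_species[sp][1])
            for query, per_species in table.items()}
```

Bit scores from databases of different sizes are comparable, which is why the sketch ranks on bit score rather than e-value.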

Feel free to ask if you need more guidance on any of this. I will be glad to help and to hear of your progress.

Roger Barthelson

On 7/14/11 1:12 PM, "Stephen Goff" <sgoff@iplantcollaborative.org> wrote:

I'd recommend copying Roger on this message string too.

On Jul 14, 2011, at 1:09 PM, Nirav Merchant wrote:

Hi Team (just cc-ing iPC folks),
This is a fairly common usage request in my facility at UA. We have used tools like Fatigo+ and AutoFACT (perl/commandline) http://www.biomedcentral.com/1471-2105/6/151 http://megasun.bch.umontreal.ca/Software/AutoFACT.htm for similar use.
I am sure we are close to having the foundational API job submission available for testing, but tools like these need database prep work and hence would require some setup by us before users can submit jobs.

It would be worthwhile to consider making an AutoFACT-style tool available in the DE for users to submit data to.

Last year I set up autoblast with Ranger, where you could build and submit data (it would construct the BLAST databases, run your query sequences against them, and return results using the queue system) using MPIblast, which scaled well. Maybe we need to consider something similar associated with our "Data services", i.e. keep an up-to-date collection of BLAST DBs for general-purpose use (we do that at UA), maybe with some integration with COGE?

I am sure Matt and Josh have more specific recommendations

Please let me know if I can be of assistance

Regards,
Nirav

On Thu, Jul 14, 2011 at 11:51 AM, Eric Lyons <ericlyons@email.arizona.edu> wrote:
Hi Stacey,

Generally, I believe we have technology that will help with your problem on annotating transcripts. While my expertise is in genome structure and evolution, there is a group in iPlant working actively on transcriptome assemblies and annotations that will be better informed for answers to your questions. I've cc'ed them on this email to see if they can provide some advice (Josh and Matt), along with a couple of other people in iPlant that may know of some additional resources. Please let me know if you need any additional help.

Best,
-eric

---------------------------------------------------
Eric Lyons, Ph.D.
Comparative Genomics
Researcher/Senior Scientific Developer
Bio5 Institute, iPlant Collaborative

Adjunct Faculty
Department of Plant Sciences
University of Arizona, Tucson

Phone: 520-626-5070
Email: ericlyons@email.arizona.edu
SkypeID: ericlyons

On Jul 14, 2011, at 10:49 AM, Stacey D. Smith wrote:

Hi Eric,

I heard from a friend at BSA about your iPlant presentation and the tools you all are developing. He suggested you might be building infrastructure to help with the sort of research we are doing, so I'm writing to tell you what we are trying to do. At the moment we are working to annotate a floral transcriptome we recently built de novo from 454 data for a non-model Solanaceae (Iochroma, to be specific). As I'm sure you know, the tomato genome has been released (http://solgenomics.net/) and thus should be a great tool to help us. It's about 25 million years diverged from Iochroma. Essentially we would like to blast each of our transcripts to tomato and pull the gene annotation and any GO terms for the top hit. We have been struggling to figure out how to do this. My friend (Ben Blackman) thought that perhaps iPlant would be putting genomes like tomato onto a server and would have built-in BLAST tools so that we could do the annotation through iPlant. Is that the case? The alternative (from what I can gather) is to download the appropriate files to build a blast-able database on a local server.
We look forward to hearing what is in the works at iPlant!

Best,

Stacey

Stacey D. Smith
314 Manter Hall
School of Biological Sciences
University of Nebraska
Lincoln, NE 68588-0118
phone: 402-472-6741
phone with voicemail: (402) 370-6749
email: sdsmith@unl.edu
website: https://sites.google.com/site/iochromas/

______________________________________________________
Roger Barthelson Ph.D.
Bioinformatics Analyst
iPlant Collaborative
BIO5 Institute, University of Arizona
Phone: 520-977-5249
Email: rogerab@email.arizona.edu
Web: http://web.me.com/rbarthelson/Roger_Barthelson_PhD
______________________________________________________
