Data Assembly workflow

Workflow produced by Sharon Wei with information from Stephen Smith and Gordon Burleigh

https://pods.iplantcollaborative.org/wiki/download/attachments/3871315/DA-workflow-v1.ppt

Stephen Smith's pipeline in more detail

  1. Download the NCBI green plants data (the GBPLN files) from the FTP site.
  2. Load the sequences into a MySQL database using BioSQL.
  3. Build a configuration file targeted at a specific clade. For each gene of interest (for example rbcL, which does not have many paralogs), collect several sequences from representative species to use as baits.
  4. BLASTN the GenBank sequences against these bait sequences and pick the best hit for each species in the clade.
  5. Do a multiple alignment using MAFFT.
  6. If the alignment is too crappy, split into the lower level of the taxonomy tree and do an alignment for each lower-level clade; iterate until the alignments look good.
  7. Join these subset alignments together with a profile alignment using MUSCLE.
  8. Repeat the process for the other genes G1-Gn.
  9. Concatenate the gene alignments (with phynexus? phyutility?).
  10. Feed the result to RAxML.
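The "pick the best hit for each species" step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code; the tuple layout (species, accession, bitscore) is an assumption for the sketch.

```python
def best_hit_per_species(blast_hits):
    """Given BLAST hits as (species, accession, bitscore) tuples,
    keep only the highest-scoring hit for each species."""
    best = {}
    for species, accession, bitscore in blast_hits:
        if species not in best or bitscore > best[species][1]:
            best[species] = (accession, bitscore)
    return best

hits = [
    ("Arabidopsis thaliana", "AB000001", 950.0),
    ("Arabidopsis thaliana", "AB000002", 980.0),
    ("Oryza sativa", "CD000003", 870.0),
]
print(best_hit_per_species(hits))
```

In the real pipeline the scores would come from parsed BLASTN output rather than a hard-coded list.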

The GenBank data refers to the gbpln*.seq.gz files at ftp://ftp.ncbi.nih.gov/genbank/.
His pipeline is fully functional and under active development and refinement.
He also has an automated version that marches through the plant tree and makes trees for each of those clades.

Input Alignment & Input Tree Formats for RAxML

  • The input alignment format of RAxML is relaxed interleaved or sequential PHYLIP. “Relaxed” means that sequence names can be of variable length, from 1 up to 256 characters. If you need longer taxon names you can adapt the constant #define nmlngth 256 in file axml.h appropriately. Moreover, RAxML should be less sensitive to the formatting (tabs, insets, etc.) of interleaved PHYLIP files.
  • The input tree format is Newick (see http://evolution.genetics.washington.edu/phylip/newicktree.html). RAxML input trees need not be comprehensive, i.e., they need not contain all taxa.
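For reference, a minimal sequential relaxed PHYLIP alignment looks like this (the taxon names and sequences below are made up for illustration; the first line gives the number of taxa and the alignment length):

```
4 12
Arabidopsis_thaliana  ATGGCTTACCCA
Oryza_sativa          ATGGCGTACCCG
Zea_mays              ATGGCATACCCA
Pinus_taeda           ATGGCTTATCCA
```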

There is a model file for the genes along with the phylip file.

The model file is the file that RAxML uses to denote partitions in the data matrix. It would look something like

DNA, nameofpartition = 1-1000
DNA, nameofpartition2 = 1001-2000

This would be for two partitions where the phylip file would be the result of concatenating two gene regions together (each with 1000 base pairs).
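A partition file in this format is also easy to handle programmatically. The following is a small sketch of a parser for it; the function name and return structure are illustrative, not part of RAxML:

```python
def parse_partition_file(text):
    """Parse RAxML-style partition lines such as
    'DNA, nameofpartition = 1-1000' into
    (datatype, name, start, end) tuples."""
    partitions = []
    for line in text.strip().splitlines():
        datatype, rest = line.split(",", 1)
        name, coords = rest.split("=")
        start, end = coords.strip().split("-")
        partitions.append((datatype.strip(), name.strip(),
                           int(start), int(end)))
    return partitions

model = """DNA, nameofpartition = 1-1000
DNA, nameofpartition2 = 1001-2000"""
print(parse_partition_file(model))
```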

Major modules might be developed in DE:

  1. modules to retrieve/sync data from data sources (currently from GenBank only)
  2. modules to query the working set, run BLAST, and prepare homologous clusters (e.g., all-vs-all BLAST vs. baited BLAST)
  3. modules to do MSA/QC/refinement (e.g., MAFFT; Kalign suggested by Nirav)

Challenges

  1. Building a central database synchronized with the GenBank plant sequences, and deciding how to mark bad sequences so that they are not propagated
  2. Multiple sequence alignment (involving breakdown into lower-level clades and manual inspection)
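One simple, automatable proxy for "the alignment is too crappy" in the breakdown step is the fraction of gap characters in the alignment. A minimal sketch (the 0.5 threshold is an arbitrary illustration, not a value from the pipeline):

```python
def gap_fraction(alignment):
    """Fraction of '-' characters across all rows of an alignment
    (a list of equal-length aligned sequences)."""
    total = sum(len(seq) for seq in alignment)
    gaps = sum(seq.count("-") for seq in alignment)
    return gaps / total if total else 0.0

def needs_split(alignment, threshold=0.5):
    """Flag an alignment for breakdown into lower-level clades."""
    return gap_fraction(alignment) > threshold

aln = ["ATG---CC", "ATGGCTCC", "A-----CC"]
print(gap_fraction(aln))  # 8 gaps out of 24 characters
```

In practice one would combine this with manual inspection, as the challenge above notes.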

Comments on the workflow

From Nirav Merchant

My group has tried running some large MSAs on TeraGrid resources (not at TACC) and we ran into limits on the maximum allowed RAM.
We started with MUSCLE and eventually settled on Kalign. Susan Miller from my group did most of the work and will be happy to assist with this task if needed.

We also looked at using Pegasus (and a scaled-down version of Condor DAGMan) to run such workflows on a regular basis, so that we can deal with exceptions without having to rerun the complete workflow.

From Casey Dunn

  • Calculating the pairwise sequence similarities and generating clusters from these results will quickly become a bottleneck. Baited blasts are a good way to simplify the problem, but there are a few potential issues. First, they cannot pick up taxon-specific genes that are not included in the bait set. This may or may not be a problem, depending on how narrowly we define our objectives. Second, anecdotally I can say that the clustering gives better results when all possible comparisons are made. I suggest that we look into some of the other tools that are now available for these tasks (e.g., http://www.drive5.com/uclust/). There has been a lot of comp sci work on partitioning datasets, making comparisons and clusters on the subsets, and then puzzling a supergraph together; it could be interesting to look more into that. Another possibility would be a hybrid approach: a one-time or infrequent all-vs-all comparison, isolation of exemplar sets from the results, and use of the exemplars as bait on more frequent sequence-addition events.
  • The "Define ortholog sets" step is absolutely critical and can be implemented in a variety of ways. I would advocate building a tree for each of the gene sets that results from the clustering and doing this evaluation in an explicitly phylogenetic context. The paper that Alexis and I did last summer tackled this, and presented a method that can decompose genes with deep gene duplications into multiple orthologous subsets. There have been multiple attempts to resolve orthology/paralogy based on the pairwise similarities alone, but phenetic approaches probably aren't as effective as looking at the trees directly.
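The hybrid approach above needs a way to isolate exemplars from each cluster. One naive choice is to take the longest sequence per cluster; a sketch under that assumption (the data layout is made up for illustration):

```python
def pick_exemplars(clusters):
    """Given clusters as {cluster_id: [(seq_id, sequence), ...]},
    return one exemplar per cluster -- here, simply the member
    with the longest sequence."""
    exemplars = {}
    for cluster_id, members in clusters.items():
        exemplars[cluster_id] = max(members, key=lambda m: len(m[1]))
    return exemplars

clusters = {
    "rbcL": [("sp1", "ATGGCT"), ("sp2", "ATGGCTTAC")],
    "matK": [("sp3", "ATGGG"), ("sp4", "AT")],
}
print(pick_exemplars(clusters))
```

Smarter exemplar choices (e.g., cluster centroids as uclust computes them) would likely do better; the point is only that the exemplar set becomes the bait set for subsequent incremental updates.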

From Stephen Smith

Just to follow up on what Casey mentioned: basically, I agree. The baited blast technique is meant for when the gene regions of interest are known and the diversity of the gene region is easily characterized. This obviously becomes a problem when either of these things is not the case. That said, I think it is a good technique for one approach toward our goal. I think it would be great to explore some other options (e.g., uclust and hybrid approaches). For example, I have had some great luck with hybrid approaches like this in another context. I am in favor of a diversity of ways of generating these datasets.

In general, I am seconding Dunn's suggestions and think they are good additions to the existing ways of getting these datasets.

The ability to have one database that is constantly updated with the latest genbank sequences, but also the ability to associate meta information in order to mask potentially bad sequences would be very helpful.
For masking bad sequences I just mean that genbank has a lot of junk.
One way to correct that: when the data is presented to the outside world in some nice way, provide a feedback mechanism where users can flag potential problems. Flags would be sent to a list that at some point could be curated (by any number of solutions). Once curated, the curator would have decided whether it really was a bad sequence, and if so mask it so we don't grab it again. It would still be in the database, but with metadata saying (BAD SEQUENCE) so we could avoid it. The sequence could also simply be misidentified, in which case the curator(s) could correctly ID it and fix the data in the database. This is why the central storage is so important: so we don't all re-retrieve bad things.
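A minimal sketch of such masking, assuming a simple in-memory record store (the field names and statuses are illustrative, not an existing schema):

```python
def mask_sequence(db, accession, reason):
    """Flag a record as bad without deleting it, so the raw data
    and the curator's decision are both preserved."""
    db[accession]["status"] = "BAD SEQUENCE"
    db[accession]["mask_reason"] = reason

def usable_sequences(db):
    """Everything not masked by a curator."""
    return {acc: rec for acc, rec in db.items()
            if rec.get("status") != "BAD SEQUENCE"}

db = {
    "AB000001": {"species": "Arabidopsis thaliana", "seq": "ATGGCT"},
    "CD000002": {"species": "Oryza sativa", "seq": "ATGGCG"},
}
mask_sequence(db, "CD000002", "misidentified species")
print(sorted(usable_sequences(db)))
```

In a real deployment this would be a status column plus an audit trail in the central database, but the principle is the same: masked records stay queryable, and downstream retrieval filters on the flag.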