SC_20100113

iPG2P Steering Committee Minutes
January 13, 2010; 1 to 6:30 pm PST
San Diego, CA

Present: Steve Goff, Steve Welch, Bernice Rogowitz, Matt Vaughn, Dan Kliebenstein, Ruth Grene, Greg Abram, Chris Myers, Ed Buckler, Doreen Ware, Jerry Lu, Damian Gessler, Sheldon McKay, Tom Brutnell, Karla Gendler, Chris Jordan, Dave Micklos, Jeff White, Sonya Lowry, Dan Stanzione (remote), Martha Narro (remote)

The meeting was convened at 1pm PST.

Agenda/Presentations:

1:00	Welcome and Goals of the Meeting	The Steves	All sessions in the Laguna Conf Rm unless otherwise noted
1:15	NextGen Sequencing WG report	Tom
2:00	Statistical Inference WG report	Dan K
2:45	Visual Analytics WG report	Ruth, Greg, Bernice
3:30	Break
4:00	Data Integration WG report; Summary of EC-US DB Meeting	Doreen, Chris J
4:45	Modeling Tools WG report (Jeff's Presentation, Chris's Presentation)	Chris M and Steve W
5:30	DNA Subway and EOT Tools	Dave
6:00	Wrap-Up and Discussion; ID Cross-cutting Breakout Needs	Matt/All
6:30	Adjourn

Summary:

Introductions
The meeting opened with participant introductions. Steve W and Steve G talked about the iPlant Board of Directors meeting at a high level. Steve Goff said he would go into more detail about the meeting but wanted to wait until iPlant management receives the official BoD recommendations. However, the general take-home message is that iPlant needs to start producing something tangible and useful now. iPlant is targeting the Year 3 meeting in May to demonstrate various prototypes. To do so, iPlant needs to know each working group’s priorities and which features/capabilities they would like delivered in the early versions of the Discovery Environments.

Report from NGS – Tom Brutnell
Tom Brutnell’s presentation can be found on the iPlant wiki. There was a suggestion to change the name from NextGen to Ultra High Throughput Sequencing, as this is the terminology being adopted by the community. Brutnell stated that many people are excited and want to use the NextGen pipeline. There were questions as to which sequencing platforms the pipeline would support. The focus will be on Illumina, 454, and SOLiD, but they all output essentially the same data and should work in the pipeline. Image files produced by the sequencers tend to be discarded because they take up so much space, and software translates the image files to text files.

The NGS 1.0 Pipeline design is complete and is intended to support a modular framework. The input will be a FASTQ formatted file and the output of the pipeline will be a SAM/BAM formatted file. Moving forward, it will be necessary to determine what is important to the G2P community and not just the NextGen community in defining future releases.

MWV 02-01-2010: I want to correct the statement of output here: One intermediate output of both workflows will be the universal output format, SAM, but a defined part of the two UHTS pipelines is that these alignments will be further processed to yield tabular, textual files containing genome variants and transcript quantification values. Users generally cannot make direct use of SAM alignments.
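The data flow described above (FASTQ in, SAM as an intermediate, tabular text out) can be illustrated with a minimal Python sketch. This is not the NGS 1.0 pipeline itself: the function names are hypothetical, the alignment step is omitted entirely, and the per-reference read count is only a crude stand-in for the variant and transcript quantification tables mentioned in the correction.

```python
from collections import Counter


def parse_fastq(text):
    """Yield (read_id, sequence, quality) from 4-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        yield header[1:].split()[0], seq, qual


def count_mapped_reads(sam_text):
    """Count mapped alignments per reference from SAM text (hypothetical helper)."""
    counts = Counter()
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & 0x4 or rname == "*":
            continue  # flag bit 0x4 marks an unmapped read
        counts[rname] += 1
    return counts


def to_table(counts):
    """Render counts as tab-separated text, the kind of file users can open directly."""
    rows = ["reference\tmapped_reads"]
    rows += [f"{ref}\t{n}" for ref, n in sorted(counts.items())]
    return "\n".join(rows)
```

A real release would place an aligner between the two steps; the sketch only shows why a tabular summary is easier for end users to consume than raw SAM records.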

It’s important to define what DNA sequence data will be integrated on. Suggestions were gene name or location on the genome, but both of these change over time, which will create versioning challenges.

The NGS group would like to interact with the 1KP project but will need help to set up a formal interaction. Stanzione said that as part of the requirements for iPlant hosting the data, it will need to be made available to the public.

The NGS workflow has been presented to Sonya Lowry and the Core Software Development team. For the Core Software Development team, the high level NGS workflow is critical, but details of the NGS workflow are also needed to finalize software requirements. Core also needs a description of the scope of individual releases to prioritize development efforts.

AI: A discussion is needed to establish communication mechanisms between Core and the rest of the group to ensure transparency.

Report from StatInf Group: Dan Kliebenstein
Dan Kliebenstein’s presentation can be found on the wiki. He reported that the StatInf group would like to increase the performance of specific statistical methods 100-fold. Initially, the group considered including interval mapping in their methods but decided that, in the future, markers won’t be limiting, so this has been dropped from scope.

The first method that they have selected to work on is GLM. Three different implementations are being considered: traditional parallelization, GPU, and FPGA. Liya Wang is working on the traditional parallelization, with help from TACC. A group at UofA is working on the GPU implementation, and Convey, Inc is working on the FPGA implementation. Initially, GLM analysis will only be done with single markers but in the future, the StatInf group is hoping to expand to include two or more markers independently and/or an interaction model. The desire was expressed that if data is compressed in any of the analyses, people should be informed of the statistical bias.
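Single-marker analysis lends itself to parallelization because each marker is fit independently of the others. The Python sketch below shows that structure using a simple linear regression of phenotype on genotype as the simplest special case of a GLM. It is an illustration only, not the StatInf group’s code: the function names and the `mapper` parameter are assumptions, and a parallel map (for example, the `map` method of a `concurrent.futures` executor) could be passed in place of the built-in `map`.

```python
from functools import partial


def fit_single_marker(genotypes, phenotypes):
    """Least-squares fit of phenotype = a + b * genotype; returns (slope, r_squared)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((p - mean_p) ** 2 for p in phenotypes)
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # monomorphic marker or constant phenotype: nothing to fit
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def scan_markers(markers, phenotypes, mapper=map):
    """Fit every marker independently; `mapper` may be swapped for a parallel map."""
    names = list(markers)
    fit = partial(fit_single_marker, phenotypes=phenotypes)
    results = mapper(fit, (markers[name] for name in names))
    return dict(zip(names, results))
```

Because no fit depends on another, the work divides cleanly across CPU cores, GPU threads, or FPGA units; a two-marker or interaction model would enlarge each task but keep the tasks independent.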

Besides GLM methods, the StatInf group is looking at Bayesian approaches but there is concern due to the limiting line numbers. Buckler commented that priors would help, especially when you bring in pathways for analysis.

Brutnell asked about metadata and how the group will deal with experimental data, annotating it, and preserving it through the analysis. Jordan stated that iPlant and the Data Integration group would need to come up with minimal standards to be used by the community. Once there is agreement on some format of experimental data, this would then represent a third database (i.e., recording data that goes into experiments: in a growth chamber, what is the temperature). The DI group will handle this provenance. A controlled vocabulary and the minimum standards will need to be defined. Gessler spoke of his work with the MIAME standard and cautioned that it took five years to reach agreement.

As stated earlier, the first deliverable for the group will be a parallel implementation of GLM. By April, Liya should have a first pass of porting the code; by August, the software should be done. This will be offered as a service through the web, but allocation and priority will be given to those who have worked in the working groups.

Based on discussions with Lauren McIntyre (editor of Genetics), Buckler suggested an XQTL prize. The idea would be to take the NAM datasets and set up a “contest” or open call to find the best algorithm or novel approaches to analyze the datasets. iPlant could co-sponsor the prize by providing CPUs, power, and technical support. The community wants GLM analysis as a standard but there could be better or different approaches to analyzing datasets.

Visual Analytics – Ruth Grene, Bernice Rogowitz, Greg Abram
Bernice Rogowitz, Ruth Grene, and Greg Abram’s presentation can be found on the iPlant wiki. All three stressed that with the workflows that have been created by the group, people need to make decisions quickly regarding what software to recommend. Myers stated that the big visualization issue will be usability. Ultimately, the visualization tools should have a medium-power user interface. Welch suggested that something like the DNA Subway visualization notion could be used as a good example of ease of use. With the tools that are developed, visualization and analysis should happen in the same environment.

Data Integration
Doreen Ware – EC/US summary
Doreen Ware’s presentation of a summary of the EC/US database meeting that she, Steve G, and Dan S all attended can be found on the iPlant wiki. The objective of the meeting was to develop high-level recommendations for proposals, as funding agencies cannot possibly fund every single model organism database. There was discussion about coming up with an evaluation and metric to warrant stewardships for archiving. There was no discussion of the difficulty of standardization, but Ware offered that the GO community agreed to disagree and was able to move on. Goff commented that most people still do not understand iPlant and what it is doing. There were a lot of discussions about the same things that iPlant is doing that were not related to iPlant.

Chris Jordan – DI group
Chris Jordan summarized a white paper that he had sent around earlier that day detailing the progress of the group. The DI group is different than the other working groups as it is based on the needs of the other groups. The DI group is beginning to look at the issues of provenance, metadata, and archiving. Jerry Lu is working on a survey of data sources to get a more concrete description of data access, data formats, etc. iPlant does not want to host accessible data but instead will act as a third-party gateway, providing the capability to pull data into the iPlant environment and use it. Ware added that for this project to be successful, people need to be able to propagate data back to the knowledgebases. There has to be a programmatic way to get data, bring it in, and also get the results back to the knowledgebases for stewardship. These present provenance issues, and volunteers are needed to help with this.

Modeling
Jeff White -- see slides
Chris Myers -- see slides
Steve Welch
The group has made an offer to a postdoc who has accepted. Ann Stapleton will work in Manhattan in Feb on two different models.

EOT – Dave Micklos
Dave Micklos’ presentation can be found on the iPlant wiki. He introduced DNA Subway, suggesting that it could be considered a mini-DE. The authoring interface is built with scripts, but he does not know in what language. The DNA Subway is a good example of data moving back and forth. Narro asked the group to think about what it would take to move the tools being developed for the research community to the EOT community.

Wrap-up and Breakout Discussion Identification:
The deliverable for this meeting should be a working plan with dates and deliverables for the overall project.

It was suggested that the two breakout sessions tomorrow should be defining NGS and DI needs and the UI/top level user appearance.