
...

During the Steering Committee meeting in January, the group started to talk about getting people to work with real data and suggested the NAM dataset would be a good set to start with. White's presentation relating to his work with this dataset can be found here.

...

White stated that the above process is fairly typical of what he goes through when ramping up to model a new species. Data has to be imported, checked, and formatted. As an example of checking, with weather data he uses simple checks such as making sure the daily minimum temperature is lower than the maximum. The DSSAT package includes a tool to check weather data, but White said it is easier for him to write a SAS program and start merging spreadsheets together. Once checked and formatted, he has a model-ready data set. However, DSSAT has the limitation of allowing only 999 treatments, and this dataset has 55k treatments. White wrote a Python script that simulates a user working at the terminal, but he would want to run all 11 possible experiments as a single model run, which would be important for optimizing the program and would also help with processing.
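
A minimal sketch of the kind of range check White describes, assuming a CSV weather file with hypothetical TMIN/TMAX column names (the real dataset's headers may differ):

    import csv

    def check_weather(path):
        """Flag rows where the daily minimum temperature is not below the maximum.

        The column names (TMIN, TMAX) are hypothetical; a real file would use
        whatever headers the source dataset defines.
        """
        bad_rows = []
        with open(path, newline="") as f:
            for line_no, row in enumerate(csv.DictReader(f), start=2):  # row 1 is the header
                tmin, tmax = float(row["TMIN"]), float(row["TMAX"])
                if tmin >= tmax:
                    bad_rows.append((line_no, tmin, tmax))
        return bad_rows

    for line_no, tmin, tmax in check_weather("weather.csv"):
        print(f"line {line_no}: TMIN={tmin} is not below TMAX={tmax}")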

To White's knowledge, this type of workflow has been done only 3-4 times in the crop modeling community, and it has never been done with QTL data and DSSAT. The DSSAT team is thinking about going open-source, and Vaughn suggested that iPlant could support forward-looking parts of DSSAT.

...

Chris Myers presented a shortened version of his presentation from the Steering Committee meeting, which can be found here. He has code called "SloppyCell". He suggested that it would be nice to have something like MapMan for dynamic visualization. With PathVisio and GPML, the topology is embedded in the model; they are not really modeling languages but rather layout languages. Myers asked whether there are ways to automate layouts of pathways: is there a generic approach to developing visualization tools on top of modeling? He also suggested that perhaps something missing from the workflow Welch created was model construction.
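
As a rough illustration of automating layout from topology alone, a force-directed layout (here via the networkx library, with an invented toy pathway) positions nodes without the hand-drawn coordinates that a GPML file would store; this is only one possible approach, not anything the group settled on:

    import networkx as nx
    import matplotlib.pyplot as plt

    # Toy pathway topology; the node names are made up for illustration.
    pathway = nx.DiGraph()
    pathway.add_edges_from([
        ("gene_A", "enzyme_A"),
        ("enzyme_A", "metabolite_1"),
        ("metabolite_1", "metabolite_2"),
        ("gene_B", "enzyme_B"),
        ("enzyme_B", "metabolite_2"),
    ])

    # A force-directed layout computes positions from topology alone,
    # so no embedded layout information is required.
    positions = nx.spring_layout(pathway, seed=42)
    nx.draw_networkx(pathway, positions, node_size=800, font_size=8)
    plt.axis("off")
    plt.show()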

...

Welch presented on the work/research he has been doing on modeling languages and tools that exist for the modeling community. He suggested that perhaps SBML isn't expressive enough to represent ecophysiological models. Myers said that you could take the SBML model, leave the environment out of it, and still use the model. Other tools Welch added to the list include Systems Biology Workbench, JSim, and OpenMI.

Discussion
After the presentations, further discussion ensued. Stanzione asked what the right interface is for the most important set of users, and who that set of users is. Myers responded that he doesn't use interfaces, as they can be decoupled using SBML. For students who are just beginning to understand modeling, a GUI is nice. He suggested that perhaps the focus should be on standardizing tools. Welch added that, to begin with, it is worth spending time to build a GUI. iPlant wants to make sure to reach practical scientists at the cutting edge first, or those at the core of the GC who are attacking the G2P problem. However, there is a wide swath of people who will not have the same skill level. As Myers pointed out, different worlds of modeling are represented in the room, and in a first pass we can't accommodate all of them. The StatInf group agreed on what the obvious problem was and simply did that; in this working group, there is no consensus on what the problem is. Stanzione asked if there is a common agenda; if not, can we provide a CI and tools that will connect tools and make them run faster?

Welch suggested that the group focus on 1) parameter estimation, 2) sensitivity analysis (change an input parameter and see how it changes the output; what happens as you change multiple parameters), and 3) interfacing with the QTL mapping activity. White wondered whether the group is overweighting the QTL analysis; perhaps this is not something people would be doing all the time. Welch added that QTLs are not mechanistic, and there is an open scientific question as to whether QTL models can provide the level of predictability needed in non-constant environments. He suggested that a better approach might be to attack from both ends of the spectrum, and that the problem may not be as intractable as the group thinks. If the right set of tools is built, a solution to the G2P problem is possible within the time frame of iPlant. However, the key thing is to look at the synthesis of the different kinds of things people are doing now. Myers added that the group hasn't really addressed model construction and hasn't really defined what modeling is or what people think it is.

Parameter Estimation (inc. estimation post-processing such as ANOVA runs)
Welch opened discussion by asking if there is agreement that parameter estimation is part of the modeling process and that it is compute intensive. There were no arguments against that statement. Myers asked whether most of the crop models are ODE-based or stochastic, because if they are stochastic, parameter estimation becomes horrific. Myers' approach is all ODE, but once everything is written out, no further specification is needed in SBML. His group hasn't figured out how to use the tools in stochastic situations, but there are people who have.
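
To make the ODE setting concrete, here is a minimal parameter-estimation sketch using SciPy; the logistic growth equation and the synthetic "observations" are invented stand-ins for a real crop model and field data:

    import numpy as np
    from scipy.integrate import solve_ivp
    from scipy.optimize import least_squares

    def logistic(t, y, r, k):
        # Toy ODE standing in for a crop growth model.
        return r * y * (1 - y / k)

    t_obs = np.linspace(0, 10, 20)
    # Synthetic "observations" generated from known parameters, for illustration only.
    true = solve_ivp(logistic, (0, 10), [0.1], t_eval=t_obs, args=(0.8, 5.0)).y[0]
    rng = np.random.default_rng(0)
    y_obs = true + rng.normal(scale=0.1, size=true.size)

    def residuals(theta):
        r, k = theta
        sim = solve_ivp(logistic, (0, 10), [0.1], t_eval=t_obs, args=(r, k)).y[0]
        return sim - y_obs

    # Bounded least squares recovers the parameters from the noisy data.
    fit = least_squares(residuals, x0=[0.5, 3.0], bounds=([0, 0.1], [5, 20]))
    print("estimated r, k:", fit.x)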

Welch stated that there is agreement that the focus of parameter estimation should be on ODE-type models. He suggested that an exemplar algorithm be included, but asked with what characteristics. As an example, would the group want to do the grid search that is embedded in GenCalc (though he does not advocate this), or perhaps take a population-based approach? White stated that he has no preference, as iPlant is not in the business of algorithm development but of CI development. With GenCalc, there are things that are very useful that you don't often think of: hand-editing the parameter file, backing up the parameter file, automatically updating the parameter file, etc.

Inputs should be the same and outputs should be the same. For his grid search algorithms, he specifies the maximum number of iterations, how much to reduce the size of the grid in following iterations, the initial parameters, and the upper and lower bounds. From this, he suggested that the group flesh out a standard set of inputs. Myers stated that when looking for optima, there is the objective function and the problem of the initial guess. For algorithms, he uses the Nelder-Mead simplex method and has found that gradients are typically most useful. Optimization is fickle: you don't know whether you are failing to get the solution because you are stuck in parameter space or because the model does not describe the data. Welch added that what he and his collaborators have found to work well is a top-level algorithm to explore parameter space with something nested to explore local opportunities. He has been successful using the Nelder-Mead algorithm for local search and also using it alone. His suggestion would be to do a combination as described above.
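
A minimal sketch of the combination Welch describes, with random multi-start standing in for the top-level exploration and Nelder-Mead as the nested local search; the objective function is a toy stand-in, and note that plain Nelder-Mead does not itself enforce the bounds (they are only used to scatter the starting points):

    import numpy as np
    from scipy.optimize import minimize

    def objective(theta):
        # Toy objective standing in for a model-vs-data misfit.
        return np.sum((theta - np.array([1.5, -0.7])) ** 2) + 0.1 * np.sin(5 * theta[0])

    lower, upper = np.array([-5.0, -5.0]), np.array([5.0, 5.0])
    rng = np.random.default_rng(1)

    # Top level: scatter random starts across the bounded parameter space.
    starts = rng.uniform(lower, upper, size=(20, 2))

    # Nested level: Nelder-Mead refinement from each start.
    results = [minimize(objective, s, method="Nelder-Mead") for s in starts]
    best = min(results, key=lambda r: r.fun)
    print("best parameters:", best.x, "objective:", best.fun)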

...

Myers stated that abstractions for optimizations are good: with an abstraction, once your data is put in the iPlant framework, you can use the CI. Welch commented that, more often than not, calculating the objective function is the expensive part. He has found that he runs one machine for the optimization and uses his others for evaluating the objective function. His biggest run to date took 3 days on 200 processors, done 4 times. Myers added that generating MC parameters is a long process; you first have to monitor the correlation time, and this can go on for several days. Stanzione commented that the group wants to enable runs on many processors, but then there is the issue of managing the multiple runs. Myers said that with population-based optimization he would like to have the management all hidden, as you don't need to be looking at it explicitly. Welch said that the data management should be generic. Myers added that it would be nice if the launcher were embedded in the code. Stanzione said that it would be possible either to take launch parameters from the command line or to make them internal to the code.
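
One way to picture the split Welch describes (one process driving the optimizer, others evaluating the objective) is a worker pool; this sketch uses Python's multiprocessing as a stand-in for a cluster launcher, with a toy objective function:

    import numpy as np
    from multiprocessing import Pool

    def objective(theta):
        # Stand-in for an expensive model run.
        return float(np.sum((np.asarray(theta) - 2.0) ** 2))

    def evaluate_population(population, workers=4):
        # The management of worker processes is hidden behind Pool,
        # echoing the hidden run management Myers asks for.
        with Pool(workers) as pool:
            return pool.map(objective, population)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        population = rng.normal(size=(16, 3)).tolist()
        scores = evaluate_population(population)
        print("best score:", min(scores))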

...

Stanzione stated that the solution is three-tiered. With the API, users can run their own code, and it will only crash their own stuff. If a component is to be registered, it will go through automatic checks to make sure it doesn't crash the framework. Tier 2 will be community rated. Components that are highly rated will move to tier 3, the gold standard, complete with validation testing.

...

Sensitivity / Response Surface Analysis
Welch opened the discussion about sensitivity analysis, stating that it might be done either before or after parameter estimation. In White's recent work, he did estimation runs and then ran a series of ANOVAs. White added that his analysis has been done several times, but this is not really seen in the modeling community. Someone cynical would ask whether the model does better than generic variation. He asked questions such as where he should look to improve overall simulations and, with populations, where further research is needed. White's real point is that once basic fitting is finished, there are basic analyses that need to be done. In any of the workflows, there is a need for post-calibration analysis.

Welch summarized, saying that there are various categories such as observed vs. predicted comparisons, decomposing sources of variation in the results, and cross-validation processes that will need a fairly large data set. White added that he did bootstrap cross-validation to check the model. In general, people either have independent data, or the opportunity for cross-validation, or they do neither because they don't have enough data. White suggested that perhaps one should first do sensitivity analysis to check the model, then do parameter estimation (calibrate and validate until confident), and then test the fit with all the observations. However, the robustness of parameter estimates is different from the robustness of predictions.
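
A minimal sketch of bootstrap model checking in the spirit of what White mentions: refit on resampled data, then score predictions on the out-of-bag points. The linear model and synthetic data are stand-ins for a real crop-model calibration:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 40)
    y = 2.0 * x + rng.normal(scale=0.2, size=x.size)  # synthetic observations

    def fit(xs, ys):
        # Least-squares slope/intercept as a stand-in for model calibration.
        return np.polyfit(xs, ys, 1)

    errors = []
    for _ in range(200):
        idx = rng.integers(0, x.size, size=x.size)   # bootstrap resample (with replacement)
        out = np.setdiff1d(np.arange(x.size), idx)   # out-of-bag points for validation
        if out.size == 0:
            continue
        coeffs = fit(x[idx], y[idx])
        pred = np.polyval(coeffs, x[out])
        errors.append(np.sqrt(np.mean((pred - y[out]) ** 2)))

    print("mean out-of-bag RMSE:", np.mean(errors))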

...

When models are in a sufficient format, derivatives can be taken symbolically; when they aren't, a user should at least be able to make scatter plots of the response surface. Welch asked if MCMC is a possibility for estimating the distribution of parameters. Myers said that MC is good at picking numbers in an area. He also uses a Hessian-based method to characterize local optima. Welch added that bootstrapping is different from MCMC and is more reflective of the data itself. Stanzione stated that PECOS has spent a lot of money on DAKOTA, a suite of analysis tools, and it might be worth looking into incorporating one of its tools into the environment.
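
For the MCMC option Welch raises, a bare-bones Metropolis sampler illustrates how a parameter distribution could be estimated; the Gaussian likelihood, flat prior, single parameter, and synthetic data are assumptions for illustration only:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=1.2, scale=0.5, size=50)  # synthetic observations

    def log_posterior(mu):
        # Flat prior; Gaussian likelihood with known sigma = 0.5.
        return -0.5 * np.sum((data - mu) ** 2) / 0.5**2

    samples, mu = [], 0.0
    for _ in range(5000):
        proposal = mu + rng.normal(scale=0.2)
        # Metropolis accept/reject step.
        if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(mu):
            mu = proposal
        samples.append(mu)

    burned = np.array(samples[1000:])  # drop burn-in before summarizing
    print("posterior mean:", burned.mean(), "posterior std:", burned.std())

As Myers notes, the practical cost is in the monitoring: the chain's correlation time has to be checked before the samples can be trusted, which is what stretches such runs over days.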

...

Myers presented some of the visualizations he has used in the past. If the clustering is curved, you'll see variance that looks much larger than it is. He suggested that you could do projections, perhaps based on linear combinations of parameters. One could explore parameter spaces and overlay local structure on them. Welch suggested multidimensional scaling.
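
One simple form of the projection Myers suggests is to project a parameter ensemble onto its leading principal components (linear combinations chosen by variance); the curved synthetic ensemble here is invented to echo his point about curved clusters inflating apparent variance:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    # Synthetic ensemble: correlated parameters with curved structure.
    t = rng.uniform(-1, 1, size=300)
    ensemble = np.column_stack([
        t,
        t**2 + 0.05 * rng.normal(size=t.size),
        0.5 * t + 0.05 * rng.normal(size=t.size),
    ])

    # Principal components via SVD: the linear combinations capturing most variance.
    centered = ensemble - ensemble.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:2].T

    plt.scatter(proj[:, 0], proj[:, 1], s=8)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title("Parameter ensemble projected onto leading components")
    plt.show()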

...

Stanzione suggested that Welch write an English description of what the tools represented on Welch's slide do, for further investigation by iPlant.