Optimization
Name: Optimization
Start date: 9/1/2011
Target Release:
Status: In Planning
Lead: Kurt Michels (kamichels@iplantcollaborative.org)
Background
1) Many packages in comparative methods for phylogenetics use numerical optimization to estimate parameter values and likelihood scores.
2) The general sense in our community is that numerical optimization in R sometimes does not work well.
2.i) One anecdote in support of this: the ace function in the widely used R package ape estimates ancestral states, and in some cases the values it returned depended strongly on the starting value. Naim and Jeremy found fixes for this, and the problem has been corrected in the most recent version of ape.
2.ii) Another anecdote in support of this: Barb and I found that the popular package geiger places bounds on many of its parameter values. For some of its continuous models, run on its single example dataset, the estimates it returned sat at the bounds. This indicates that the bounds affected the search and may be preventing discovery of the global optimum. We then hacked geiger's fitContinuous function (thus "fitContinuous.hacked") to use the R function nlm() rather than optim() for optimization. nlm() ran without bounds and its results looked sound: they matched optim() in the regions where optim() seemed to work well, and they were reasonable in the regions where optim() seemed to fail.
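The starting-value dependence described in 2.i can be checked for by rerunning the same optimization from several starting points and comparing the results. The sketch below is only an illustration under stated assumptions: toy_objective is a made-up one-parameter function with two local minima, not the likelihood used by ace, and the starting values are arbitrary.

```r
# Toy objective with two local minima (hypothetical stand-in,
# NOT the likelihood that ape's ace() optimizes).
toy_objective <- function(p) (p^2 - 1)^2 + 0.3 * p

# Arbitrary starting values spread across the parameter range.
starts <- c(-2, -1, 0.5, 1, 2)

# Run the same optimizer from every starting value.
fits <- lapply(starts, function(s) {
  optim(par = s, fn = toy_objective, method = "BFGS")
})

# Collect the estimate and objective value reached from each start.
summary_table <- data.frame(
  start    = starts,
  estimate = sapply(fits, function(f) f$par),
  value    = sapply(fits, function(f) f$value)
)
print(summary_table)

# A large spread in estimates signals starting-value dependence.
spread <- diff(range(summary_table$estimate))

# Keep the run that reached the lowest objective value.
best <- summary_table[which.min(summary_table$value), ]
```

If the estimates differ noticeably between starts, the result reported from any single run should not be trusted without further checks.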
Worryingly, these anecdotes do not come from obscure, rarely used functions but from ones that could supply the main results of a paper. So, what we want overall is to be able to say to the R comparative methods community: "to do optimization well, you should do X and not Y or Z". For example, our hacking of geiger suggests that nlm() works better than optim(), so perhaps we could tell people, "use nlm(), not optim()". Similarly, for parameters that are naturally constrained to be positive, perhaps we could say, "optimize ln(param) and then transform it back to the natural scale of that parameter, rather than applying a constrained search". The recommendations for our domain might differ from those for general optimization in R, since we often have to deal with very small likelihoods. But all we have now are two anecdotes, which is not enough to make recommendations safely.
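The ln(param) suggestion can be sketched as follows. This is a minimal illustration under stated assumptions, not geiger's code: the data are simulated normal draws whose variance stands in for a positive rate parameter, and the names negloglik, negloglik_log, and sigma2_hat are made up for the example.

```r
# Simulated data with a known variance (a stand-in for a positive
# rate parameter; this is not a real phylogenetic likelihood).
set.seed(1)
true_sigma2 <- 0.5
x <- rnorm(200, mean = 0, sd = sqrt(true_sigma2))

# Negative log-likelihood on the natural scale; only valid for sigma2 > 0.
negloglik <- function(sigma2) {
  -sum(dnorm(x, mean = 0, sd = sqrt(sigma2), log = TRUE))
}

# Reparameterized version: the optimizer works on log(sigma2), which can
# take any real value, so no bounds are needed.
negloglik_log <- function(log_sigma2) negloglik(exp(log_sigma2))

# Unconstrained search on the log scale.
fit <- nlm(negloglik_log, p = log(1))

# Transform the estimate back to the natural scale.
sigma2_hat <- exp(fit$estimate)

# Analytic maximum-likelihood estimate for this toy model, for comparison.
sigma2_mle <- mean(x^2)
print(c(nlm = sigma2_hat, analytic = sigma2_mle))
```

Because exp() is always positive, the search can never propose an invalid value, which is the point of the transformation.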
Project Description
A) Ideally, you would use the cases we have identified as test cases and then generalize from them to make recommendations of this sort. I imagine this would mean looking at several different problems in this domain, showing how different numerical optimization approaches fail or succeed, and then combining that with other knowledge of numerical optimization or statistics generally to arrive at reasonable recommendations. Remember, the audience is biologist-programmers, who probably choose optim() because it comes up first in the help files and seems easy to use, so any guidance on what works better is an improvement.
B) Barring that, at least be able to say that in the limited case of geiger, for example, nlm() should be used instead of optim() with bounds. All this would entail is showing that, over a range of known values, nlm() always performs at least as well as optim() and sometimes much better.
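A comparison of the kind described in B could be structured roughly as below. This is a hedged sketch, not a definitive test rig: the likelihood is a toy normal model rather than one of geiger's models, the bounds 0.01 and 5 are hypothetical and are not geiger's actual bounds, and compare_one is a made-up helper name.

```r
set.seed(42)

# Toy negative log-likelihood for a variance parameter (stand-in model).
negloglik <- function(sigma2, x) {
  -sum(dnorm(x, mean = 0, sd = sqrt(sigma2), log = TRUE))
}

# Fit one simulated dataset two ways and report both results.
# The default bounds are hypothetical values chosen for illustration.
compare_one <- function(true_sigma2, n = 200, lower = 0.01, upper = 5) {
  x <- rnorm(n, mean = 0, sd = sqrt(true_sigma2))

  # Bounded search on the natural scale.
  bounded <- optim(par = 1, fn = function(s) negloglik(s, x),
                   method = "L-BFGS-B", lower = lower, upper = upper)

  # Unbounded search on the log scale.
  unbounded <- nlm(function(p) negloglik(exp(p), x), p = log(1))

  data.frame(
    true      = true_sigma2,
    optim_est = bounded$par,
    nlm_est   = exp(unbounded$estimate),
    optim_nll = bounded$value,
    nlm_nll   = unbounded$minimum,
    # Flag estimates that sit on a bound.
    at_bound  = abs(bounded$par - lower) < 1e-8 ||
                abs(bounded$par - upper) < 1e-8
  )
}

# Known values to recover; the last one lies outside the hypothetical bounds.
results <- do.call(rbind, lapply(c(0.1, 1, 20), compare_one))
print(results)
```

Rows where at_bound is TRUE and nlm_nll is clearly lower than optim_nll are the cases in which the bounds, rather than the data, determined the bounded estimate.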
Deliverables
- A test rig for optimization algorithms in phylogenetic contexts.
- A collection of test results that identify problematic algorithms/implementations and the conditions under which they fail.
- Recommendations on which algorithms are best suited to the different tasks.
Ideally, this project will result in a publication.
Milestones
Critical steps. Refer to project description and deliverables.
Milestone | ETA | Reached |
---|---|---|
 | | |
 | | |
 | | |
Team members (if applicable)
Name | Role | Contact |
---|---|---|
 | | |
Origin
Links to the discussions that led to the project
Jira link
Required to track progress and individual tasks
Dependencies/Related projects
If this project will be integrated with other projects, the connection should be listed here