GLM_20110516

Participants

Ali Akoglu

Dave Lowenthal

Martha Narro

Tapasya Patki

Matt Vaughn

 

Agenda

Discuss the update posted by Ali on 5/6/2011.

Notes

  • Population effects are now handled.
  • GPU version of the PLM.
    • It is far more efficient than the GLM version was.
  • They got further than expected, and so ran into some issues working on TACC.
    • They have an MPI version and an MPI+GPU version on TACC.
    • Performance of the GPU implementation is not as good as they would like.
  • Matt's philosophy is: get it working, then get it working right, then get it working fast.
    • They are in the "get it working fast" stage.
  • The multi-node, multi-GPU version has issues that would not come up in the other implementations.
    • For example, where the data will reside in this implementation.
    • TP: The accuracy is fine.
    • The issue is how to lay out the GPUs and distribute the data (see the sketch after this list).
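
A minimal sketch of the layout question, assuming a block partition of SNPs across MPI ranks with one GPU per rank. This is illustrative only, not the project's code; every name and size in it (numSnps, numIndividuals, genotypes, and so on) is a placeholder:

    // Sketch: partition SNPs across MPI ranks, bind each rank to a local
    // GPU, and copy that rank's slice of the genotype matrix to the device
    // once, so per-SNP kernels run without further host<->device traffic.
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nRanks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

        // Bind this rank to one of the node's GPUs (assumes no more
        // ranks per node than GPUs per node).
        int nDevices = 0;
        cudaGetDeviceCount(&nDevices);
        if (nDevices > 0) cudaSetDevice(rank % nDevices);

        const long numSnps = 100000;      // placeholder problem sizes
        const int  numIndividuals = 1000;

        // Block-partition the SNPs; remainders go to the low ranks.
        long base = numSnps / nRanks, extra = numSnps % nRanks;
        long mySnps  = base + (rank < extra ? 1 : 0);
        long myFirst = rank * base + (rank < extra ? rank : extra);
        std::printf("rank %d owns SNPs [%ld, %ld)\n",
                    rank, myFirst, myFirst + mySnps);

        // Host-side slice for this rank: mySnps x numIndividuals values.
        std::vector<float> genotypes(mySnps * (long)numIndividuals, 0.0f);
        // ... fill this rank's slice from the input data here ...

        // One bulk copy to the device; after this, per-SNP kernels read
        // only device-resident data.
        float* dGenotypes = nullptr;
        cudaMalloc((void**)&dGenotypes, genotypes.size() * sizeof(float));
        cudaMemcpy(dGenotypes, genotypes.data(),
                   genotypes.size() * sizeof(float), cudaMemcpyHostToDevice);

        // ... launch per-SNP kernels over dGenotypes here ...

        cudaFree(dGenotypes);
        MPI_Finalize();
        return 0;
    }

The point of the sketch: each rank's slice is copied once and stays resident, so the per-SNP work never waits on host-device transfers, which is where the communication cost was reported to be.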
  • MV: The one-to-one comparison is a decent speedup.
  • The performance problem is in the communication.
  • A: Which is more informative for you: time spent on alignment per SNP or complete execution time?
    • MV: Thinks execution time per SNP is what needs to be optimized.
    • D: Makes sense.
  • 448 vs. 362 sec is a ____ (the CPU MPI run is one node, one core).
  • The numbers are much better off Longhorn.
  • That is probably because we are new on Longhorn; expect it to be better within a month.
  • Getting only a 20% speedup on a GPU would not be satisfactory to us.
  • MV: Longhorn is large and has older, single-precision GPUs.
    • Lonestar has a decent number of new Fermi nodes(?). Might that help?
    • D: We could try that if our latest runs do not look better.
    • They would need an account.
    • MV: I can arrange that. No problem.
  • A: The CPU/GPU hybrid approach is not giving a great improvement.
  • Plan: reduce the memory footprint and run the whole thing on the GPU, to cut memory I/O overhead.
  • There is a limit on the total number of threads (see the sketch below).
  • MV: Won’t change size of dataset, just number of threads?
    • D: Correct.
    • M: That's a bit unintuitive; thanks for clarifying. I had wondered about it in the update.
    • D: That's why we wanted to talk.
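
A minimal sketch of the thread-limit point, using the standard CUDA grid-stride idiom: the dataset is untouched and a fixed thread budget walks the whole SNP array. This is illustrative, not the project's kernel; computeSnpStat and all of its parameters are placeholder assumptions:

    #include <cuda_runtime.h>

    // Placeholder for the real per-SNP model fit; here, just a mean.
    __device__ float computeSnpStat(const float* genoRow, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += genoRow[i];
        return s / n;
    }

    // Grid-stride loop: each thread handles SNP j, then j + stride, and
    // so on, so any number of SNPs is covered by a bounded thread count.
    __global__ void perSnpKernel(const float* genotypes, float* stats,
                                 long numSnps, int numIndividuals) {
        long stride = (long)gridDim.x * blockDim.x;
        for (long j = (long)blockIdx.x * blockDim.x + threadIdx.x;
             j < numSnps; j += stride) {
            stats[j] = computeSnpStat(genotypes + j * numIndividuals,
                                      numIndividuals);
        }
    }

    // Host launcher: the grid size is a fixed budget chosen for the
    // card, and correctness does not depend on numSnps.
    void runAllSnps(const float* dGeno, float* dStats,
                    long numSnps, int numIndividuals) {
        perSnpKernel<<<1024, 256>>>(dGeno, dStats, numSnps, numIndividuals);
        cudaDeviceSynchronize();
    }

This matches the clarification above: the thread count is capped by the hardware, not by the dataset, so changing the number of threads does not change the size of the problem being solved.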
  • A: Remember that the CPU version also improved (GLM vs. PLM). If we hadn't improved the CPU version (the baseline for comparison), the GPU numbers would look better.
  • MV: Is that the C++ code from John Peterson?
  • TP: Yes, modified by Peter.
  • MV: Include a run with that original code. That would make a nice comparison.
  • Ali described the next steps to work on:
    • 1) Further optimization of the current implementation.
    • 2) The ideal comparison would be a build with the SVD eliminated versus the partitioned linear model (PLM); a sketch of the algebra follows this list.
      • MV: This problem launched a long linear algebra discussion here.
    • 3) Stepwise regression
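
For reference, one standard piece of algebra behind this kind of comparison. This is an assumption about what "SVD eliminated" refers to, not something taken from the update: if the per-SNP model has the form

    % Assumed per-SNP fixed-effects model:
    \[ y = X\beta + s_j \gamma_j + \varepsilon \]
    % where X holds the covariates shared by every SNP. Decompose X once
    % (thin SVD, X = U \Sigma V^\top) and form the projector and residual:
    \[ M = I - U U^\top, \qquad \tilde{y} = M y \]
    % then each SNP j costs only a couple of inner products:
    \[ \hat{\gamma}_j = \frac{(M s_j)^\top \tilde{y}}{\lVert M s_j \rVert^2} \]
    % so the expensive decomposition happens once, not once per SNP.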
  • MV: They are building Tassel into the DE.
    • GLM, CLM, modular...
    • It would be nice to drop a hook to this code into their DE tool.
    • Once it is running on TACC's hardware, that becomes possible; there is a lot more to Tassel than this algorithm.
  • MV: This project has been an interesting model for data scalability that cannot simply be wrapped or handled by adding cores/nodes (which is expensive). It illustrates that the original algorithm's code itself needs to be optimized.
  • Dave and Ali agreed to post monthly updates to the wiki.

Decisions

  • Next work to be done:
    • 1) Further optimization of the current implementation.
    • 2) The ideal comparison would be a build with the SVD eliminated versus the partitioned linear model (PLM).
    • 3) Stepwise regression
  • Dave and Ali will post monthly updates to the wiki.