GLM_20110516
Participants
Ali Akoglu
Dave Lowenthal
Martha Narro
Tapasya Patki
Matt Vaughn
Agenda
Discuss the update posted by Ali on 5/6/2011.
Notes
- Population effects are now handled.
- GPU version of PLM
- Far more efficient than the GLM version was.
- They got further than expected, so ran into some issues working on TACC.
- Have an MPI version and an MPI+GPU version on TACC.
- Performance of the GPU implementation is not as good as they would like.
- Matt's philosophy is: get it working, then get it working right, then get it working fast.
- They are in the "get it fast" stage.
- Multi-node, multi-GPU has issues that would not come up in the other implementations.
- For example, where data will reside in this implementation.
- TP: The accuracy is fine.
- The issue is how to lay out the GPUs and distribute the data.
- MV: The 1-to-1 comparison is a decent speedup.
- The performance problem is in the communication.
- A: Which is more informative for you: time spent on alignment per SNP or complete execution time?
- MV: Thinks execution time per SNP is what needs to be optimized.
- D: Makes sense.
- 448 vs. 362 sec is a ____ (CPU MPI is one node, one core).
- Numbers are much better off of Longhorn.
- That is probably because we are new on Longhorn; expect it to be better within a month.
- Getting only a 20% speedup on a GPU would not be satisfactory to us.
- MV: Longhorn is large and has older, single-precision GPUs.
- Lonestar has a decent number of new Fermi nodes(?). Might that help?
- D: We could try that if our latest runs do not look better.
- Would need an account.
- MV: I can arrange that. No problem.
- A: The CPU-GPU hybrid approach is not giving great improvement.
- Reduce the memory footprint and run the whole thing on the GPU, to reduce memory I/O overhead.
- There is a limit on the total number of threads.
- MV: Won't change the size of the dataset, just the number of threads?
- D: Correct.
- M: That's a bit unintuitive. Thanks for clarifying. I wondered about it in the update.
- D: That's why we wanted to talk.
- A: Remember that the CPU version also improved (GLM vs. CLM). If we hadn't improved the CPU version (the baseline for comparison), the numbers would look better.
- MV: Is that the c++ from John Peterson?
- TP: Yes, modified by Peter.
- MV: Include the run with that original code. That would make a nice comparison.
- Ali describing next steps to work on:
- 1) Further optimization on current implementation
- 2) The ideal comparison would be a build with the SVD eliminated, and the partitioned linear model (PLM).
- MV: This problem launched a long linear algebra discussion here.
- 3) Stepwise regression
- MV: They are building TASSEL into the DE.
- GLM, CLM, modular...
- It would be nice to drop a hook to this code into their DE tool.
- Once it is running on TACC's hardware, there is a lot more to TASSEL than this algorithm.
- MV: This project has been an interesting model for data scalability that cannot simply be wrapped or handled by additional cores/nodes (expensive). It illustrates that the original algorithm's code needs to be optimized.
- Dave and Ali agreed to post monthly updates to the wiki.
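For reference, the "execution time per SNP" metric discussed above can be measured against a plain CPU baseline along these lines. This is an illustrative sketch only, not the project's actual GLM/PLM code: the function name, data shapes, and use of ordinary least squares per SNP are assumptions.

```python
import time
import numpy as np

def per_snp_glm_times(genotypes, phenotype, covariates):
    """Fit an ordinary least-squares model for each SNP and record wall time.

    genotypes:  (n_samples, n_snps) matrix of genotype codes
    phenotype:  (n_samples,) trait values
    covariates: (n_samples, k) fixed-effect covariates (e.g. population effects)
    """
    n, m = genotypes.shape
    times = np.empty(m)
    betas = np.empty(m)
    for j in range(m):
        t0 = time.perf_counter()
        # Design matrix: intercept, covariates, and the j-th SNP.
        X = np.column_stack([np.ones(n), covariates, genotypes[:, j]])
        coef, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
        betas[j] = coef[-1]  # effect estimate for this SNP
        times[j] = time.perf_counter() - t0
    return betas, times

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
n, m = 200, 50
G = rng.integers(0, 3, size=(n, m)).astype(float)
C = rng.normal(size=(n, 2))
y = 0.5 * G[:, 0] + C @ np.array([0.3, -0.2]) + rng.normal(size=n)

betas, times = per_snp_glm_times(G, y, C)
print(f"mean time per SNP: {times.mean():.6f} s, total: {times.sum():.6f} s")
```

Reporting both the per-SNP mean and the total keeps the two metrics raised in the discussion (time per SNP vs. complete execution time) side by side.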
Decisions
- Next work to be done:
- 1) Further optimization on current implementation
- 2) The ideal comparison would be a build with the SVD eliminated, and the partitioned linear model (PLM).
- 3) Stepwise regression
- Dave and Ali will post monthly updates to the wiki.