SI_20091110
iPG2P Statistical Inference
November 10, 2009
Attendees: Chris Myers, Dan Kliebenstein, Ed Buckler, Liya Wang, Peter Bradbury, Steve Goff, Steve Welch, Jean-Luc Jannink, Nirav Merchant
Action Items:
- Karla and Liya find the person at TACC who would help us identify the approaches available at TACC for GLM.
- Karla and Liya distribute the information about available approaches to the group, and we attempt to make a decision via email to get things going. Possibly a Doodle poll.
- Everyone contemplate mixed models and which of the two approaches would be most useful to implement first and which second.
- Everyone (me especially) read up on Machine Learning.
- Dan will email everyone in a week to see how things are going.
Notes/Agenda:
- Personalities
- For iPlant, there is a general process to identify potential users and develop their profiles to help with the computational development (Know your audience).
- Are the following profiles sufficient at this time?
- Andres (the computational postdoc).
- Generates phenotypic data and wants to fully analyze all potential genotypic links for publication.
- Biologically savvy graduate student
- Generates phenotypic data and just wants to find a list of genotypes that might influence the phenotype
- Mathematical or Statistical scientist
- Wants to access the data and computational resources to test their algorithm for genotype to phenotype linkage
- Would want the ability to insert new algorithms into the package for use.
- Would want the ability to get performance measures on the computational time per test, etc.
- High school or undergraduate lab student
- Might have a mapping population and do a phenotyping analysis in a lab course.
- Would then want to be able to do the genotype to phenotype tests and get some generally informative answer.
- Likely involve links to other modules
- Lab Instructor
- Ability to understand what their students are doing with the module.
- Data – Was sent along to other groups.
- Data sets that could be used to test the below implementations
- Maize NAM structure
- 5000+ lines
- 1.8 million SNP genotypes (to be 30 million in future)
- 100+ phenotypes
- 10+ environments for some phenotypes
- Arabidopsis eQTL within Recombinant Inbred line population
- 200 or 400 lines
- 500+ genotypes
- 60,000 phenotypes (mean and per line variance for each transcript)
- 2 tissues
- Arabidopsis Metabolomics on genome wide association mapping lines.
- 96+ lines
- 250,000 genotypes
- 1000+ phenotypes (mean and per line variance for each metabolite)
- Simulated datasets
- 50,000+ lines
- Saturated genotypes (whatever that means)
- Coalescent genotypes via selfing or outcrossing models
- All additive variance components versus some epistatic variance
- Moderate core (10-20 genes, simple secondary metabolism) versus larger core
- Moderate core = 10-20 genes explaining most of variance
- Boring secondary metabolism?
- Larger core = 100-200 genes explain most of variance
- Use E. coli metabolism-to-biomass model
- Check on yeast mechanistic models
- Biomorphic versus polymorphic loci
- All factorials of above for comparison and potentially unique group paper
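As a rough illustration of the simulated-dataset factorials listed above (number of lines, size of the causal core, additive versus partly epistatic variance), the following is a minimal sketch only. The function name `simulate`, its parameters, and its defaults are hypothetical and were not discussed at the meeting. Genotypes are drawn independently per SNP, so this does not implement the coalescent, selfing, or outcrossing models, and because 0/1 coding leaves the interaction terms correlated with the additive terms, `epistatic_fraction` is only a nominal mixing weight.

```python
import random
import statistics


def standardize(values):
    """Center and scale to unit variance; constant input maps to zeros."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def simulate(n_lines=200, n_snps=100, n_causal=10, h2=0.5,
             epistatic_fraction=0.0, seed=0):
    """Return (genotypes, phenotypes, causal_indices) for one simulated set.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Inbred lines with 0/1 allele codes, drawn independently per SNP
    # (a simplification: no coalescent or linkage structure).
    genotypes = [[rng.randint(0, 1) for _ in range(n_snps)]
                 for _ in range(n_lines)]
    causal = rng.sample(range(n_snps), n_causal)
    effects = [rng.gauss(0, 1) for _ in causal]
    # Interactions between consecutive causal loci stand in for epistasis.
    pairs = list(zip(causal[:-1], causal[1:]))
    pair_effects = [rng.gauss(0, 1) for _ in pairs]

    additive = [sum(e * row[j] for e, j in zip(effects, causal))
                for row in genotypes]
    epistatic = [sum(e * row[a] * row[b]
                     for e, (a, b) in zip(pair_effects, pairs))
                 for row in genotypes]

    # Mix the two standardized components with nominal weights.
    w_add = (1 - epistatic_fraction) ** 0.5
    w_epi = epistatic_fraction ** 0.5
    genetic = standardize([w_add * a + w_epi * e for a, e in
                           zip(standardize(additive), standardize(epistatic))])
    # Scale genetic signal and noise so heritability is roughly h2.
    phenotypes = [h2 ** 0.5 * g + (1 - h2) ** 0.5 * rng.gauss(0, 1)
                  for g in genetic]
    return genotypes, phenotypes, causal
```

Calling `simulate(n_causal=10)` versus `simulate(n_causal=100, n_snps=1000)` would correspond loosely to the moderate versus larger core, and varying `epistatic_fraction` to the additive versus epistatic comparison.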
- Algorithms
- GLM (ANOVA)
- What is framework to implement?
- Need information about what TACC has available for GLM
- ??
- ??
- Mixed Model
- Identified two potential differences
- Where only one random matrix is essential (structure), possibly use pre-processing to address other issues
- EMMA with two SNPs or more on moderate datasets
- EMMA with one SNP on monster datasets
- Where the researcher has multiple random factors of interest and wants them addressed simultaneously (i.e. G x Random E)
- ASReml
- Check on animal and human literature
- Machine Learning
- Citations via Jannink
- Application to plant data:
- Bedo, J., P. Wenzl, A. Kowalczyk, and A. Kilian. 2008. Precision-mapping and statistical validation of quantitative trait loci by machine learning. BMC Genetics 9:35.
- Application to human data:
- Wei, Z., K. Wang, H.-Q. Qu, H. Zhang, J. Bradfield, C. Kim, E. Frackleton, C. Hou, J.T. Glessner, R. Chiavacci, C. Stanley, D. Monos, S.F.A. Grant, C. Polychronakos, and H. Hakonarson. 2009. From Disease Association to Risk Assessment: An Optimistic View from Genome-Wide Association Studies on Type 1 Diabetes. PLoS Genet 5:e1000678.
- General discussion of linear-model-type approaches, like the ones we have mostly been discussing so far, versus machine-learning-type approaches, which JLJ found enlightening:
- Breiman, L. 2001. Statistical Modeling: The Two Cultures. Statistical Science 16:199-231.
- Bayesian
- Previously decided to postpone Bayes for future
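For reference while the GLM framework question is open, the simplest form of the GLM (ANOVA) item above can be sketched as a marker-by-marker one-way ANOVA of phenotype on genotype class. This is a hedged illustration only, written in plain Python with no external libraries. It is not the framework the group chose and says nothing about what TACC offers. The function names `anova_f` and `scan` and the toy data are hypothetical, and the sketch reports only F statistics (no p-values, covariates, or population-structure correction).

```python
import random


def anova_f(phenotypes, marker):
    """One-way ANOVA F statistic of phenotype on genotype class at one marker."""
    groups = {}
    for y, g in zip(phenotypes, marker):
        groups.setdefault(g, []).append(y)
    n, k = len(phenotypes), len(groups)
    if k < 2 or n <= k:
        return 0.0  # monomorphic marker, or too few lines to estimate error
    grand_mean = sum(phenotypes) / n
    means = {g: sum(ys) / len(ys) for g, ys in groups.items()}
    ss_between = sum(len(ys) * (means[g] - grand_mean) ** 2
                     for g, ys in groups.items())
    ss_within = sum((y - means[g]) ** 2
                    for g, ys in groups.items() for y in ys)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def scan(phenotypes, genotypes):
    """Return one F statistic per marker; genotypes is lines x markers."""
    n_markers = len(genotypes[0])
    return [anova_f(phenotypes, [row[j] for row in genotypes])
            for j in range(n_markers)]


# Toy data: 100 lines, 20 markers; marker 3 shifts the phenotype by 2 units.
rng = random.Random(1)
genotypes = [[rng.randint(0, 1) for _ in range(20)] for _ in range(100)]
phenotypes = [2.0 * row[3] + rng.gauss(0, 1) for row in genotypes]
f_stats = scan(phenotypes, genotypes)
best = max(range(len(f_stats)), key=f_stats.__getitem__)
print("strongest marker:", best)
```

Each marker is tested independently, so a scan of this kind parallelizes trivially across markers, which is presumably the relevant property for the larger datasets listed above.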
- Next Meeting – Tuesday December 1st at the same time