SI_20091110

iPG2P Statistical Inference

November 10, 2009

Attendees: Chris Myers, Dan Kliebenstein, Ed Buckler, Liya Wang, Peter Bradbury, Steve Goff, Steve Welch, Jean-Luc Jannink, Nirav Merchant

Action Items:

  • Karla and Liya to find the person at TACC who can help us identify the approaches available at TACC for GLM.
  • Karla and Liya to distribute the information about available approaches to the group; we will then try to reach a decision via email to get things going. Possible Doodle poll.
  • Everyone to consider mixed models and which of the two approaches would be most useful to implement first and which second.
  • Everyone (me especially) read up on Machine Learning.
  • Dan will email everyone in a week to see how things are going.

Notes/Agenda:

  1. Personalities
    1. For iPlant, there is a general process to identify potential users and develop their profiles to help with the computational development (Know your audience).
    2. Are the following profiles sufficient at this time?
      1. Andres (the computational postdoc).
        1. Generates phenotypic data and wants to fully analyze all potential genotypic links for publication.
      2. Biologically savvy graduate student
        1. Generates phenotypic data and just wants to find a list of genotypes that might influence the phenotype
      3. Mathematical or Statistical scientist
        1. Wants to access the data and computational resources to test their algorithm for genotype to phenotype linkage
        2. Would want the ability to insert new algorithms into the package for use.
        3. Would want the ability to get performance measures on the computational time per test, etc.
      4. High school or undergraduate lab student
        1. Might have a mapping population and do a phenotyping analysis in a lab course.
        2. Would then want to be able to do the genotype to phenotype tests and get some generally informative answer.
        3. Would likely involve links to other modules.
      5. Lab Instructor
        1. Wants the ability to understand what their students are doing with the module.
  2. Data – Was sent along to other groups.
    1. Data sets that could be used to test the below implementations
      1. Maize NAM structure
        1. 5000+ lines
        2. 1.8 million SNP genotypes (to be 30 million in future)
        3. 100+ phenotypes
        4. 10+ environments for some phenotypes
      2. Arabidopsis eQTL within a recombinant inbred line (RIL) population
        1. 200 or 400 lines
        2. 500+ genotypes
        3. 60,000 phenotypes (mean and per-line variance for each transcript)
        4. 2 tissues
      3. Arabidopsis metabolomics on genome-wide association mapping lines
        1. 96+ lines
        2. 250,000 genotypes
        3. 1000+ phenotypes (mean and per-line variance for each metabolite)
      4. Simulated datasets
        1. 50,000+ lines
        2. Saturated genotypes (whatever that means)
        3. Coalescent genotypes via selfing or outcrossing models
        4. All additive variance components versus some epistatic variance
        5. Moderate core (10-20 genes, simple secondary metabolism) versus larger core
          1. Moderate core = 10-20 genes explaining most of variance
            1. Boring secondary metabolism?
          2. Larger core = 100-200 genes explaining most of variance
            1. Use E. coli metabolism-to-biomass model
            2. Check on yeast mechanistic models
        6. Biomorphic versus polymorphic loci
        7. All factorials of above for comparison and potentially unique group paper
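The "all factorials" comparison in the simulated-dataset item above can be enumerated mechanically. The following is only a minimal sketch: the factor names and level labels are hypothetical shorthand for the contrasts listed in the notes (they are not an agreed naming scheme), and no actual simulation is performed.

```python
from itertools import product

# Hypothetical labels for the simulation contrasts listed in the notes.
# The level names are shorthand chosen for this sketch only.
FACTORS = {
    "mating_model": ["selfing", "outcrossing"],
    "variance": ["all_additive", "some_epistatic"],
    "core_size": ["moderate_10_20_genes", "larger_100_200_genes"],
    "loci": ["biomorphic", "polymorphic"],  # wording copied from the notes
}


def factorial_design(factors):
    """Return one dict per combination of factor levels (full factorial)."""
    names = list(factors)
    levels = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*levels)]


designs = factorial_design(FACTORS)
print(len(designs), "simulation scenarios")
for scenario in designs[:3]:
    print(scenario)
```

Each dictionary describes one scenario that a simulator would need to cover; adding a level to any factor multiplies the number of scenarios accordingly.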
  3. Algorithms
    1. GLM (ANOVA)
      1. What is the framework to implement?
        1. Need information about what TACC has available for GLM
        2. ??
        3. ??
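For orientation only, a single-marker GLM (ANOVA) scan can be sketched as below. This is not the framework the group chose (that was still open, pending information from TACC); it is a plain NumPy illustration that assumes genotypes are coded as small integers and that each marker is tested independently with a one-way ANOVA F statistic. The function name and the toy data are hypothetical.

```python
import numpy as np


def single_marker_f(genotypes, phenotype):
    """One-way ANOVA F statistic of the phenotype on each marker.

    genotypes: (n_lines, n_markers) integer-coded genotype classes.
    phenotype: (n_lines,) trait values.
    Returns an array of F statistics; NaN where the test is undefined.
    """
    genotypes = np.asarray(genotypes)
    y = np.asarray(phenotype, dtype=float)
    n = y.size
    grand_mean = y.mean()
    ss_total = np.sum((y - grand_mean) ** 2)
    f_stats = np.full(genotypes.shape[1], np.nan)
    for j in range(genotypes.shape[1]):
        column = genotypes[:, j]
        classes = np.unique(column)
        k = classes.size
        if k < 2 or n <= k:
            continue  # monomorphic marker or no residual degrees of freedom
        ss_between = sum(
            (column == c).sum() * (y[column == c].mean() - grand_mean) ** 2
            for c in classes
        )
        ss_within = ss_total - ss_between
        if ss_within <= 0:
            continue  # perfect fit; F is undefined
        f_stats[j] = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_stats


# Toy data (made up): 200 lines, 50 biallelic markers, marker 7 is causal.
rng = np.random.default_rng(0)
geno = rng.integers(0, 2, size=(200, 50))
trait = 2.0 * geno[:, 7] + rng.normal(size=200)
f_values = single_marker_f(geno, trait)
print("marker with largest F:", int(np.nanargmax(f_values)))
```

At the scale of the datasets listed above (millions of markers, many phenotypes), a loop like this would need to be vectorized or distributed, which is the reason for asking what TACC already provides.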
    2. Mixed Model
      1. Identified two potential differences
        1. Where only one random matrix is essential (structure); possibly use pre-processing to address other issues
          1. EMMA with two SNPs or more on moderate datasets
          2. EMMA with one SNP on monster datasets
        2. Where the researcher has multiple random factors of interest and wants them addressed simultaneously (i.e. G x Random E)
          1. ASReml
          2. Check on animal and human literature
    3. Machine Learning
      1. Citations via Jannink
        1. Application to plant data:
          1. Bedo, J., P. Wenzl, A. Kowalczyk, and A. Kilian. 2008. Precision-mapping and statistical validation of quantitative trait loci by machine learning. BMC Genetics 9:35.
        2. Application to human data:
          1. Wei, Z., K. Wang, H.-Q. Qu, H. Zhang, J. Bradfield, C. Kim, E. Frackleton, C. Hou, J.T. Glessner, R. Chiavacci, C. Stanley, D. Monos, S.F.A. Grant, C. Polychronakos, and H. Hakonarson. 2009. From Disease Association to Risk Assessment: An Optimistic View from Genome-Wide Association Studies on Type 1 Diabetes. PLoS Genet 5:e1000678.
        3. General discussion, which JLJ found enlightening, of linear-model-type approaches (like the ones we have mostly been discussing so far) versus machine-learning-type approaches:
          1. Breiman, L. 2001. Statistical Modeling: The Two Cultures. Statistical Science 16:199-231.
    4. Bayesian
      1. Previously decided to postpone Bayesian approaches until later
  4. Next Meeting – Tuesday, December 1st, at the same time