A Guide for Further Developing Winnow: Statistician Edition
Introduction
This tutorial is designed to guide more statistically savvy users in adding performance metrics to the Winnow validation program. To give a brief overview, Winnow is the validation section of the Validate Workflow. Given the “known-truth” of a dataset and the output from a genome-wide association study (GWAS) analysis tool, Winnow can produce a set of fit statistic values to help users evaluate the appropriateness, or lack thereof, of said analysis tool. The Winnow program itself is divided into several files.
Winnow program architecture:
- winnow.py: Contains and initiates the main function of the application. It also makes all necessary calls to other files.
- checkhidden.py: Provides functions for checking for hidden files within a directory to ensure those are not included in any analysis. (This file does not need to be modified.)
- data.py: Sets up the "Data" class, which transforms a delimited file into usable data.
- fileimport.py: Provides functions to import data. The current options are whitespace, comma, and tab delimitation.
- gwas.py: Contains functions for executing analysis of a GWAS application. For a prediction application, additional functions will need to be made.
- commandline.py: Contains functions for retrieving command line arguments.
- performetrics.py: Contains all individualized functions for performance metrics used in validation of an application. This is the file to which one would add the actual formula for any additional performance metrics.
When the GWAS output is passed to the functions in the performetrics.py file, several objects are used in calculating the fit statistics.
Object Name | Object Definition |
---|---|
betaColumn | The generated list or vector of SNP weights. This will only be generated if a valid effect size/beta column is specified for the Winnow analysis via the --beta command line argument. |
betaTrueFalse | The effect sizes obtained from the known-truth file (i.e. files with an *.ote extension). Generally, this list has the same length as betaColumn, except it contains zeros for SNPs with no effect, and then the explicit quantitative effect for all other SNPs. Like the betaColumn object, this will only be generated if a valid effect size/beta column is specified for the Winnow analysis via the --beta command line argument. |
snpTrueFalse | This "truth" list is equivalent to betaTrueFalse but expressed categorically with booleans rather than quantitatively (i.e. all values will be either True or False). |
scoreColumn | The list of "scores" that the GWAS application assigned to each SNP. For example, this is the column that would generally hold P-values. |
threshold | A scalar used by metrics that require a cutoff, such as True Positive Rate. It represents the significance level for an analysis: any SNP whose score (e.g. P-value) is less than the threshold is deemed statistically significant. It can be used in conjunction with the P-value adjustment option (see the Winnow README file for more details). |
covar | The list or vector of covariate weights. This object will only be created if a covariate column in the GWAS output is specified via the --covar command line argument. |
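To make the relationships between these objects concrete, here is a minimal sketch using invented toy values. The variable `calledSignificant` is a hypothetical name used only for illustration, and the way `snpTrueFalse` is derived here is an assumption based on the table's description, not a copy of Winnow's own code.

```python
# Toy stand-ins for the objects in the table above; every value is invented.
betaColumn = [0.8, 0.1, -0.5, 0.0]        # effect sizes reported by the GWAS tool
betaTrueFalse = [1.2, 0.0, -0.7, 0.0]     # known-truth effect sizes (0 = no effect)
scoreColumn = [0.001, 0.20, 0.03, 0.75]   # e.g. P-values reported by the GWAS tool
threshold = 0.05                          # significance level

# Categorical version of the truth: True wherever the known effect is non-zero.
# (Assumed derivation; Winnow builds this list from the known-truth file.)
snpTrueFalse = [beta != 0 for beta in betaTrueFalse]

# Hypothetical helper list: SNPs the tool would call significant at this threshold.
calledSignificant = [score < threshold for score in scoreColumn]
```

Note that all of the per-SNP lists have the same length and are aligned by position, so index `i` refers to the same SNP in each of them.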
Winnow Change Log:
Validate Workflow version 0.9 – Released September 2015 – Added covariate implementation for certain types of GWAS data, a P-value adjustment option from statsmodels Python package, and a feature for generating parameter files with each Winnow run through. Increased and automated unit test coverage via Travis CI.
Validate Workflow version 0.7 – Released June 2015 – Major restructuring of Winnow source code for optimized performance. Revamped documentation and GitHub repository structure. Added more GWAS tools for use with Winnow, and began unit test coverage for key Winnow functions.
Validate Workflow version 0.5 – Released February 2015 – First introduction of Validate as a workflow rather than a stand-alone program. Renaming of eponymous Validate program to Winnow. Integrated other GWAS analysis tools and additional performance metrics.
Validate for Python version 0.8.0 – Released September 2014 – First release of Python re-write. Validate in Beta. All functionality works with defects in the H measure.
Validate for R version 0.8.0 – Released May, 2014 – Re-write of R version of Validate. Same version as included on the Validate-toolkit but installed on the DE.
Validate-Toolkit-0.3.0 – New version of Validate written in R, released on Atmosphere in Spring 2014 – Included a major expansion of features, including the ability to handle different file types (for truth files), some automatic file transformation (for truth files), and ten new performance measures. Also corrected a bug that forced Validate to fail when only a single result/output file was being validated. (Under GitHub repository name "ktaR")
Validate for R version 0.3.0 – Released on the DE in Fall 2013 – First release of the R version of Validate. This version was an alpha, but most functionality was available. (Under GitHub repository name "ktaR")
Binary classification in Winnow:
At their core, all GWAS analysis tools are essentially binary classification models. In other words, these tools work to separate every SNP into two categories: significant or not significant. While the GWAS tools may extract other information from the data to analyze (e.g. phenotype, genetic variance), arguably the most important part is the statistical significance of each SNP in comparison to a dataset’s known-truth. Because of this setup, it is very easy to visualize a GWAS tool’s performance via a confusion matrix (also called a contingency table). A confusion matrix is based on four main statistics:
- True positives: P(identified as significant | SNP is significant)
- False positives: P(identified as significant | SNP is not significant). In a typical frequentist hypothesis test, this would also be labeled a type I error.
- True negatives: P(identified as not significant | SNP is not significant)
- False negatives: P(identified as not significant | SNP is significant). In a typical frequentist hypothesis test, this would also be labeled a type II error.
Confusion Matrix Illustration:
 | Classified Significant | Classified Not Significant | Total |
---|---|---|---|
Known-truth positive | True Positive | False Negative | P' |
Known-truth negative | False Positive | True Negative | N' |
Total | P | N | |
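The four cells of the table can be tallied directly from the objects described earlier. The following is a minimal sketch, not Winnow's own implementation: the function name `confusion_counts` and the toy data are hypothetical, and it assumes a SNP is classified as significant when its score is strictly less than the threshold.

```python
def confusion_counts(snpTrueFalse, threshold, scoreColumn):
    """Return (tp, fp, tn, fn) counts for one set of GWAS scores."""
    tp = fp = tn = fn = 0
    for isTrue, score in zip(snpTrueFalse, scoreColumn):
        called = score < threshold  # classified significant (assumed rule)
        if called and isTrue:
            tp += 1
        elif called and not isTrue:
            fp += 1
        elif not called and isTrue:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


# Invented toy data: five SNPs, two of which truly have an effect.
truth = [True, True, False, False, False]
pValues = [0.01, 0.20, 0.03, 0.50, 0.90]
tp, fp, tn, fn = confusion_counts(truth, 0.05, pValues)
print("TP=%d FP=%d TN=%d FN=%d" % (tp, fp, tn, fn))  # TP=1 FP=1 TN=2 FN=1
```

In terms of the table, the row totals are P' = TP + FN and N' = FP + TN, and the column totals are P = TP + FP and N = FN + TN.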
Note that Winnow output includes all of these confusion matrix values. Beyond these four main values, one can derive a number of other fit statistics, most of which are included in the Winnow program. For example:
- Sensitivity: The number of correctly identified true positives divided by the total number of known-truth positives (i.e. true positives + false negatives). Range: 0 ≤ sensitivity ≤ 1
- Specificity: The number of correctly identified true negatives divided by the total number of known-truth negatives (i.e. true negatives + false positives). Range: 0 ≤ specificity ≤ 1.
- Precision: The number of correctly identified positives divided by the total number of detected positives (both true and false). If neither true nor false positives were detected (meaning nothing was detected as significant), then Winnow returns “undefined” as the precision value. Range: 0 ≤ precision ≤ 1.
- Matthews’ correlation coefficient: Metric for measuring the overall quality of a binary classifier. Range: -1 ≤ MCC ≤ 1. A coefficient of 1 indicates perfect classification, a coefficient of -1 represents perfectly wrong classification, and 0 indicates the classifier is equivalent to guessing randomly. See Wikipedia for more information: https://en.wikipedia.org/wiki/Matthews_correlation_coefficient.
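The derived statistics listed above can be written in a few lines once the four counts are known. This is a sketch under stated assumptions rather than Winnow's code: the function names and signatures are hypothetical, the "undefined" return for precision follows the description above, and returning 0.0 when the MCC denominator is zero is a common convention that is assumed here, not something the Winnow documentation confirms.

```python
import math


def sensitivity(tp, fn):
    # Assumes at least one known-truth positive (tp + fn > 0).
    return float(tp) / (tp + fn)


def specificity(tn, fp):
    # Assumes at least one known-truth negative (tn + fp > 0).
    return float(tn) / (tn + fp)


def precision(tp, fp):
    if tp + fp == 0:
        return "undefined"  # nothing was classified as significant
    return float(tp) / (tp + fp)


def mcc(tp, fp, tn, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when any margin of the table is empty
    return (tp * tn - fp * fn) / denominator


# Example counts (invented): TP=1, FP=1, TN=2, FN=1.
print(sensitivity(1, 1))   # 0.5
print(precision(1, 1))     # 0.5
print(precision(0, 0))     # undefined
```

With the same counts, specificity is 2/3 and the MCC is 1/6, a weakly positive value that reflects a classifier only slightly better than random guessing.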
Adding a performance metric to Winnow: Example
Now that we have an idea of what type of fit statistics Winnow works with, how can we add one of our own? Thanks to the structure of the Winnow software, its use of popular scientific Python modules, and the ease of use of the Python language as a whole, adding your own performance metric is simple. For example, suppose one wanted to include a prevalence metric in Winnow to see the proportion of actually significant SNPs in the total sample. The main file you will be editing is performetrics.py, since that is where the function for prevalence will be stored. In that file, start by writing the function header for the prevalence formula.
Note that this function requires three of the objects listed above: the snpTrueFalse column, the significance threshold, and the scoreColumn holding the p-values. Next, we will need to set up the four main confusion matrix statistics (since prevalence involves the total). Fortunately, since these values are already included in Winnow, we may simply call these functions rather than rewriting our own.
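Then, we must instruct Winnow to return the value for prevalence whenever the function is called; therefore, we finish the function with a return statement that divides the number of known-truth positives (true positives + false negatives) by the total of all four confusion matrix values.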
Also, for the sake of clarity, it helps to add some comments underneath the function header to explain to others what the function does and what it returns. In the case of Winnow, all function descriptions are blocked off by triple quotes, not only so we can write multi-line comments but also so that Python can detect these descriptions as docstrings (for the doctest module; see here: https://docs.python.org/2/library/doctest.html).
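The steps above (header, confusion matrix values, return statement, docstring) can be sketched as one function. The helper names `tp`, `fp`, `tn`, and `fn`, their shared signature, and the `_count` utility are assumptions made so the example runs on its own; in the real performetrics.py you would call whatever confusion matrix functions that file already defines instead of the stand-ins below.

```python
def _count(snpTrueFalse, threshold, scoreColumn, truth, called):
    """Stand-in utility: count SNPs with the given truth/classification pair."""
    return sum(1 for isTrue, score in zip(snpTrueFalse, scoreColumn)
               if bool(isTrue) == truth and (score < threshold) == called)


# Stand-ins for the confusion matrix functions assumed to exist already.
def tp(snpTrueFalse, threshold, scoreColumn):
    return _count(snpTrueFalse, threshold, scoreColumn, True, True)


def fp(snpTrueFalse, threshold, scoreColumn):
    return _count(snpTrueFalse, threshold, scoreColumn, False, True)


def tn(snpTrueFalse, threshold, scoreColumn):
    return _count(snpTrueFalse, threshold, scoreColumn, False, False)


def fn(snpTrueFalse, threshold, scoreColumn):
    return _count(snpTrueFalse, threshold, scoreColumn, True, False)


def prevalence(snpTrueFalse, threshold, scoreColumn):
    """
    Returns the proportion of known-truth significant SNPs among all SNPs.

    >>> prevalence([True, False], 0.05, [0.01, 0.5])
    0.5
    """
    truePositives = tp(snpTrueFalse, threshold, scoreColumn)
    falsePositives = fp(snpTrueFalse, threshold, scoreColumn)
    trueNegatives = tn(snpTrueFalse, threshold, scoreColumn)
    falseNegatives = fn(snpTrueFalse, threshold, scoreColumn)
    total = truePositives + falsePositives + trueNegatives + falseNegatives
    # float() keeps the division exact under Python 2 as well as Python 3.
    return float(truePositives + falseNegatives) / total
```

The example inside the docstring is written in doctest format, so it doubles as a small automated check of the function. This sketch does not guard against an empty SNP list, which would make `total` zero.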
Finally, to make sure Winnow recognizes the function, we must add it to the gwas functions in the gwas.py file. To do this, write the function name in the first list with the other names, and write the function call in the second list. The first list holds the names used to write the header of the Winnow output, and the second list calls the functions that return the actual values. Keep in mind that the order of the two lists must be consistent (e.g. if you write “prevalence” as the last entry in the names list, make sure it is also the last function call in the second list). Since this function works strictly with SNPs and SNP scores, it should be included regardless of whether effect sizes/betas or covariates are specified. Because of this universality, you will need to repeat this addition in all four gwas functions.
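The two-list pattern described above might look roughly like the following. This is only an illustration of the idea: `exampleGwas`, `_counts`, and the two metric functions are hypothetical stand-ins, and the real functions in gwas.py have their own names, arguments, and longer lists.

```python
def _counts(snpTrueFalse, threshold, scoreColumn):
    """Stand-in helper: return (tp, fp, tn, fn) counts."""
    tp = fp = tn = fn = 0
    for isTrue, score in zip(snpTrueFalse, scoreColumn):
        called = score < threshold
        if called and isTrue:
            tp += 1
        elif called:
            fp += 1
        elif isTrue:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(snpTrueFalse, threshold, scoreColumn):
    tp, fp, tn, fn = _counts(snpTrueFalse, threshold, scoreColumn)
    return float(tp) / (tp + fn)


def prevalence(snpTrueFalse, threshold, scoreColumn):
    tp, fp, tn, fn = _counts(snpTrueFalse, threshold, scoreColumn)
    return float(tp + fn) / (tp + fp + tn + fn)


def exampleGwas(snpTrueFalse, threshold, scoreColumn):
    """Hypothetical gwas-style function returning (names, values)."""
    # First list: column names for the output header.
    names = ["sens", "prevalence"]
    # Second list: the matching function calls, in the same order.
    values = [sensitivity(snpTrueFalse, threshold, scoreColumn),
              prevalence(snpTrueFalse, threshold, scoreColumn)]
    return names, values


names, values = exampleGwas([True, True, False, False, False], 0.05,
                            [0.01, 0.20, 0.03, 0.50, 0.90])
```

Because the header and the values are matched purely by position, appending the new name and the new call at the end of their respective lists is the least error-prone way to add a metric.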
Putting it all together:
Now that we have our new fit statistic, let’s try it out with some data! For this example, we will bring in some preexisting PLINK output. To do this, we will run the Winnow program through the command line like so:
If the terminal prompt is returned without any error messages, the program ran successfully. If we then open the Winnow output in a spreadsheet program (Excel in this case), we can see that our prevalence metric is included among the other fit statistics.
And that is all we need to integrate other fit statistics into Winnow. If you have other features you would like to be included in later versions of the Winnow software, feel free to contact us at labstapleton@gmail.com