A Quick Guide for Further Developing Validate


Validate for the Developer

This documentation is intended as a comprehensive manual for adding to or altering Validate so that it provides a more useful way to test genotype-to-phenotype association methods with known-truth data sets. Importantly, this manual is meant to work in conjunction with the docstrings and other documentation within the code itself, so that you can jump into altering the code much more quickly.

Validate, as you may be aware, is a tool written in Python that is intended to completely replace other, more ad hoc, known-truth testing methods.

Known-truth testing is, quite simply, a method of creating realistic data sets where we "know the truth" and then seeing how well our methods identify this "truth." Validate covers the second part of this sentence.

The code for Validate is written in Python, chosen because it is an easy-to-understand language. Since validation methods may change over the years, this manual was written to tell tool developers and others how to alter the code so that it does what they need it to do.

Although, in many ways, this documentation may resemble an API developer doc, Validate is more of an open-source script than an API. However, we have worked hard to modularize the scripts so that they are easily modifiable and understandable by others. Further, by relying on well-known scientific modules, we hope to encourage their use in writing additional functions and classes for Validate.

Module, Object, and Function Reference

Validate is divided into several modular files, listed below. For function and class references, please see the docstrings accompanying each function in the Python code.

File Name – Purpose of File

validate.py – Contains and initiates the main function of the application. It also makes all necessary calls to the other files.

checkhidden.py – Does not need to be modified. It simply provides functions for checking for hidden files within a directory to ensure those are not included in any analysis.

data.py – Contains a class named "Data" which transforms a delimited file into usable data.

fileimport.py – Provides functions to import data. Currently, whitespace and comma delimitation are all that is supported (a rough sketch of the idea follows this table).

gwas.py – Contains functions for executing the analysis of a GWAS application. For a prediction application, additional functions will need to be written.

commandline.py – Contains functions for retrieving command line arguments.

performetrics.py – Contains all the individualized functions for the performance metrics used in validating an application. This is the file we expect will be most heavily modified in the future.
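
For orientation, here is a rough, hypothetical sketch of the kind of import routine fileimport.py provides. The function name and logic below are our own illustration, not Validate's actual code:

# Hypothetical sketch only; not Validate's actual fileimport.py code
def read_delimited(path):
    """Read a whitespace- or comma-delimited file into a list of row lists."""
    rows = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # split on commas if any are present, otherwise on any whitespace
            rows.append(line.split(',') if ',' in line else line.split())
    return rows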

There are five main objects (four lists and one scalar) that you will need to understand in order to write additional functions (especially metric functions). A made-up example of what each might hold appears after the table below. These are:

Object Name – What the Object Actually Is

betaColumn – The generated list or vector of SNP weights.

betaTrueFalse – The "truth" about the betas, expressed numerically. This is generated from the truth files specified during execution (files with an *.ote extension). Generally, this list has the same length as betaColumn; it contains zeros for SNPs with no effect and the explicit quantitative effect for all other SNPs.

snpTrueFalse – This "truth" list is equivalent to betaTrueFalse but is expressed categorically with booleans rather than quantitatively.

scoreColumn – The list of generated "scores" assigned to each SNP by the GWAS application; typically this is the column containing p-values.

threshold – A scalar quantity required by metrics, such as true positive rate, that need a threshold.
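
To make the shapes concrete, here is an entirely made-up example of what these five objects might hold for five SNPs; the numbers are illustrative only and do not come from any real Validate run:

# Entirely made-up example values for the five objects (five SNPs)
betaColumn    = [0.12, -0.03, 0.45, 0.00, -0.21]   # estimated SNP weights from the GWAS application
betaTrueFalse = [0.10,  0.00, 0.50, 0.00, -0.25]   # true effects from the *.ote truth file (0 = no effect)
snpTrueFalse  = [True, False, True, False, True]   # the same truth expressed as booleans
scoreColumn   = [1e-6, 0.34, 3e-8, 0.72, 5e-4]     # e.g. p-values assigned to each SNP
threshold     = 0.05                               # scalar cut-off used by threshold-based metrics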

Change Log

Validate for Python release 0.8.0 – Scheduled for mid-June 2014 – First release of the Python re-write. Validate is in beta. All functionality works, with defects in the H measure.

Validate for R release 0.8.0 – Released May 2014 – Re-write of the R version of Validate. Same version as included in the Validate-toolkit, but installed on the DE.

Validate-Toolkit-0.3.0 – New version of Validate written in R, released on Atmosphere Spring 2014 – Included a major expansion of features: the ability to handle different file types (for truth files), some automatic file transformation (for truth files), and ten new performance measures; also corrected a bug that forced Validate to fail when only a single result/output file was being validated. (Under the GitHub repository name "ktaR")

Validate for R release 0.3.0 – Released on the DE Fall 2013 – First release of the R version of Validate. This version was an alpha, but most functionality was available. (Under the GitHub repository name "ktaR")

Developer Guide

Adding an additional performance metric

Adding an additional performance metric to the Python version of Validate is straightforward. Let's use the example of adding a correlation (this feature is already included as the correlation between estimated SNP weights and the actual "truth" SNP weights).

1. To add this measure, we first want to import a package that contains a correlation function. We could write our own, which would be simple enough, but for now we will assume you have no desire to reinvent the wheel and will simply use the Pearson correlation available through the SciPy module. So at the beginning of the file (performetrics.py, where the metric functions live), we add:

from scipy import stats

2. Then, anywhere in this file, we can write the function that performs the correlation. We know from the object reference above that we need betaColumn and betaTrueFalse. The function might look something like this:

def r(betaColumn, betaTrueFalse):
    # numpy is assumed to be imported as np at the top of performetrics.py
    betaColumn = np.array(betaColumn)
    betaTrueFalse = np.array(betaTrueFalse)
    # pearsonr returns (correlation, p-value); keep only the correlation
    return stats.pearsonr(betaColumn, betaTrueFalse)[0]
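
As a quick, hypothetical sanity check (not part of Validate itself), you could call the new function directly with a couple of short, made-up lists; the values below are purely illustrative:

# Made-up values only; two nearly proportional lists should give a correlation close to 1
print(r([0.12, -0.03, 0.45, 0.00], [0.10, 0.00, 0.50, 0.00]))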

3. Finally, the metric needs to be added to the gwas functions in the gwas.py file, so that the correlation function we just created is called and its result saved while a results file is being validated.

def gwasWithBeta(betaColumn, betaTrueFalse, snpTrueFalse, scoreColumn, threshold):
    return (["h", "rmse", "mae", "r", "r2", "auc", "tp", "fp", "tn", "fn", "tpr", "fpr", "error", "sens", "spec", "precision", "youden"],
            [h(snpTrueFalse, scoreColumn), rmse(betaColumn, betaTrueFalse), mae(betaColumn, betaTrueFalse), r(betaColumn, betaTrueFalse), r2(betaColumn, betaTrueFalse), auc(snpTrueFalse, scoreColumn),
             tp(snpTrueFalse, threshold, scoreColumn), fp(snpTrueFalse, threshold, scoreColumn), tn(snpTrueFalse, threshold, scoreColumn), fn(snpTrueFalse, threshold, scoreColumn),
             tpr(snpTrueFalse, threshold, scoreColumn), fpr(snpTrueFalse, threshold, scoreColumn), error(snpTrueFalse, threshold, scoreColumn), sens(snpTrueFalse, threshold, scoreColumn),
             spec(snpTrueFalse, threshold, scoreColumn), precision(snpTrueFalse, threshold, scoreColumn), youden(snpTrueFalse, threshold, scoreColumn)])

The first list in the return statement specifies the metric names, which will be used in the header of the Validate results file. Just make sure that each name's position in the first list corresponds to the position of the matching function call in the second list. Finally, keep in mind that we are only altering the gwasWithBeta function, since we obviously could not compute this correlation for a GWAS application that did not report SNP weights.
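
If it helps to see that positional pairing spelled out, here is a small, hypothetical illustration; the metric values are made up and the loop is not part of Validate's actual output code:

# Made-up names and values showing how the two returned lists line up by position
names = ["h", "rmse", "mae", "r"]
values = [0.91, 0.08, 0.05, 0.87]
for name, value in zip(names, values):
    print(name, value)   # each header name is paired with the metric computed at the same position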
