Baltzell - Final Report

myCoGe Data-Integrator: An Automated Genomics Data Integration Pipeline

Abstract

The sharp decrease in the cost of genome sequencing, combined with its greater speed, has paved the way for an era of personal genomics. With affordable genome sequencing within the reach of individuals, data volumes are growing rapidly, funded largely by private companies and interested individuals. With this growth in data from disparate sources, new ways to compare these data are needed. The objective of the myCoGe data-integration pipeline is to obtain publicly available human variant (SNP) experiment data and load them into the comparative genomics platform CoGe (www.genomevolution.org/coge). Conversions are performed that allow all individual experiments to be compared directly against each other, providing a foundational data set for powerful variant analyses. All experiments are publicly available through the myCoGe notebook on CoGe (https://goo.gl/3vneI9).

Introduction

The information age has brought with it changes in nearly all areas of society, from social interactions to military affairs to health and medicine. Of particular interest in the health and medicine fields is the rapid advancement of genomic and genetic sequencing capabilities. New “next-gen” sequencing technologies are driving a rapid decrease in the cost of sequencing a genome, forever changing the ways many researchers seek to answer questions (Figure 1) [1]. These decreased costs not only benefit traditional research fields, but have also allowed a new field to emerge: personal genomics. Personal genomics refers to the sequencing of individuals’ genomes for health, ancestry, and other inquiries. It is an integral part of personalized (or “precision”) medicine and has shown great promise in the treatment of complex diseases and conditions. When personal genomics is combined with other technological advances, medical researchers are poised to solve problems previously regarded as nearly impossible.

For the first ten years of the Human Genome Project, the speed at which Sanger-based capillary sequencing generated sequence data was the rate-limiting factor [2]. Fortunately, in 2005 new (“massively parallel” or “next-gen”) sequencing technologies started becoming available. Today, generating data is no longer limiting, and analysis has become the rate-limiting factor in genomic studies. This point is well illustrated by the projects that followed the Human Genome Project and aimed to investigate variation in the human genome. The 1000 Genomes Project was the first large sequencing project based on next-gen technologies, with the goal of identifying the genetic variations that occur in 1% or more of the human population [3]. In the first 6 months of the 1000 Genomes Project, more sequencing data was added to the GenBank repository than in the previous 30 years [4]. Another large project, the HapMap project, also began around this time with the aim of identifying common single-nucleotide polymorphisms (SNPs). The HapMap project identified over 10 million common SNPs across the human genome, all of which were added to the National Center for Biotechnology Information SNP Database (NCBI dbSNP) and made available for query [5].

This large-scale identification of variation across the genome has enabled many studies and technological advancements, including genome-wide association studies, in which specific shared loci are identified in patients with complex diseases and conditions, and the development of variant microarrays, cost-effective technologies that can identify variation in individuals. Recently, private companies such as 23andMe (www.23andMe.com) and AncestryDNA (dna.ancestry.com) have adopted these microarrays to offer ethnic, health, and other information to individuals based on their genetic makeup. In doing so, these companies are generating huge volumes of genetic data without relying on grant funding. While not all commercially gathered data are available to researchers, many individuals choose to make their data public. In fact, Forbes recently reported that of 23andMe’s 800,000 customers, 600,000 consented to having their data shared with researchers [6]. Other resources, such as the Personal Genome Project (PGP) (www.PersonalGenomes.org) and Open Humans (www.openhumans.org), provide means for individuals to share their genetic data and health history with researchers.

The objective of the myCoGe data-integration pipeline is to integrate publicly available SNP variant data and health records into the comparative genomics platform CoGe. Initial records will be taken from the Personal Genome Project (PGP), while future versions will pull from additional resources. The data-integration pipeline is one component of the myCoGe project, which aims to provide a framework of tools and data sets that allow professional researchers, as well as citizen scientists, to investigate how variation affects the human genome. MyCoGe will integrate variant data, functional data, gene model annotations, and advanced querying tools to accomplish this goal.


Figure 1: Rapid decline in sequencing costs. A chart of sequencing costs from 2001 to 2014. Values obtained from www.genome.gov [1].

Implementation

The myCoGe pipeline is an automated data-integration pipeline that obtains publicly available variant (SNP) data sets from the Personal Genome Project and loads them into the comparative genomics platform CoGe. The underlying code is written mostly in Python 2.7, with a basic bash shell script linking parts of the pipeline together, creating a seamless, automated execution that requires user input only when unknown file types are encountered. Researchers and individuals interested in utilizing the experiments from the myCoGe pipeline need not interact with the pipeline; all data can be accessed directly on CoGe through the myCoGe notebook (https://goo.gl/3vneI9). All source code for the myCoGe pipeline is open source and publicly available through GitHub (https://github.com/asherkhb/myCoGe).

Figure 2: The myCoGe pipeline. Schematic outline of the myCoGe data-integration pipeline. Green shading indicates specific scripts; detailed sub-steps are shown beneath each major step.

While the pipeline is complex (Figure 2), it is best explained by separating the process into five major steps, sketched below. First, a list of all applicable experiments and their associated metadata is generated from the PGP website. After this metadata/experiment list is generated, the raw data for any new experiments (those not already obtained by the pipeline) are downloaded directly to the development server. The experiments are then converted to a common format (VCF) and reformatted against a common reference genome (GRCh38). After processing is complete, the experiment/metadata sets are loaded into the iPlant Data Store and the CoGe servers. Finally, a log of the newly integrated experiments and other performance reports from the iteration are generated and emailed to the system administrators. Data sets are then available for use on CoGe through the myCoGe notebook.
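
As a schematic illustration only, the control flow of these five steps can be expressed as a short Python driver. The function names below are hypothetical stand-ins for the actual pipeline scripts, not the real myCoGe code:

    # Hypothetical driver sketching the five pipeline steps; each step
    # function is an illustrative stub, not actual myCoGe code.

    def scrape_experiments():
        """Step 1: scrape the PGP site for experiment links and metadata."""
        return []  # SNPScraper compiles this list into a JSON object

    def already_obtained(huid):
        """Check the log of experiments obtained on previous iterations."""
        return False

    def download_experiment(experiment):
        """Step 2: fetch the raw data file to the development server."""
        return "raw/%s.txt" % experiment["huID"]

    def convert_to_vcf(path):
        """Step 3: normalize to VCF 4.2 against GRCh38."""
        return path.replace(".txt", ".vcf")

    def load_into_coge(vcf_path, experiment):
        """Step 4: push to the iPlant Data Store and register with CoGe."""

    def email_report(new_experiments):
        """Step 5: mail logs and reports to the administrators."""

    def main():
        new = [e for e in scrape_experiments()
               if not already_obtained(e["huID"])]
        for experiment in new:
            raw = download_experiment(experiment)
            vcf = convert_to_vcf(raw)
            load_into_coge(vcf, experiment)
        email_report(new)

    if __name__ == "__main__":
        main()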

Generating Experiment/Metadata Lists

The PGP lacks an official API, so obtaining a list of all relevant experiments and their associated metadata is accomplished via “web scraping”. The web scraper (snpscraper.py, hereafter SNPScraper) is written in Python 2.7 and built on the Scrapy framework (http://scrapy.org/). For each experiment that meets our criteria (SNP variant data from select sources, currently limited to 23andMe and AncestryDNA), the associated huID (the unique identifier of the individual who contributed the experiment to the PGP), the experiment download link, the huID profile link, and any contributed health records are pulled from the PGP website. These records are compiled into a JSON object that is used downstream in the pipeline.
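
A minimal sketch of such a spider is shown below. The start URL and CSS selectors are assumptions made for illustration (the real PGP page structure is not reproduced here), and this is not the actual SNPScraper source:

    import scrapy

    class SNPSpider(scrapy.Spider):
        """Illustrative PGP spider; URL and selectors are hypothetical."""
        name = "snpscraper"
        start_urls = ["https://my.pgp-hms.org/public_genetic_data"]  # assumed

        def parse(self, response):
            for row in response.css("table tr"):  # assumed page layout
                source = row.css("td.source::text").extract_first()
                if source not in ("23andMe", "AncestryDNA"):
                    continue
                yield {
                    "huID": row.css("td.participant::text").extract_first(),
                    "download": row.css("a.download::attr(href)").extract_first(),
                    "profile": row.css("a.profile::attr(href)").extract_first(),
                }

Run with, for example, "scrapy runspider snpspider.py -o experiments.json" to produce a JSON list analogous to the object SNPScraper passes downstream.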

Downloading Experiments

Experiment IDs and download links are extracted from the JSON object, compared against a log of already-obtained experiments, and compiled into a download queue. Experiments are downloaded sequentially to reduce load on the PGP servers, though future versions will parallelize the download process. In the case of compressed files, the pipeline automatically decompresses and renames the files to match our conventions (experiments are named by their huID and stored in a .txt format).
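
A simplified version of this step might look as follows; the requests library stands in for whatever HTTP client the pipeline actually uses, and the single-file-per-archive assumption is for brevity:

    import gzip, os, shutil, zipfile
    import requests  # assumed HTTP client, not necessarily the pipeline's choice

    def fetch_experiment(huid, url, out_dir="raw"):
        """Download one experiment, decompress if needed, rename to <huID>.txt."""
        tmp = os.path.join(out_dir, os.path.basename(url))
        resp = requests.get(url, stream=True)
        resp.raise_for_status()
        with open(tmp, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=65536):
                fh.write(chunk)
        target = os.path.join(out_dir, huid + ".txt")
        if tmp.endswith(".zip"):
            with zipfile.ZipFile(tmp) as archive:  # assume one data file inside
                with archive.open(archive.namelist()[0]) as src, \
                        open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
            os.remove(tmp)
        elif tmp.endswith(".gz"):
            with gzip.open(tmp, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.remove(tmp)
        else:
            os.rename(tmp, target)
        return target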

Converting Experiments

One challenge in comparing and understanding experiments from different sources arises from differences in both file format and the associated reference genome. Utilizing an “experiment” class, the myCoGe pipeline automatically corrects for both of these issues, allowing direct comparison of data from different times, sources, and providers. Additionally, these conversions allow for comparisons against the most up-to-date human reference genome, GRCh38.

Most commercial services offering SNP sequencing provide the raw data to users in a tab-separated format, but the format varies slightly between providers. The experiment class allows for automatic detection of the file format and automatic conversion to the CoGe-preferred Variant Call Format (VCF) v4.2. This conversion requires reformatting and applying missing metadata, rearranging information into a specified 8-column tab-separated format, and obtaining a reference allele for each SNP. Reference alleles are not usually provided in these raw experiment data sets, so they are obtained from NCBI by comparison with dbSNP build 142.
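
To make this concrete: a 23andMe raw file is tab-separated with columns rsid, chromosome, position, genotype, preceded by "#" comment lines. The sketch below, which is not the actual experiment class, reduces the dbSNP reference-allele query to an in-memory dictionary for brevity:

    VCF_HEADER = ("##fileformat=VCFv4.2\n"
                  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")

    def convert_23andme_to_vcf(in_path, out_path, ref_lookup):
        """Convert 23andMe raw data (rsid, chrom, pos, genotype) to VCF 4.2.

        ref_lookup: dict mapping rsID -> reference allele, a stand-in for a
        real dbSNP build 142 query.
        """
        with open(in_path) as src, open(out_path, "w") as dst:
            dst.write(VCF_HEADER)
            for line in src:
                if line.startswith("#"):  # 23andMe files begin with comments
                    continue
                rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
                ref = ref_lookup.get(rsid)
                if ref is None:           # no reference allele found: skip SNP
                    continue
                # Alternate alleles are the genotype calls differing from REF.
                alts = sorted(set(a for a in genotype
                                  if a in "ACGT" and a != ref))
                alt = ",".join(alts) if alts else "."
                dst.write("\t".join([chrom, pos, rsid, ref, alt,
                                     ".", ".", "."]) + "\n")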

Each new version of the human genome brings significant changes to location indices, and aligning and comparing data sets across different reference genomes can be quite challenging. To overcome this problem, the pipeline utilizes a master reference of all identified variations in the human genome to update locations and reference alleles based on the GRCh38 reference human genome. This allows experiments to be directly compared regardless of their original source, and allows comparisons to occur over the most up-to-date human reference genome.
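
Because every SNP carries a stable rsID, this remapping can be keyed on identifiers rather than coordinates. Assuming the master reference has been parsed into a dictionary, the core of the idea is a simple lookup (a sketch, not the pipeline's implementation):

    def remap_to_grch38(records, master_ref):
        """Rewrite each record's location and REF to GRCh38 coordinates.

        records: iterable of dicts with keys 'id', 'chrom', 'pos', 'ref'.
        master_ref: dict mapping rsID -> (grch38_chrom, grch38_pos, grch38_ref),
                    a stand-in for the master variant reference.
        """
        for rec in records:
            hit = master_ref.get(rec["id"])
            if hit is None:
                continue  # variant absent from the master reference
            rec["chrom"], rec["pos"], rec["ref"] = hit
            yield rec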

Loading Experiments

After experiments have been downloaded to the development server and processed into the correct format, they are transferred to the iPlant Data Store (www.iplantc.org). The iPlant Data Store is an ideal location for storing these data due to its large capacity, availability across many different platforms, redundant backups, and fast transfers through the iRODS infrastructure. In addition, CoGe can directly access the Data Store, greatly improving the speed of imports. Each experiment is loaded into CoGe through the CoGe API by providing the relevant metadata and the location of the experiment file on the Data Store. All experiments are loaded into the myCoGe notebook, where they can be accessed by interested researchers and individuals.
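
In outline, this step stages the file with the iRODS icommands and then registers it over HTTP. The destination collection, endpoint, and payload fields below are illustrative assumptions rather than confirmed details of the CoGe API:

    import json, subprocess
    import requests  # assumed HTTP client

    GRCH38_GENOME_ID = 0  # placeholder for CoGe's internal GRCh38 genome id

    def load_experiment(vcf_path, metadata):
        """Stage a processed VCF in the Data Store, then register with CoGe."""
        dest = "/iplant/home/shared/mycoge/%s.vcf" % metadata["huID"]  # assumed
        subprocess.check_call(["iput", "-f", vcf_path, dest])  # iRODS transfer

        payload = {
            "name": metadata["huID"],
            "genome_id": GRCH38_GENOME_ID,
            "source_data": [{"path": dest}],
            "metadata": metadata,
        }
        resp = requests.put("https://genomevolution.org/coge/api/v1/experiments",
                            data=json.dumps(payload))  # endpoint is an assumption
        resp.raise_for_status()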

Generating Log Reports

Because myCoGe is an automated pipeline, it is essential that relevant information about each run is stored and automatically sent to the system administrators for review. After each run of the pipeline, an email is composed that contains a list of all newly incorporated experiments, the experiment/metadata JSON from SNPScraper, preserved (“pickled”) forms of important data structures, and a log of all standard output and standard error produced during the iteration. Additionally, any unidentified file types are stored in a unique folder accessible to administrators, allowing for targeted addition of new, previously unsupported experiment formats.
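
A pared-down version of this reporting step, using only the Python standard library (the sender and recipient addresses are placeholders), might read:

    import pickle, smtplib
    from email.mime.application import MIMEApplication
    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText

    def send_report(new_experiments, metadata_json, log_text,
                    sender="mycoge@example.org", admins=("admin@example.org",)):
        """Email the iteration's report and attachments to the administrators."""
        msg = MIMEMultipart()
        msg["Subject"] = "myCoGe pipeline report"
        msg["From"] = sender
        msg["To"] = ", ".join(admins)
        body = "Newly incorporated experiments:\n" + "\n".join(new_experiments)
        msg.attach(MIMEText(body))
        msg.attach(MIMEApplication(metadata_json.encode("utf-8"),
                                   Name="experiments.json"))
        msg.attach(MIMEApplication(pickle.dumps(new_experiments),
                                   Name="state.pkl"))
        msg.attach(MIMEApplication(log_text.encode("utf-8"), Name="run.log"))
        smtplib.SMTP("localhost").sendmail(sender, list(admins), msg.as_string())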

Results

The myCoGe pipeline was installed on the GECO server at the University of Arizona and manually executed using the control script. Future runs of the myCoGe pipeline will be executed automatically using the Cron job scheduler (an example schedule is sketched below). Automatic execution will occur once per week, allowing any newly added experiments to be automatically imported into the myCoGe notebook. The experiment list generated by the initial run contained 576 experiments matching our criteria. It is expected that this number will increase significantly as support for more file types is added, with a project goal of obtaining over 1000 publicly available data sets. To date, 81 of these experiments have been successfully downloaded, converted, and loaded into the myCoGe notebook. The remaining files are queued for download, with PGP server bandwidth limiting the rate at which the raw experiment files can be obtained. All experiments can be viewed through the myCoGe notebook (https://goo.gl/3vneI9).
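
For example, a weekly Sunday-morning run could be scheduled with a crontab entry along these lines (the script path is a placeholder, not the actual install location):

    # min hour day-of-month month day-of-week  command
    0 3 * * 0  /path/to/mycoge/run_pipeline.sh >> /var/log/mycoge.log 2>&1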

Discussion

The rapid advancement of genomic sequencing technologies has led to a massive accumulation of genomic data which, when managed correctly and queried efficiently, offers the potential to answer many biological questions. These data are of particular interest in the medical fields, showing promise for treating complex diseases and conditions and for offering more targeted, personalized medical care. Unfortunately, these valuable data sets are spread across many databases and online resources, and are not easily integrated with each other. The myCoGe pipeline helps alleviate this problem by integrating many types of experimental data into a comparative platform where they can be directly compared.

While the myCoGe pipeline is a fully functioning data-integration pipeline, additional developments are still in progress. The experiment class will see updates that expand the range of accepted data types and sources, increasing the total experiment count to over 1000. Multi-threaded downloads will be introduced to speed up the data-acquisition step, but only after technical issues surrounding bulk downloading from the PGP have been resolved. Additional efficiency and stability updates will be added, with the goal of producing a robust end product capable of long-term self-reliance.

The myCoGe import pipeline is the first step in the broader myCoGe project, whose end goal is to provide a powerful, robust tool set for individuals and researchers. These tools and associated data sets will form a framework for inspecting and comparing commercially produced genomic variation data sets alongside extensive functional experimental data, building on the advanced comparative environment CoGe to investigate how variation affects functioning within the genome. A unique feature of the myCoGe project is that it will allow individuals who have purchased SNP sequencing from a variety of sources to integrate their own data, investigate it themselves, and share it anonymously with professional researchers.

References

1. Wetterstrand K: DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Available at: www.genome.gov.

2. Mardis ER: A decade’s perspective on DNA sequencing technology. Nature 2011, 470:198–203.

3. 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA: A map of human genome variation from population-scale sequencing. Nature 2010, 467:1061–1073.

4. Stein LD: The case for cloud computing in genome informatics. Genome Biol 2010, 11:207.

5. International HapMap 3 Consortium, Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Peltonen L, Dermitzakis E, Bonnen PE, Altshuler DM, Gibbs RA, de Bakker PIW, Deloukas P, Gabriel SB, Gwilliam R, Hunt S, Inouye M, Jia X, Palotie A, Parkin M, Whittaker P, Yu F, Chang K, Hawes A, Lewis LR, Ren Y, et al.: Integrating common and rare genetic variation in diverse human populations. Nature 2010, 467:52–58.

6. Herper M: Surprise! With $60 Million Genentech Deal, 23andMe Has A Business Plan. Forbes 2015.