Baltzell - Weekly Updates
4/21/15
Completed this Week:
"Experiment" Class Expansions:
- "Class" which holds different SNP experiments. Has features:
- "Class" experiments using:
- <experiment_name> = SnpExperiment('<huID>', '<raw_experiment_file>', '<new_vcf_file>')
- i.e. experiment23 = SnpExperiment('hu4D906B', './hu4D906B.txt', './vcfs/hu4D906B.vcf')
- Class Properties:
- id = huID
- file_path = path to raw experiment file
- file_source = source of experiment (see below)
- vcf_reference_genome = reference genome for converted VCF
- vcf_path = path to converted VCF
- Identifies "source" of experiment
- Currently supports: 23andMe, AncestryDNA, Generic 4-column SNP TSV, Generic 5-column SNP TSV
- Source can then be called using <experiment_name>.file_source
- Convert to VCF
- Based on the identified source, use appropriate VCF conversion function.
- Called using <experiment_name>.convert_to_vcf()
- "Class" experiments using:
- Experiment class is now integrated into myCoGe. Tested and functional with 23andMe, AncestryDNA, and generic 4-column TSV datasets
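A minimal sketch of the class interface described above (the detection heuristics and conversion internals are simplified stand-ins here, not the actual myCoGe implementation):

```python
class SnpExperiment:
    """Holds one SNP experiment and its conversion state.

    Sketch of the interface described above; the real detection
    and conversion logic lives in myCoGe.
    """

    def __init__(self, hu_id, file_path, vcf_path):
        self.id = hu_id
        self.file_path = file_path
        self.vcf_path = vcf_path
        self.vcf_reference_genome = None
        self.file_source = self._identify_source()

    def _identify_source(self):
        # Simplified stand-in heuristics: the real detection likely
        # inspects vendor header comments and column counts.
        with open(self.file_path) as fh:
            header = fh.readline()
        if "23andMe" in header:
            return "23andMe"
        if "AncestryDNA" in header:
            return "AncestryDNA"
        columns = header.rstrip("\n").split("\t")
        if len(columns) == 4:
            return "Generic 4-column SNP TSV"
        if len(columns) == 5:
            return "Generic 5-column SNP TSV"
        return "Unidentifiable"

    def convert_to_vcf(self):
        # Dispatch to the appropriate converter based on source.
        # (Conversion bodies omitted in this sketch.)
        raise NotImplementedError(self.file_source)
```

Usage matches the example above: experiment23 = SnpExperiment('hu4D906B', './hu4D906B.txt', './vcfs/hu4D906B.vcf'), then experiment23.file_source.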
For Next Week:
Load data onto CoGe! (ASAP, as soon as CoGe is running)
Also, I would like to add:
1. An error check which funnels "unidentifiable" file types to a special folder for manual review, so that new formats can be identified and incorporated.
2. Email data report summary (eventually; needs the full pipeline first).
4/14/15
Completed this Week:
"Experiment" Class: Holds experiments, determines what the file-source is, then converts to a VCF. All SNPs are updated to refer to the latest reference human genome.
Things learned about:
- Indexing: holds locations in a file so access is easy (important because the reference file is >20 GB; don't want to run through it line by line every time).
- Pickle: Stores python data structures as a file, they can then be reloaded for access.
- cPickle: Same as pickle, but way faster.
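These two ideas combine naturally: build a byte-offset index over the big reference file once, pickle it, and reload it for fast seeks later. A sketch, assuming a tab-separated file keyed on its first column (this entry used cPickle on Python 2; in Python 3, plain pickle uses the fast C implementation automatically):

```python
import pickle

def build_index(path):
    """Map the first column of each line to its byte offset."""
    index = {}
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()
            line = fh.readline()
            if not line:
                break
            key = line.split(b"\t", 1)[0].decode()
            index.setdefault(key, offset)
    return index

def save_index(index, index_path):
    # Persist the index so future runs skip the full scan.
    with open(index_path, "wb") as fh:
        pickle.dump(index, fh)

def load_line(path, index, key):
    """Jump straight to a record instead of scanning the whole file."""
    with open(path, "rb") as fh:
        fh.seek(index[key])
        return fh.readline().decode()
```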
Pipeline: Functional portions installed on GECO.
For Next Week:
Complete the Experiment class; ensure full functionality for 23andMe and AncestryDNA files, as well as unclassified types.
Install complete initial pipeline on GECO.
3/31/15
Completed this Week:
- SNPScraper: Complete 4/3/15
- Obtains all 23andMe data from PGP
- Includes: huID, profile link, experiment download link, sequencing source (23andMe), and all health records (HTML table)
- The myCoGe Pipeline: In Progress
(Attachments: "myCoGe Pipeline: Current Status"; "myCoGe Functional Filetree")
To-Do Next Week:
- Load data into CoGe notebook
- Develop way to replace "_directory.txt" with a notebook query for experiment
- Import log development, email functionality
3/31/15
Completed this Week:
- SNPScraper:
- Fixed (was broken somehow)
- Updated:
- Performs data-cleansing functions that were previously in "json-decoder", simplifying and streamlining the pipeline.
- Obtains profile links.
- Compiled components into a (semi-functional) pipeline:
- myCoGe.py - library containing all necessary functions.
- initiate_myCoGe.py - script which executes all functions in order. Consists of 9 steps:
- Execute SNPScraper: (functional)
- Scrapes *** for SNP datasets
- Returns JSON of huIDs, download links, and health info
- Decode JSON: (functional)
- Converts JSON into a Python Dictionary
- Returns Dictionary of huIDs, download links, and health info
- Compare dictionary with File Directory (_directory.txt) (TODO)
- Compares scraped data with a directory of previously scraped data
- Returns a modified dictionary of only the missing data.
- Download Data (functional)
- Downloads missing data.
- Update File Directory (TODO)
- Updates file directory to include newly downloaded data.
- Generate Metadata (functional)
- Generates metadata file for new data
- Exports in a .tsv format.
- Convert to VCF. (functional)
- Converts TSV SNP datasets to VCF.
- Transfer to iRODS (tentatively functional)
- Transfers new VCF to iRODS ("iPlant Datastore")
- Batchload into CoGe (TODO)
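For the still-missing step 3, the comparison could be a simple set difference between scraped huIDs and those recorded in _directory.txt. A sketch, assuming the directory file holds one previously downloaded huID per line:

```python
def find_missing(scraped, directory_path):
    """Return only the entries not already in the file directory.

    scraped: dict of {huID: download_link} from the decoded JSON.
    Assumes _directory.txt lists one downloaded huID per line.
    """
    try:
        with open(directory_path) as fh:
            known = {line.strip() for line in fh if line.strip()}
    except FileNotFoundError:
        known = set()  # first run: nothing has been downloaded yet
    return {hu_id: link for hu_id, link in scraped.items()
            if hu_id not in known}
```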
To-Do Next Week:
- Finalize SNPScraper (Due Date: 4/3/15)
- Complete missing scripts (Due Date: 4/5/15)
- #3, 5, 9
- Implement functional script on GECO (Due Date: 4/6/15)
- Import first set of data into CoGe. (Due Date: 4/7/15)
Up Next: Render data visualizations in JBrowse
3/24/15
Completed this Week:
- Restructured obtain-data.py
- Uses updated data format
- Downloads files, unzips and renames when necessary
- Started work on SNPScraper to obtain health records - not complete.
- Reformatted many of the scripts - most processes now as functions, and (almost) all comply with PEP8 style conventions.
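The download-unzip-rename behavior of obtain-data.py can be sketched as follows (the URL handling and renaming rules of the real script are not recorded here, so this is an illustrative stand-in):

```python
import gzip
import os
import shutil
import urllib.request

def fetch_dataset(url, dest_path):
    """Download one dataset; gunzip it if it arrives compressed.

    Sketch of obtain-data.py's behavior only; the real script's
    naming scheme and URL sources are assumptions here.
    """
    raw_path = dest_path + ".download"
    urllib.request.urlretrieve(url, raw_path)
    if is_gzip(raw_path):
        with gzip.open(raw_path, "rb") as src, open(dest_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(raw_path)
    else:
        os.rename(raw_path, dest_path)

def is_gzip(path):
    # gzip files start with the two magic bytes 1f 8b
    with open(path, "rb") as fh:
        return fh.read(2) == b"\x1f\x8b"
```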
To-Do Next Week:
- Finalize SNPScraper for health record obtaining (Due Date: 3/27/15)
- Work on a script to load into iRODS. (Due Date: 3/29/15)
- Compile components into single workflow (Due Date: 3/31/15)
- Assess any missing components, create list for next week.
3/10/15
Completed this Week:
- Developed metagenerator.py - a library with two functions that allow for a metadata file to be generated for multiple datasets (in TSV, as dictated by CoGe). This is one component of the batch uploading requirements.
- Studied the PGP website to determine how to obtain health record information associated with each data set.
- Started learning how to expand SNPScraper to obtain this data.
- Linked PyCharm with GitHub for easy version management. Also, wrote a subpage on how this works for those who are interested.
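The shape of metagenerator.py can be sketched like this; note that the column names below are placeholders, since the exact metadata fields CoGe dictates are not recorded in this entry:

```python
import csv

# Placeholder columns for illustration only; the actual fields
# CoGe requires for batch uploading are not listed in this entry.
FIELDS = ["name", "description", "source", "version"]

def make_record(hu_id, source):
    """Build one metadata record for a dataset (illustrative values)."""
    return {"name": hu_id,
            "description": "PGP SNP dataset for " + hu_id,
            "source": source,
            "version": "1"}

def write_metadata(records, out_path):
    """Write all records to a TSV metadata file."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        writer.writerows(records)
```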
To-Do Next Week:
- Restructure obtain-data.py to use updated data format. (Due Date: 3/13/15)
- Determine how to upload the obtained datasets to iRODS, and incorporate this into a function or script. (Due Date: 3/15/15)
- Work on a script to batchload into CoGe. (Due Date: TBD)
3/1/15
Project
It has been a rather productive Sunday. Here is what is done:
- Working web-crawler developed that extracts all available 23andMe SNP Variant data set download links, along with their associated huIDs from PGP's website. Scraped data is exported as a JSON object.
- Wrote a script to convert the JSON object into a Python dictionary (for future processing).
- Figured out how to issue command-line commands through Python, using the "subprocess" module.
Code
1. snpscraper/ - Folder containing current working version of the SNPScraper web-crawler that extracts 23andMe variant data from PGP. Built on Scrapy.
- ./snpscraper/items.py - Items portion of the SNPScraper (GitHub: items.py)
- ./snpscraper/spiders/snp_spider.py - Web-crawler portion of SNPScraper (GitHub: snp_spider.py)
2. json-decoder.py: A short script which converts the SNPScraper output JSON object into a Python dictionary with a { "huid": "Download Link", [...] } structure. (GitHub: json-decoder.py)
3. sub.py: A short script I used to figure out how to issue command-line commands through Python. In this example, it simply runs json-decoder.py. (GitHub: sub.py)
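The subprocess trick from sub.py boils down to a few lines. At the time (2015, likely Python 2) this would have used subprocess.call; the modern equivalent is subprocess.run. A trivial inline command stands in for json-decoder.py so the sketch is self-contained:

```python
import subprocess
import sys

# Run another program from Python and capture its output.
# sub.py invoked json-decoder.py this way; here a trivial inline
# command substitutes for the script so the example runs anywhere.
result = subprocess.run(
    [sys.executable, "-c", "print('decoded')"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # decoded
```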
2/27/15: Friday Update
Project Update
Over this week, I have been working out some final details on how this project will work and what exactly is necessary. I have been in contact with both Matt (lead CoGe developer) regarding how some of these items will integrate into CoGe, and Madeleine (Senior Research Scientist with the Personal Genome Project) about accessing and downloading their data.
Brief summary of conversation with Matt:
- All SNP data from the Personal Genome Project will be batch loaded as experiments into a designated "Notebook".
- Each week, this notebook will be compared with the Personal Genome Project, and any additional datasets will be added ("differences resolved")
- A custom jBrowse "rendering" will analyze SNP content in real time as people browse through the genome (see the jBrowse Integration Concept attachment).
Brief summary of conversation with Madeleine:
- SNP data exists in a number of formats in the PGP servers, from multiple sources including (not necessarily limited to) 23andMe, Family Tree DNA, AncestryDNA, and Genographic. Additionally, CompleteGenomics includes SNP data, but is accessed off a different server (a Google cloud bucket).
- There is no easy way to access the SNP data on the PGP servers (no API), so I will need to write a web-scraping script to gather this data.
- PGP may create JSON representations of their profiles (those people who have contributed data), which would make this way easier. However, they don't know if and when this will be done, so for now I will be moving forward with web scraping.
Looking Forward: This weekend's main objective is going to be getting this web scraping started (and hopefully finished, but I need to see what I am getting myself into first). This aspect has been officially termed "SNPScraper". See the SNPScraper sub-page, which documents this sub-project.
2/24/15
Code
1. file-detect.py v.1.0.0 - Distinguishes between 23andMe data and Family Tree DNA data, with easy additions for more (as I get conversion scripts working). (GitHub: file-detect.py)
2. obtain-data.py v.0.1.0 - Automatically pulls data from PGP, currently using a document in tab-delimited format. Still under development. (GitHub: obtain-data.py)
Project
Conversion scripts are coming along, in addition to an automated downloader. Individual scripts are largely working, but now focus will be on switching towards a larger integrated program.
jBrowse Integration Concept uploaded (v.1.0) (see above).
2/17/15
Code
1. 23me2VCF.py updated to v.1.0.1, with one bug fixed. (https://github.com/asherkhb/PLS599/blob/master/23me2vcf.py)
2. Began scripting an input file type detection script. Eventually, this script will allow many different types of personal genetic data files to be automatically detected and converted to appropriate formats. It will be integrated into a larger suite of tools that comprise the "process/format" section of phase 1 (see "Process/Format" Concept Map).
Project
Yesterday, I got sample data uploaded to CoGe, then tried to visualize it in JBrowse. It wasn't working (the experiment was not displayed), so I contacted Matt. Matt investigated and fixed a bug in CoGe that was causing the data not to be displayed. SNPs from the example data set can now be seen overlaid on human genome v.37.74.
2/15/15
Code
Script for converting human variant data from tab-separated values (TSV) format to Variant Call Format 4.2 (VCF), v.1.0.0, completed and working.
Find script on GitHub: https://github.com/asherkhb/PLS599/blob/master/23me2vcf.py
- 23me2VCF.py:
- Things Learned:
- strftime() - Used to format times into a desired layout. VCF metadata requires the date of generation in year-month-day format, with the year in century format and month/day both zero-padded (i.e. February 15th, 2015 would be 20150215). Using the strftime() function, getting the date into this format was very easy.
- '\r' - A really annoying extra newline character (carriage return). It was causing bugs, and thus needed to be stripped (using the string .strip() method).
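Both of these lessons fit in a few lines of standard-library Python:

```python
from datetime import date

# VCF metadata wants the generation date as YYYYMMDD:
# century-style year, zero-padded month and day.
file_date = date(2015, 2, 15).strftime("%Y%m%d")
print(file_date)  # 20150215

# Stray '\r' carriage returns at line ends caused the bugs noted
# above; stripping them (along with '\n') cleans each record.
clean = "rs4477212\t1\t82154\tAA\r\n".rstrip("\r\n")
```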
Project
First Sample Dataset (from conversion script) uploaded to CoGe manually, using LoadExperiment. Still working on viewing this experiment.
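A single-record sketch of the TSV-to-VCF conversion 23me2VCF.py performs. 23andMe's raw format is rsid/chromosome/position/genotype; the REF allele below is a placeholder argument, since determining the true reference allele requires a reference-genome lookup that this sketch omits:

```python
def tsv_row_to_vcf(line, ref_allele="N"):
    """Convert one 23andMe TSV record to a VCF 4.2 data line.

    Sketch only: ref_allele defaults to a placeholder because the
    real script must consult the reference genome for REF.
    """
    rsid, chrom, pos, genotype = line.rstrip("\r\n").split("\t")
    # Genotype alleles that differ from REF become ALT; '.' if none.
    alts = sorted(set(a for a in genotype if a != ref_allele)) or ["."]
    # VCF 4.2 data columns: CHROM POS ID REF ALT QUAL FILTER INFO
    return "\t".join([chrom, pos, rsid, ref_allele, ",".join(alts),
                      ".", ".", "."])
```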