Update - 4-21-15

  • I've been working on my script to import Bismark's Methylation Extractor output file into CoGe.
  • The program provides separate files for each sequence context (CG, CHG, and CHH).
  • After stripping out the header and sorting by chromosome and position, this is essentially the format (tab delimited):
  • What I need is an output in this format for loading into CoGe (csv or tsv):
    Chr#,Position,% Methylation (as decimal),# of methylated out of total reads
  • I played around with this for a while, but didn't get very far. Progress is on my GitHub.

Rough Pseudocode:

Read in tsv

Make a dictionary

Iterate over the rows in the tsv and find entries that share the same chromosome and position (columns 3 and 4); store that pair as a tuple (Chr#, Position).

For each set of rows with the same position, count how many have a + and how many have a - in column 2, then divide the count of + by the total.
Store the count of methylated (+) reads and the total number of reads for that position as a list [methylated, total], so the fraction is methylated / total.

Add each to the dictionary: the tuple will be the key (immutable), the list will be the value.

Write dictionary to output file
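
A minimal Python sketch of the pseudocode above, not a finished importer. It assumes the column layout described in the pseudocode (column 2 holds + or -, column 3 the chromosome, column 4 the position). The function name summarize_methylation and the "methylated/total" form of the last output column are my own guesses at what is wanted, so adjust them to whatever CoGe actually expects.

```python
import csv
from collections import defaultdict


def summarize_methylation(in_path, out_path):
    """Collapse per-read calls into one row per (chromosome, position)."""
    # key: (chromosome, position) tuple; value: [methylated reads, total reads]
    counts = defaultdict(lambda: [0, 0])

    with open(in_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and any leftover header line.
            if len(row) < 4 or row[1] not in ("+", "-"):
                continue
            state, chrom, pos = row[1], row[2], int(row[3])
            entry = counts[(chrom, pos)]
            entry[1] += 1          # every call counts toward the total
            if state == "+":
                entry[0] += 1      # "+" marks a methylated call

    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Sorted by chromosome name (as text), then by numeric position.
        for (chrom, pos), (meth, total) in sorted(counts.items()):
            # Last column format ("methylated/total") is an assumption.
            writer.writerow([chrom, pos, round(meth / total, 4), f"{meth}/{total}"])
```

Keeping counts in the dictionary rather than the fraction means the input does not have to be sorted first, since reads for the same position can arrive in any order and the division happens only once at write time. The whole dictionary lives in memory, which may be a problem for very large CHH files.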

Does this sound like a reasonable plan? Ideas welcome.