Asher Baltzell - Homework 1

Homework #1: FASTQ --> FASTA Conversion

While there are many ways to convert FASTQ files into FASTA files, I chose to write a simple Python script titled "fastq2fasta.py". The BioPython library already provides a number of functions for manipulating biological sequence data, and its SeqIO package is especially useful for this. The Bio.SeqIO.convert() function converts directly between many different formats with a single line of code.

Using Bio.SeqIO.convert(), along with some command-line argument parsing and file-writing code, I created a simple Python script that converts a FASTQ file into a corresponding FASTA file.
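To show what the conversion itself involves, here is a minimal sketch that does it by hand, without BioPython. It is not the code in fastq2fasta.py: the function name fastq_to_fasta and the two sample reads are made up for illustration, and it assumes the simple four-line FASTQ layout (header, sequence, '+', quality) with no wrapped sequences.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy each FASTQ record to FASTA, dropping the quality lines.

    Assumes four lines per record (no wrapped sequences).
    Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq_handle.readline().rstrip("\r\n")
        if not header:
            break  # end of file
        if not header.startswith("@"):
            raise ValueError("Expected '@' at start of record: %r" % header)
        sequence = fastq_handle.readline().rstrip("\r\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality string, ignored
        # FASTA headers start with '>' instead of '@'
        fasta_handle.write(">%s\n%s\n" % (header[1:], sequence))
        count += 1
    return count


# Small in-memory demonstration with two made-up reads
example_fastq = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\n!!!!\n"
)
example_fasta = io.StringIO()
n_records = fastq_to_fasta(example_fastq, example_fasta)
print(example_fasta.getvalue())
```

Real FASTQ files can be messier than this, which is one reason to let Bio.SeqIO.convert() do the work instead.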

Source code can be found in my GitHub repository for the course: https://github.com/asherkhb/PLS599

Here it is:

Script: Most of the code in the script parses command-line arguments so that the user can specify which file (and where) they would like to convert.
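For readers who want a feel for that argument handling, one possible way to write it is sketched here. This is an assumption, not a copy of fastq2fasta.py: it uses the standard argparse module, and the names parse_args, default_output_name, the -o/--output option, and the .fasta extension are all hypothetical choices.

```python
import argparse
import os


def default_output_name(inputfile):
    """Return the input path with its extension replaced by .fasta."""
    root, _ext = os.path.splitext(inputfile)
    return root + ".fasta"


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(
        description="Convert a FASTQ file to FASTA.")
    parser.add_argument("inputfile",
                        help="path to the FASTQ file to convert")
    parser.add_argument("-o", "--output",
                        help="output path (default: input name with .fasta)")
    args = parser.parse_args(argv)
    if args.output is None:
        # Share the input file's name, but with a FASTA extension
        args.output = default_output_name(args.inputfile)
    return args


# Example with an explicit argument list instead of the real command line
args = parse_args(["reads/sample.fastq"])
print(args.inputfile, args.output)
```

Passing an explicit list to parse_args() makes the parsing easy to try out without actually running the script from a shell.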

The core functionality of the script is in lines 39 to 43:

Explained:

  • with open(inputfile, 'r') as inpt:
    • Opens the file specified by the user at the command prompt in read-only mode and stores the handle as "inpt"
  • with open(outputfile, 'w') as otpt:
    • Creates and opens a file in write mode. The file shares its name with the input file (lines 28 to 31) but has a FASTA extension instead of the FASTQ one
  • SeqIO.convert(inputfile, "fastq", outputfile, "fasta")
    • The SeqIO package contains the convert() function, which can easily convert between different biological sequence file formats.
    • Usage: Bio.SeqIO.convert(in_file, in_format, out_file, out_format)
    • This opens the input file, which is specified as a FASTQ, then converts to "outputfile" as a FASTA
  • otpt.write(outputfile)
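    • Writes the string stored in outputfile (the output file name) to the open handle. SeqIO.convert() has already written the converted records to that file by this point, so this call is unnecessary and may overwrite the start of the converted output.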
  • print('Your FASTQ has been converted. See %s' % outputfile)
    • Informs the user that the operation has completed and tells them the name of the output file.
    • %s is used as a placeholder, and the following % outputfile tells Python to replace the %s with the name of the output file. Keeping the % inside the parentheses applies the formatting to the string itself, which works in both Python 2 and Python 3.
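Putting these steps together, a condensed sketch of the core conversion could look like the following. It is a simplified variant rather than the exact lines from fastq2fasta.py: the helper names convert_fastq_to_fasta and report are hypothetical, and the record count in the message is an addition. It relies on Bio.SeqIO.convert() accepting file names directly and returning the number of records converted.

```python
def convert_fastq_to_fasta(inputfile, outputfile):
    """Convert inputfile (FASTQ) to outputfile (FASTA); return the record count."""
    # Imported here so the sketch can be loaded even where BioPython
    # is not installed; normally this import would sit at the top.
    from Bio import SeqIO

    # SeqIO.convert() opens and closes both files itself when given
    # file names, so no separate open() or write() call is needed.
    return SeqIO.convert(inputfile, "fastq", outputfile, "fasta")


def report(outputfile, count):
    """Build the message shown to the user after a conversion."""
    return "Your FASTQ has been converted (%d records). See %s" % (
        count, outputfile)
```

With BioPython installed, print(report("sample.fasta", convert_fastq_to_fasta("sample.fastq", "sample.fasta"))) would run the conversion and print the message; the file names here are placeholders.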

I downloaded some test FASTQ files from https://wiki.cgb.indiana.edu/display/isga/Sample+FastQ+Files

Here are the results...



Some Resources I Used:

* Install GCC: http://stackoverflow.com/questions/11442970/numpy-and-scipy-for-preinstalled-python-2-6-7-on-mac-os-lion

* Download/Install NumPy and SciPy: http://www.scipy.org/scipylib/building/macosx.html

* Download/Install BioPython: http://biopython.org/wiki/Download

* SeqIO Documentation: http://biopython.org/wiki/SeqIO

* Convert in SeqIO: http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html#convert