Working with Sequencing Data

Sequencing data, whether it comes from RNASeq, genome sequencing, or other related studies, has become key to unlocking a lot of information about an organism in a few experiments. Sequencing data is biological data, so its file formats and quality checks reflect how the sequence was produced. If you are new to working with sequencing data, you might find the following helpful.

Sequencing data – what does it look like?

Sequencing data commonly comes in the form of plain text files, just like many Linux files.
The content of the files is organized into a specific format, most commonly fastq.
The fastq format is similar to fasta, but it also includes a quality score for each base of each sequence.
Each sequence in a fastq file takes up 4 lines of text, and you can see these just by clicking on your file in the DE to open it for viewing. It's a good idea to look at your data from time to time, so you know what you are working with.
This is what those 4 lines generally look like in a newer Illumina file (HiSeq reads):
@HWI-ST700677:123:C0MBEACXX:7:1101:1087:2556 1:N:0:CAGATC
CAGAGAAGAAGTACCCAAAAATATCAGCAAGACGATGAAGAAGAAGACGACTGTATGGATCCATTGGCTCCAACTCTAAGATCCCATCTATGCTGCAAAC
+
BCCFFDFFHHHDFHJJJJJJJJJJJJJJJIJJJJHIHIIJHIJGIIJJJJJIJHIJJJJJJJJHHHHHGFFDFEECEEEEDDDDDCDDCDEEDEDDDCDD
The first line for a read starts with "@" and is followed by a series of fields, commonly including a space and then a 1 or 2 for paired read files. The second line gives the actual sequence. The third line starts with a "+" and usually contains nothing else. The last line has exactly the same number of characters as the second (sequence) line, but each character is an ASCII-encoded quality score for the base at the same position. Typically these are Sanger formatted reads and scores (Phred+33), where the score is the character's ASCII code minus 33.
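To make the 4-line layout and the quality encoding concrete, here is a minimal Python sketch. It assumes an uncompressed fastq file with exactly 4 lines per read (no wrapped sequence lines) and Sanger (Phred+33) scores. The names `FastqRecord`, `read_fastq`, and `phred33_scores`, and the short toy reads, are made up for this illustration.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header line without the leading "@"
    sequence: str  # the bases
    quality: str   # one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a fastq file that uses exactly 4 lines per read."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\n")
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed fastq record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred33_scores(quality: str) -> List[int]:
    """Decode Sanger (Phred+33) quality characters into integer scores."""
    return [ord(char) - 33 for char in quality]


# Toy input standing in for a real file opened with open(path).
example = io.StringIO("@read1 1:N:0:CAGATC\nCAGAG\n+\nBCCFF\n")
for record in read_fastq(example):
    print(record.name, record.sequence, phred33_scores(record.quality))
```

For real data you would pass an open file handle instead of the `io.StringIO` object. Established libraries such as Biopython also read fastq files and are a better choice for production work.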
The name or header line of a fastq read may also carry information about the overall quality of the read, or a tag (index) sequence that is used for separating a multiplexed file back into its component samples. In the example above, the "N" after the read number is the filter flag, and "CAGATC" is the tag sequence.
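As a rough illustration of what the header fields mean, the sketch below splits a newer-style Illumina header like the one shown above into named parts. The field names follow the commonly documented Illumina layout (instrument, run, flowcell, lane, tile, x, y, then read number, filter flag, control number, index sequence). The function name `parse_illumina_header` and the dictionary keys are made up for this example, and headers from other instruments or pipelines may not fit this layout.

```python
from typing import Dict, Union


def parse_illumina_header(header: str) -> Dict[str, Union[str, int, bool]]:
    """Split a newer-style Illumina fastq header into named fields."""
    location, _, description = header.lstrip("@").strip().partition(" ")
    instrument, run, flowcell, lane, tile, x_pos, y_pos = location.split(":")
    read, filtered, control, index = description.split(":")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x_pos),
        "y": int(y_pos),
        "read": int(read),                # 1 or 2 for paired read files
        "is_filtered": filtered == "Y",   # "Y" means the read failed the filter
        "control": int(control),
        "index": index,                   # tag sequence used for demultiplexing
    }


fields = parse_illumina_header(
    "@HWI-ST700677:123:C0MBEACXX:7:1101:1087:2556 1:N:0:CAGATC"
)
print(fields["read"], fields["is_filtered"], fields["index"])
```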
Now that you have your sequencing data, and you have some idea what the contents of the files are, what do you do with it? You may be tempted to map it to a genome for RNASeq studies, or maybe you want to assemble genome sequencing data into a new genome sequence. Whatever your reason for sequencing the sample in the first place, you should avoid jumping into analysis immediately.
Looking at the data by eye can tell you something about the format, read length, and so on, but it won't tell you how reliable your data is. The very first step for any type of study is to analyze the quality of the data with FastQC or Prinseq Graph. After that, you will probably need to trim the reads to make sure you are using reliable sequence for mapping and assembly.
The most comprehensive way to do the necessary analysis and trimming is to use the first steps of the HTProcess pipeline. HTProcess-prepare_directories_and_run_fastqc will give you a broad, graphical view of your different read files, and HTProcess_trimmomatic will help you trim the reads of extraneous sequences (e.g. adapters or primers) and of low quality, unreliable sequence.
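To give a feel for what quality trimming means, here is a small Python sketch that removes low quality bases from the 3' end of a read and reports a mean quality score. This is only an illustration of the idea and is not how FastQC, Trimmomatic, or the HTProcess apps are implemented. It assumes Phred+33 scores, and the function names and the threshold of 20 are arbitrary choices for the example.

```python
from typing import Tuple


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average Phred score of a read (0.0 for an empty read)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def trim_trailing_low_quality(
    sequence: str, quality: str, min_score: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Drop bases from the 3' end while their score is below min_score."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_score:
        end -= 1
    return sequence[:end], quality[:end]


# Toy read: the last base has a score of 2, the one before it a score of 20.
trimmed_seq, trimmed_qual = trim_trailing_low_quality("ACGTACGT", "IIIII#5#")
print(trimmed_seq, trimmed_qual, mean_quality(trimmed_qual))
```

Note that this simple approach stops at the first acceptable base from the 3' end, so a low quality base in the middle of the read (the "#" in the toy example) is kept. Dedicated trimming tools offer more careful strategies, and they also remove adapters, which this sketch does not attempt.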
Viewing an older Illumina sequencing file in the DE:

Note that the name format is simpler and that the pair number (1) comes after a forward slash.
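Because the pair number sits in a different place in the two header styles, a script that handles both needs to check each one. The sketch below is one hedged way to do that. The function name `pair_number` is made up, and the older-style header in the usage lines is an illustrative example of that format rather than a read from this page.

```python
from typing import Optional


def pair_number(header: str) -> Optional[int]:
    """Return 1 or 2 for a paired read header, or None if no pair number is found."""
    name, _, description = header.lstrip("@").strip().partition(" ")
    # Newer style: the pair number is the first field after the space.
    first_field = description.split(":")[0]
    if first_field in ("1", "2"):
        return int(first_field)
    # Older style: the pair number follows a forward slash at the end of the name.
    if name.endswith(("/1", "/2")):
        return int(name[-1])
    return None


print(pair_number("@HWI-ST700677:123:C0MBEACXX:7:1101:1087:2556 1:N:0:CAGATC"))
print(pair_number("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```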

More about fastq files:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/

https://en.wikipedia.org/wiki/FASTQ_format
