MAKER Genome Annotation using cc-tools and Jetstream (WQ-MAKER)

This is the old version of WQ-MAKER. Please use the latest version instead: MAKER 2.31.9 with CCTOOLS Jetstream Tutorial

Rationale and background:

MAKER is a flexible and scalable genome annotation pipeline that automates the many steps necessary for the detection of protein coding genes (Campbell et al. 2013). MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality indices. MAKER was developed by the Yandell Lab and is described in several publications (Cantarel et al. 2008; Holt & Yandell 2011). Additional background is available at the MAKER Tutorial at GMOD and is highly recommended reading.

MAKER with CCTools (aka WQ-MAKER) is a modified version of the MAKER annotation tool that can run MAKER on distributed computing resources such as the Jetstream cloud (Thrasher et al., 2012). Using the Work Queue platform, users can run MAKER across multiple virtual machines to achieve a several-fold reduction in the duration of a MAKER run.

This tutorial will take users through steps of:

Running WQ-MAKER on Jetstream cloud
Running WQ-MAKER on example genome assembly data

Considerations

Sounds great, what do I need to get started?

XSEDE account
XSEDE allocation: To get started quickly, users can request a Jetstream cloud allocation for running WQ-MAKER through the CyVerse form. Later on, they can request a startup XSEDE allocation.
Your data (or you can run example data)

What kind of data do I need?

Mandatory requirements
1. Genome assembly (fasta file)
2. Organism type
  1. Eukaryotic (default, set as: organism_type=eukaryotic)
  2. Prokaryotic (set as: organism_type=prokaryotic)
Additional data that can be used to improve the annotation (Highly recommended)
1. RNA evidence (at least one of them is needed)
  1. Assembled mRNA-seq transcriptome (fasta file)
  2. Expressed sequence tags (ESTs) data (fasta file)
  3. Aligned EST or transcriptome GFF3 from your organism
  4. Aligned EST or transcriptome GFF3 from a closely related organism
2. Protein evidence
  1. protein sequence file in fasta format (i.e. from multiple organisms)
  2. protein gff (aligned protein homology evidence from an external GFF3 file)

What kind of resources will I need for my project?

Enough storage space on the WQ-MAKER Jetstream instance for both input and output files
1. Creating an external volume and mounting it on the running WQ-MAKER MASTER instance is recommended
One Master and several workers needed for running your computation
1. Benchmarking results for data sets can help you estimate the number of workers needed for running your annotation
Enough AUs to run your computation

Part 1: Connect to an instance of a WQ-MAKER Jetstream Image (virtual machine)

Step 1. Go to https://use.jetstream-cloud.org/application and log in with your XSEDE credentials.

Step 2. Click "Create New Project" in the Project tab at the top, then enter a name and a brief description for the project

Step 3. Launch an instance from the selected image and name it as MASTER

After creating the project and opening it, click the "New" button, select the "MAKER 2.31.8 with CCTools" image, and then click "Launch Instance". In the next window (Basic Info),

name the instance as "MASTER" (don't worry if you forgot to name the instance at that point, as you can always modify the name of the instance later)
set base image version as "2.0" (default)
leave the project as it is or change to a different project if needed
select "Jetstream - Indiana University" or "Jetstream - TACC" as Provider and click "Continue". Your choice of provider will depend on the resources you have available (AUs) and the needs of your instance
select "m1.medium" as Instance size (this is the minimum size that is required by WQ-MAKER image) and click "Continue".

Check the Projected Resources Usage to confirm that you have enough resources to run the instance. If you need more AUs or CPUs to run instances, contact the Jetstream team at support@iplantcollaborative.org. Review the details of the instance you are about to launch and click "Launch Instance"

Step 4. Click on the MASTER instance, and launch a web shell

or ssh into its IP address from a terminal.

$ ssh upendra@<ip.address>

Step 5. Add public SSH key of MASTER to Jetstream

If you do not already have a ~/.ssh/id_rsa.pub file, then run this command to create it. Skip this step if you already have it.

$ ssh-keygen

Copy the public SSH key from your id_rsa.pub file, paste it into the settings page at https://use.jetstream-cloud.org/application/settings, give it a name, and click confirm.

$ cat ~/.ssh/id_rsa.pub
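
The "create the key only if it is missing" logic in this step can be sketched as a small shell function. This is a minimal sketch, not part of the image: the function name ensure_ssh_key is hypothetical, and generating the key with an empty passphrase (-N '') is an assumption made so that the function runs without prompting.

```shell
# Print the public key for KEY (default ~/.ssh/id_rsa), creating the key
# pair first if the .pub file does not exist yet.
# Assumption: an empty passphrase is acceptable for this key.
ensure_ssh_key() {
    key=${1:-$HOME/.ssh/id_rsa}
    if [ ! -f "$key.pub" ]; then
        mkdir -p "$(dirname "$key")"
        ssh-keygen -q -t rsa -N '' -f "$key" || return 1
    fi
    cat "$key.pub"
}
```

Calling ensure_ssh_key with no argument prints the key to paste into the Jetstream settings page.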

Step 6. Launch WORKER instances from MAKER 2.31.8 with CCTools image

Following Step 3, launch one or more instances from the MAKER 2.31.8 with CCTools image and name them WORKER-1, WORKER-2, and so on.

By default, WQ-MAKER groups 10 sequences from the FASTA file into a single job for each WORKER instance, but you can change the number of sequences per job. Choose the number of workers according to the number of chromosomes/contigs/scaffolds in your assembly and the AUs in your allocation. The number of WORKERS that can be kept busy is at most the number of sequences in the FASTA file. For this tutorial we will launch 4 worker instances (the FASTA file contains 12 sequences, so up to 12 WORKERS could be used to finish the job sooner)
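
To help choose a worker count, the arithmetic above can be sketched in shell. This is an illustrative helper rather than part of WQ-MAKER; the function names are hypothetical. It counts FASTA header lines and divides, rounding up, by the number of sequences per job.

```shell
# Count the sequences in a FASTA file (one ">" header line per sequence).
count_fasta_sequences() {
    grep -c '^>' "$1"
}

# Number of jobs for a given sequences-per-job setting, i.e. the largest
# number of workers that can be busy at once: ceil(n_seqs / per_split).
count_split_files() {
    n_seqs=$(count_fasta_sequences "$1")
    per_split=${2:-10}   # 10 is the default described in the text
    echo $(( (n_seqs + per_split - 1) / per_split ))
}
```

For a 12-sequence genome such as the example data, this gives 2 jobs at the default of 10 sequences per job and 12 jobs at 1 sequence per job.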

Step 7. As the instance is launched behind the scenes, you will get an update as it goes through each step.

Status updates of Instance launch (both MASTER and WORKER) include Build-requesting launch, Build-networking, Build-spawning, Active-networking, Active-deploying. Depending on the usage load on Jetstream, it can take anywhere from 2-5 mins for an instance to become active. You can force check updates by using the refresh button in the Instance launch page or the refresh button on your browser. Once the instance becomes active a virtual machine with the ip address provided will become available for you to connect to. This virtual machine will have all the necessary components to run WQ-MAKER and test files to run a MAKER demo.

Part 2: Running WQ-MAKER on volumes (Recommended)

Since the disk space of the m1.medium instance size (60 GB) selected for the MASTER instance may not be sufficient for most MAKER runs, running WQ-MAKER on a volume is recommended.

Step 1: Create a volume

Click the "New" button in the project and select "Create Volume". Enter the name of the volume, volume size (GB) needed and the provider (TACC or Indiana) and finally click "Create Volume"

Step 2: Attach the created volume to the MASTER instance

Step 3: Mount the volume to a specified drive.

Once you have logged in to the MASTER instance through the web shell or ssh, make the attached volume writable by your user so that you can work in it (the commands below assume the volume is available at /vol_b)

# Change the ownership and group permission on the mount location
$ sudo chown -hR <username> /vol_b
$ sudo chgrp -hR <username> /vol_b
# cd into the /vol_b and then run WQ-MAKER in there
$ cd /vol_b
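
Before starting a run on the volume, it can be worth confirming that it has enough free space. The following is a minimal sketch with a hypothetical function name; it reads the "Available" column of POSIX df output, and the threshold you pass is your own estimate, not a value prescribed by WQ-MAKER.

```shell
# Succeeds if the file system holding directory $1 has at least $2 GB free.
has_free_gb() {
    # df -P prints a header line, then one line for the file system;
    # with -k the fourth column is the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}
```

For example, has_free_gb /vol_b 100 succeeds only if at least 100 GB are free on the volume.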

Part 3: Set up a MAKER run using the Terminal window

Step 1. Get oriented. You will find staged example data in "/opt/WQ-MAKER_example_data/" within the MASTER instance. List its contents with the ls command:

$ ls /opt/WQ-MAKER_example_data/
maker_bopts.ctl  maker_exe.ctl  maker_opts.ctl  README  test_data
$ ls /opt/WQ-MAKER_example_data/test_data
mRNA.fasta  O.sativa.hmm  test_genome.fasta_000000  test_genome.fasta_000003  test_genome.fasta_000006  test_genome.fasta_000009
mRNA.fasta_nonNucl plant_repeats.fasta  test_genome.fasta_000001  test_genome.fasta_000004  test_genome.fasta_000007  test_genome.fasta_000010
msu-irgsp-proteins.fasta test_genome.fasta

The maker_*.ctl files are a set of configuration files that can be used for this exercise or generated as described below.
The fasta files include a scaled-down genome (test_genome.fasta) comprising the first 300 kb of each of the 12 chromosomes of rice. Each test_genome.fasta_* file contains a single chromosome.
mRNA sequences from NCBI (mRNA.fasta and mRNA.fasta_nonNucl)
publicly available annotated protein sequences of rice (MSU7.0 and IRGSP1.0) - msu-irgsp-proteins.fasta
a collection of plant repeats (plant_repeats.fasta)

Executables for running MAKER are located in /opt/maker/bin and /opt/maker/exe:

$ ls /opt/maker/bin/
cegma2zff       fasta_tool         maker           maker_functional        map_fasta_ids
chado2gff3      genemark_gtf2gff3  maker2chado     maker_functional_fasta  map_gff_ids
compare         gff3_merge         maker2eval_gtf  maker_functional_gff    mpi_evaluator
cufflinks2gff3  iprscan2gff3       maker2jbrowse   maker_map_ids           mpi_iprscan
evaluator       iprscan_wrap       maker2wap       map2assembly            tophat2gff3
fasta_merge     ipr_update_gff     maker2zff       map_data_ids  
         
$ ls /opt/maker/exe/
augustus  blast  exonerate  RepeatMasker  snap

As the names suggest, the "/opt/maker/bin" directory includes many useful auxiliary scripts. For example, cufflinks2gff3 converts output from an RNA-seq analysis into a GFF3 file that can be used as input evidence for WQ-MAKER. RepeatMasker, augustus, blast, exonerate, and snap are programs that MAKER uses in its pipeline. We recommend reading the MAKER Tutorial at GMOD for more information about these.

Step 2. Set up a WQ-MAKER run. Create a working directory called "maker_run" in your home directory using the mkdir command and use cd to move into that directory:

$ mkdir maker_run
$ cd maker_run

Step 3. Copy the "test_data" directory from "/opt/WQ-MAKER_example_data" into the current directory using the cp -r command. Verify using the ls command.

$ cp -r /opt/WQ-MAKER_example_data/test_data .
$ ls 
test_data

Step 4. Run the maker command with the --help flag to get a usage statement and list of options:

$ maker --help

MAKER version 2.31.8

Usage:
     maker [options] <maker_opts> <maker_bopts> <maker_exe>

Description:
     MAKER is a program that produces gene annotations in GFF3 format using
     evidence such as EST alignments and protein homology. MAKER can be used to
     produce gene annotations for new genomes as well as update annotations
     from existing genome databases.

     The three input arguments are control files that specify how MAKER should
     behave. All options for MAKER should be set in the control files, but a
     few can also be set on the command line. Command line options provide a
     convenient machanism to override commonly altered control file values.
     MAKER will automatically search for the control files in the current
     working directory if they are not specified on the command line.

     Input files listed in the control options files must be in fasta format
     unless otherwise specified. Please see MAKER documentation to learn more
     about control file  configuration.  MAKER will automatically try and
     locate the user control files in the current working directory if these
     arguments are not supplied when initializing MAKER.

     It is important to note that MAKER does not try and recalculated data that
     it has already calculated.  For example, if you run an analysis twice on
     the same dataset you will notice that MAKER does not rerun any of the
     BLAST analyses, but instead uses the blast analyses stored from the
     previous run. To force MAKER to rerun all analyses, use the -f flag.

     MAKER also supports parallelization via MPI on computer clusters. Just
     launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
     configured during the MAKER installation process for this to work though

Options:

     -genome|g <file>    Overrides the genome file path in the control files

     -RM_off|R           Turns all repeat masking options off.

     -datastore/         Forcably turn on/off MAKER's two deep directory
      nodatastore        structure for output.  Always on by default.

     -old_struct         Use the old directory styles (MAKER 2.26 and lower)

     -base    <string>   Set the base name MAKER uses to save output files.
                         MAKER uses the input genome file name by default.

     -tries|t <integer>  Run contigs up to the specified number of tries.

     -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
                         Note: this is for BLAST and not for MPI!

     -force|f            Forces MAKER to delete old files before running again.
                         This will require all blast analyses to be rerun.

     -again|a            recaculate all annotations and output files even if no
                         settings have changed. Does not delete old analyses.

     -quiet|q            Regular quiet. Only a handlful of status messages.

     -qq                 Even more quit. There are no status messages.

     -dsindex            Quickly generate datastore index file. Note that this
                         will not check if run settings have changed on contigs

     -nolock             Turn off file locks. May be usful on some file systems,
                         but can cause race conditions if running in parallel.

     -TMP                Specify temporary directory to use.

     -CTL                Generate empty control files in the current directory.

     -OPTS               Generates just the maker_opts.ctl file.

     -BOPTS              Generates just the maker_bopts.ctl file.

     -EXE                Generates just the maker_exe.ctl file.

     -MWAS    <option>   Easy way to control mwas_server for web-based GUI

                              options:  STOP
                                        START
                                        RESTART

     -version            Prints the MAKER version.

     -help|?             Prints this usage statement.

Step 5. Create control files that tell MAKER what to do. Three files are required:

maker_opts.ctl - gives location of input files (genome and evidence) and sets options that affect MAKER behavior
maker_exe.ctl - gives path information for the underlying executables.
maker_bopts.ctl - sets parameters for filtering BLAST and Exonerate alignment results

To create these files run the maker command with the -CTL flag. Verify with ls:

$ maker -CTL
$ ls
maker_bopts.ctl  maker_exe.ctl  maker_opts.ctl  test_data

The "maker_exe.ctl" file is automatically generated with the correct paths to executables and does not need to be modified.
The "maker_bopts.ctl" file is automatically generated with reasonable default parameters and also does not need to be modified unless you want to experiment with optimization of these parameters.
The automatically generated "maker_opts.ctl" file does need to be modified in order to specify the genome file and evidence files to be used as input. You can edit it with the text editor "vi", which is already installed on the MASTER instance

If you are pressed for time, a pre-edited version of the "maker_opts.ctl" file is staged in /opt/WQ-MAKER_example_data. Delete the current file and copy the staged version here. Then skip to Step 6.

$ rm maker_opts.ctl
$ cp  /opt/WQ-MAKER_example_data/maker_opts.ctl .

Otherwise, open maker_opts.ctl in a text editor of your choice (only vi is installed on this image, but you can install any other editor)

$ vi maker_opts.ctl

Here are the sections of the "maker_opts.ctl" file you need to edit. Add path information to files as shown. For more information about the control files, see The_MAKER_control_files_explained.

Do not leave any spaces after the equal sign or inside the file paths

The files can be in the same directory as "maker_opts.ctl"; if they are in other directories, make sure you give their path relative to that directory
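
If you prefer not to edit the file by hand, the same change can be scripted. This is a minimal sketch using sed; the function name set_ctl_option is hypothetical and not part of MAKER. It assumes each option sits on its own line as key=value, optionally followed by a #comment, as in the generated control files, and that the value contains no "|", "&", or backslash characters.

```shell
# Set KEY=VALUE in a MAKER control file, keeping any trailing "#comment".
# Usage: set_ctl_option FILE KEY VALUE
set_ctl_option() {
    ctl_file=$1 ctl_key=$2 ctl_new=$3
    # First expression: lines that carry a comment. Second: lines without one.
    sed -e "s|^${ctl_key}=[^#]*#|${ctl_key}=${ctl_new} #|" \
        -e "s|^${ctl_key}=[^#]*\$|${ctl_key}=${ctl_new}|" \
        "$ctl_file" > "$ctl_file.tmp" && mv "$ctl_file.tmp" "$ctl_file"
}
```

For example, set_ctl_option maker_opts.ctl genome ./test_data/test_genome.fasta fills in the genome line. Because the pattern is anchored on the full key followed by "=", setting est does not touch est_gff or est2genome.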

This section pertains to specifying the genome assembly to be annotated and setting organism type:

#-----Genome (these are always required)
genome=./test_data/test_genome.fasta #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

The following section pertains to EST and other mRNA expression evidence. Here we are only using same species data, but one could specify data from a related species using the "altest" parameter. With RNA-seq data aligned to your genome by Cufflinks or Tophat one could use maker auxiliary scripts (cufflinks2gff3 and tophat2gff3) to generate GFF3 files and specify these using the est_gff parameter:

#-----EST Evidence (for best results provide a file for at least one)
est=./test_data/mRNA.fasta #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closely relate species in GFF3 format

The following section pertains to protein sequence evidence. Here we are using previously annotated protein sequences. Another option would be to use SwissProt or other database:

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=./test_data/msu-irgsp-proteins.fasta  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

This next section pertains to repeat identification:

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib=./test_data/plant_repeats.fasta #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

Various programs for ab initio gene prediction can be specified in the next section. Here we are using SNAP set to use an HMM trained on rice.

#-----Gene Prediction
snaphmm=./test_data/O.sativa.hmm #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

If you are using the "augustus_species" option in the maker_opts.ctl file (see above), then you need to copy the Augustus config directory from /opt/maker/exe/augustus to your current directory, where all the other files are present

$ cp -r /opt/maker/exe/augustus/config ./

Step 6. Run WQ-MAKER

Before running MAKER, check to make sure all worker instances have become active.

On the MASTER instance, make sure you are in the "maker_run" directory and all of your files are in place and then run:

$ nohup wq_maker -contigs-per-split 1 -N maker_run_ud -d all -o master.dbg -debug_size_limit=0 -stats test_out_stats.txt > log_file.txt &

-contigs-per-split 1 tells wq_maker to split the genome file into files of 1 contig/scaffold/sequence each. By default, wq_maker splits the FASTA file into 10 sequences per file, which is not ideal here because it would produce only 2 files (one containing chromosomes 1-10 and the other containing chromosomes 11-12), so at most 2 workers could annotate at a time. Each split file is run as a single job on a WORKER.
-N maker_run_ud sets the project name to maker_run_ud. A project name is mandatory for running WQ-MAKER. Use your own initials in place of "ud" (my initials are "ud"). The sample log output in this tutorial comes from a run named maker_run_test.
-d all Sets the debug flag for Work Queue. For all debugging output, try 'all'
-o master.dbg Sets the debug file for Work Queue
-debug_size_limit=0 Sets the size, in bytes, at which the debug file wraps around. 0 means it is never wrapped (default is 1M)
-stats test_out_stats.txt Specifies the file where Work Queue master stats are written
log_file.txt captures the stdout
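
The "all of your files are in place" check can be sketched as a pre-flight function to run before launching wq_maker. This is a hedged sketch, not part of WQ-MAKER: the function names are hypothetical, only the keys used in this tutorial are checked, and each key is assumed to hold a single path (MAKER also accepts comma-separated lists, which this sketch does not handle).

```shell
# Print the value of KEY from a MAKER control file, without comment or blanks.
ctl_value() {
    sed -n "s/^$2=\([^#]*\).*/\1/p" "$1" | tr -d ' \t'
}

# Succeeds only if the three control files exist in directory $1 and every
# input file named in maker_opts.ctl is found relative to that directory.
check_maker_inputs() {
    run_dir=${1:-.}
    rc=0
    for f in maker_opts.ctl maker_bopts.ctl maker_exe.ctl; do
        [ -f "$run_dir/$f" ] || { echo "missing control file: $f" >&2; rc=1; }
    done
    [ -f "$run_dir/maker_opts.ctl" ] || return 1
    for key in genome est protein rmlib snaphmm; do
        path=$(ctl_value "$run_dir/maker_opts.ctl" "$key")
        [ -z "$path" ] && continue   # option left blank: nothing to check
        ( cd "$run_dir" && [ -f "$path" ] ) ||
            { echo "$key: file not found: $path" >&2; rc=1; }
    done
    return $rc
}
```

Running check_maker_inputs inside the maker_run directory prints one message per missing file and returns a non-zero status if anything is missing.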

Wait for the MASTER to advertise master status to the catalog server before you run WQ-MAKER on the WORKERS (see below)

$ cat log_file.txt

Genome file not specified. Using the specification in the CTL files.
STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
Contigs per split : 1
Contigs found : 12
Total number of input files : 12
STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
STATUS: Setting up database for any GFF3 input...
A data structure will be created for you at:
/home/upendra/maker_run/test_genome.maker.output/test_genome_datastore

To access files for individual sequences use the datastore index:
/home/upendra/maker_run/test_genome.maker.output/test_genome_master_datastore_index.log

Mon Jan 23 21:37:37 2017 :: Work Queue debug flags set :: all.
Mon Jan 23 21:37:37 2017 :: Work Queue listening on port 9155.
Mon Jan 23 21:37:37 2017 :: Work Queue project name set to maker_run_test.
Mon Jan 23 21:37:37 2017 :: Submitting file ./test_data/test_genome.fasta_000002 for processing.
Mon Jan 23 21:37:37 2017 :: Submitted task 1 for annotating ./test_data/test_genome.fasta_000002 with command:  maker -g ./test_data/test_genome.fasta_000002 -base test_genome 
Mon Jan 23 21:37:37 2017 :: Submitting file ./test_data/test_genome.fasta_000009 for processing.
Mon Jan 23 21:37:37 2017 :: Submitted task 2 for annotating ./test_data/test_genome.fasta_000009 with command:  maker -g ./test_data/test_genome.fasta_000009 -base test_genome 
Mon Jan 23 21:37:37 2017 :: Submitting file ./test_data/test_genome.fasta_000006 for processing.
Mon Jan 23 21:37:37 2017 :: Submitted task 3 for annotating ./test_data/test_genome.fasta_000006 with command:  maker -g ./test_data/test_genome.fasta_000006 -base test_genome 
Mon Jan 23 21:37:37 2017 :: Submitting file ./test_data/test_genome.fasta_000010 for processing.
Mon Jan 23 21:37:37 2017 :: Submitted task 4 for annotating ./test_data/test_genome.fasta_000010 with command:  maker -g ./test_data/test_genome.fasta_000010 -base test_genome 
Mon Jan 23 21:37:37 2017 :: Submitting file ./test_data/test_genome.fasta_000005 for processing.
Mon Jan 23 21:37:37 2017 :: Submitted task 5 for annotating ./test_data/test_genome.fasta_000005 with command:  maker -g ./test_data/test_genome.fasta_000005 -base test_genome 
Mon Jan 23 21:37:37 2017 :: Submitting file ./test_data/test_genome.fasta_000003 for processing.
Mon Jan 23 21:37:37 2017 :: Submitted task 6 for annotating ./test_data/test_genome.fasta_000003 with command:  maker -g ./test_data/test_genome.fasta_000003 -base test_genome 
Mon Jan 23 21:37:37 2017 :: Submitting file ./test_data/test_genome.fasta_000011 for processing.
Mon Jan 23 21:37:37 2017 :: Submitted task 7 for annotating ./test_data/test_genome.fasta_000011 with command:  maker -g ./test_data/test_genome.fasta_000011 -base test_genome

Mon Jan 23 21:37:37 2017 :: Submitting file ./test_data/test_genome.fasta_000004 for processing.
Mon Jan 23 21:37:37 2017 :: Submitted task 8 for annotating ./test_data/test_genome.fasta_000004 with command:  maker -g ./test_data/test_genome.fasta_000004 -base test_genome 
Mon Jan 23 21:37:37 2017 :: Submitting file ./test_data/test_genome.fasta_000007 for processing.
Mon Jan 23 21:37:37 2017 :: Submitted task 9 for annotating ./test_data/test_genome.fasta_000007 with command:  maker -g ./test_data/test_genome.fasta_000007 -base test_genome 
Mon Jan 23 21:37:37 2017 :: Submitting file ./test_data/test_genome.fasta_000000 for processing.
Mon Jan 23 21:37:37 2017 :: Submitted task 10 for annotating ./test_data/test_genome.fasta_000000 with command:  maker -g ./test_data/test_genome.fasta_000000 -base test_genome 
Mon Jan 23 21:37:37 2017 :: Submitting file ./test_data/test_genome.fasta_000001 for processing.
Mon Jan 23 21:37:37 2017 :: Submitted task 11 for annotating ./test_data/test_genome.fasta_000001 with command:  maker -g ./test_data/test_genome.fasta_000001 -base test_genome 
Mon Jan 23 21:37:37 2017 :: Submitting file ./test_data/test_genome.fasta_000008 for processing.
Mon Jan 23 21:37:37 2017 :: Submitted task 12 for annotating ./test_data/test_genome.fasta_000008 with command:  maker -g ./test_data/test_genome.fasta_000008 -base test_genome 
warning: this work queue master is visible to the public.
warning: you should set a password with the --password option.
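
Rather than re-reading the log by hand, the wait can be scripted. The sketch below polls log_file.txt for the "Work Queue listening on port" line seen in the sample log above. Treating that line as the signal that the MASTER is ready is an assumption, and the function name is hypothetical.

```shell
# Wait until the log shows that Work Queue is listening, or give up after
# TIMEOUT seconds (default 300). Usage: wait_for_master LOG_FILE [TIMEOUT]
wait_for_master() {
    log=$1
    remaining=${2:-300}
    while [ "$remaining" -gt 0 ]; do
        grep -q 'Work Queue listening on port' "$log" 2>/dev/null && return 0
        sleep 1
        remaining=$((remaining - 1))
    done
    # One final check after the last sleep.
    grep -q 'Work Queue listening on port' "$log" 2>/dev/null
}
```

For example, wait_for_master log_file.txt 120 returns success as soon as the line appears and failure if it has not appeared after two minutes.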

Once the MAKER run has started on the MASTER and your WORKERS are in the active state, start WQ-MAKER on the WORKERS with Ansible as follows.

Create a file ~/.ansible.cfg with the following contents to disable host key checking

[defaults]
host_key_checking = False

In your favorite text editor, create a maker-hosts file listing the IP addresses of all the WORKERS

[workers]
ip-address-1
ip-address-2
ip-address-3
ip-address-4
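
With many workers it may be easier to generate the inventory than to type it. A minimal sketch, assuming you already have the worker IP addresses at hand; the function name is hypothetical.

```shell
# Write an Ansible inventory with a single [workers] group.
# Usage: write_maker_hosts OUTPUT_FILE IP [IP...]
write_maker_hosts() {
    hosts_file=$1
    shift
    [ "$#" -ge 1 ] || { echo "no worker addresses given" >&2; return 1; }
    {
        echo '[workers]'
        printf '%s\n' "$@"   # one address per line
    } > "$hosts_file"
}
```

For example, write_maker_hosts maker-hosts followed by the four worker addresses produces a file in the layout shown above.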

Create an Ansible playbook called worker-launch.yml in your favorite text editor

---
- hosts : workers
  tasks :
  - name : Execute the script
    command : work_queue_worker -N maker_run_ud -s /home/upendra/ --debug-rotate-max=0 -d all -o worker.dbg

hosts is the name of the host group to run on (workers in this case; it can be any name, as long as it matches the group in maker-hosts)
tasks lists the tasks that Ansible performs (in this case, running work_queue_worker)
name is just a label for the task (it can be anything)
-N maker_run_ud sets the project name to maker_run_ud. This is mandatory for running WQ-MAKER and must match the project name given to wq_maker on the MASTER
-s /home/upendra/ Sets the location for creating the working directory of the worker
--debug-rotate-max=0 Sets the maximum size of the debug log (default 10M, 0 disables)
-d all Sets the debug flag for Work Queue. For all debugging output, try 'all'
-o worker.dbg Sets the debug file for Work Queue
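
Because the worker's -N value has to match the MASTER's, one option is to generate the playbook from the same project name you pass to wq_maker. This is a sketch under that assumption; the function name is hypothetical, and the worker options are the ones described above.

```shell
# Write the worker playbook for a given project name and working directory.
# Usage: write_worker_playbook PROJECT_NAME WORK_DIR [OUTPUT_FILE]
write_worker_playbook() {
    project=$1 work_dir=$2 playbook=${3:-worker-launch.yml}
    cat > "$playbook" <<EOF
---
- hosts: workers
  tasks:
  - name: Execute the script
    command: work_queue_worker -N $project -s $work_dir --debug-rotate-max=0 -d all -o worker.dbg
EOF
}
```

For example, write_worker_playbook maker_run_ud /home/upendra/ writes worker-launch.yml with the same command as the playbook shown above.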

And then run WQ-MAKER on the WORKERS

$ nohup ansible-playbook -u upendra -i maker-hosts worker-launch.yml &

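To check the status of the WQ-MAKER job, run work_queue_status with the project name you passed to -N (the sample output below comes from a run named maker_run_test).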

$ work_queue_status -M maker_run_test
PROJECT            HOST                   PORT WAITING RUNNING COMPLETE WORKERS 
maker_run_test     js-157-131.jetstream-  9155       8       4        0       4

Step 7. Stats output from MASTER instance

The log_file.txt file will tell you whether the job has finished.

$ cat log_file.txt  
Mon Jan 23 22:00:32 2017 :: Finished WQ task for tiers 3 with result 2.
Mon Jan 23 22:00:32 2017 :: Retrieved 1 so far.
Mon Jan 23 22:00:43 2017 :: Finished WQ task for tiers 1 with result 2.
Mon Jan 23 22:00:43 2017 :: Retrieved 2 so far.
Mon Jan 23 22:00:44 2017 :: Finished WQ task for tiers 2 with result 2.
Mon Jan 23 22:00:44 2017 :: Retrieved 3 so far.
Mon Jan 23 22:00:46 2017 :: Finished WQ task for tiers 4 with result 2.
Mon Jan 23 22:00:46 2017 :: Retrieved 4 so far.
Mon Jan 23 22:06:01 2017 :: Finished WQ task for tiers 7 with result 2.
Mon Jan 23 22:06:01 2017 :: Retrieved 5 so far.
Mon Jan 23 22:09:52 2017 :: Finished WQ task for tiers 5 with result 2.
Mon Jan 23 22:09:52 2017 :: Retrieved 6 so far.
Mon Jan 23 22:10:11 2017 :: Finished WQ task for tiers 8 with result 2.
Mon Jan 23 22:10:11 2017 :: Retrieved 7 so far.
Mon Jan 23 22:10:13 2017 :: Finished WQ task for tiers 6 with result 2.
Mon Jan 23 22:10:13 2017 :: Retrieved 8 so far.
Mon Jan 23 22:16:00 2017 :: Finished WQ task for tiers 9 with result 2.
Mon Jan 23 22:16:00 2017 :: Retrieved 9 so far.
Mon Jan 23 22:17:23 2017 :: Finished WQ task for tiers 12 with result 2.
Mon Jan 23 22:17:23 2017 :: Retrieved 10 so far.
Mon Jan 23 22:19:27 2017 :: Finished WQ task for tiers 10 with result 2.
Mon Jan 23 22:19:27 2017 :: Retrieved 11 so far.
Mon Jan 23 22:19:38 2017 :: Finished WQ task for tiers 11 with result 2.
Mon Jan 23 22:19:38 2017 :: Retrieved 12 so far.
STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
STATUS: Setting up database for any GFF3 input... 
A data structure will be created for you at:
/home/upendra/maker_run/test_genome.maker.output/test_genome_datastore

To access files for individual sequences use the datastore index:
/home/upendra/maker_run/test_genome.maker.output/test_genome_master_datastore_index.log

Mon Jan 23 22:23:25 2017 :: File test_genome annotated :: 12 in total 
-----------------------------------------------------------------
Type    Success Failure Abandon Total
Tasks   12      0       0       12
-----------------------------------------------------------------
Workers:        Joined  Removed Idled-Out       Lost
4               4       0       0               0
-----------------------------------------------------------------
Work Queue Wall Time:   0d 0:45:47.980987
Cumulative Task Wall Time:      0d 2:30:11.787289
Cumulative Task Good Execute Time:      0d 0:00:00.000000
Work Queue Send Time:   0d 0:00:00.852605
Work Queue Receive Time:        0d 0:00:00.274677
-----------------------------------------------------------------
Mon Jan 23 22:23:25 2017 :: MPI used :: Cores 1 :: Memory 0 :: Disk 0 
-----------------------------------------------------------------
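
Progress can also be read from the log without scrolling through it. The sketch below relies on the "Retrieved N so far" and "File ... annotated" lines shown in the sample log. Treating the "annotated" line as the completion marker is an assumption drawn from that sample, and the function names are hypothetical.

```shell
# Print the most recent "Retrieved N so far" count from the log (0 if none).
tasks_retrieved() {
    n=$(sed -n 's/.*:: Retrieved \([0-9][0-9]*\) so far\..*/\1/p' "$1" | tail -n 1)
    echo "${n:-0}"
}

# Succeeds once the log contains the final "File ... annotated" summary line.
run_finished() {
    grep -q ':: File .* annotated ::' "$1"
}
```

For example, tasks_retrieved log_file.txt prints how many tasks have come back so far, and run_finished log_file.txt can be used as the condition of a polling loop.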

The following are the output files from WQ-MAKER

$ ls test_genome.maker.output
maker_bopts.log  maker_exe.log  maker_opts.log  mpi_blastdb  test_genome_datastore  test_genome_master_datastore_index.log

The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER.
The mpi_blastdb directory contains FASTA indexes and BLAST database files created from the input EST, protein, and repeat databases.
test_genome_master_datastore_index.log contains information on both the run status of individual contigs and information on where individual contig data is stored.
The test_genome_datastore directory contains a set of subfolders, each containing the final MAKER output for individual contigs from the genomic fasta file.

Check the test_genome_master_datastore_index.log and task_outputs.txt to see if there were any failures:

$ cd test_genome.maker.output
$ cat test_genome_master_datastore_index.log
Chr1    test_genome_datastore/41/30/Chr1/       STARTED
Chr10   test_genome_datastore/7C/72/Chr10/      STARTED
Chr11   test_genome_datastore/1E/AA/Chr11/      STARTED
Chr12   test_genome_datastore/1B/FA/Chr12/      STARTED
Chr2    test_genome_datastore/E9/36/Chr2/       STARTED
Chr3    test_genome_datastore/CC/EF/Chr3/       STARTED
Chr4    test_genome_datastore/A3/11/Chr4/       STARTED
Chr5    test_genome_datastore/8A/9B/Chr5/       STARTED
Chr6    test_genome_datastore/13/44/Chr6/       STARTED
Chr7    test_genome_datastore/91/B7/Chr7/       STARTED
Chr8    test_genome_datastore/9A/9E/Chr8/       STARTED
Chr9    test_genome_datastore/87/90/Chr9/       STARTED
Chr1    test_genome_datastore/41/30/Chr1/       FINISHED
Chr10   test_genome_datastore/7C/72/Chr10/      FINISHED
Chr11   test_genome_datastore/1E/AA/Chr11/      FINISHED
Chr12   test_genome_datastore/1B/FA/Chr12/      FINISHED
Chr2    test_genome_datastore/E9/36/Chr2/       FINISHED
Chr3    test_genome_datastore/CC/EF/Chr3/       FINISHED
Chr4    test_genome_datastore/A3/11/Chr4/       FINISHED
Chr5    test_genome_datastore/8A/9B/Chr5/       FINISHED
Chr6    test_genome_datastore/13/44/Chr6/       FINISHED
Chr7    test_genome_datastore/91/B7/Chr7/       FINISHED
Chr8    test_genome_datastore/9A/9E/Chr8/       FINISHED
Chr9    test_genome_datastore/87/90/Chr9/       FINISHED

All completed. Other possible status entries include:

FAILED - indicates a failed run on this contig, MAKER will retry these
RETRY - indicates that MAKER is retrying a contig that failed
SKIPPED_SMALL - indicates the contig was too short to annotate (minimum contig length is specified in maker_opts.ctl)
DIED_SKIPPED_PERMANENT - indicates a failed contig that MAKER will not attempt to retry (number of times to retry a contig is specified in maker_opts.ctl)
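
For a genome with many contigs, scanning the index log by eye is impractical. The sketch below lists every contig whose most recent entry is anything other than FINISHED. It assumes the whitespace-separated three-column layout shown above (contig, path, status); the function name is hypothetical.

```shell
# Print each contig whose latest status in the datastore index log is not
# FINISHED, followed by a tab and that status.
unfinished_contigs() {
    awk '{ last[$1] = $NF }   # later lines overwrite earlier ones
         END { for (c in last) if (last[c] != "FINISHED") print c "\t" last[c] }' "$1" | sort
}
```

For the example run, unfinished_contigs test_genome_master_datastore_index.log prints nothing, because every contig ends with a FINISHED entry.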

The actual output data is stored under test_genome_datastore in a nested directory structure.

A typical set of outputs for a contig looks like this:

$ ls test_genome_datastore/*/*/*
test_genome_datastore/13/44/Chr6:
Chr6.gff                                                Chr6.maker.proteins.fasta                 Chr6.maker.transcripts.fasta
Chr6.maker.non_overlapping_ab_initio.proteins.fasta     Chr6.maker.snap_masked.proteins.fasta     run.log
Chr6.maker.non_overlapping_ab_initio.transcripts.fasta  Chr6.maker.snap_masked.transcripts.fasta  theVoid.Chr6

The Chr6.gff file is in GFF3 format and contains the MAKER gene models and the underlying evidence, such as repeat regions, alignment data, and ab initio gene predictions, as well as the fasta sequence. Having all of these data in one file makes it possible to visualize the called gene models together with their evidence, especially in tools like Apollo, which support manual editing and curation of gene models.
The fasta files Chr6.maker.proteins.fasta and Chr6.maker.transcripts.fasta contain the protein and transcript sequences for the final MAKER gene calls.
The Chr6.maker.non_overlapping_ab_initio.proteins.fasta and Chr6.maker.non_overlapping_ab_initio.transcripts.fasta files contain ab initio models that do not overlap any MAKER gene and were rejected for lack of support.
The Chr6.maker.snap_masked.proteins.fasta and Chr6.maker.snap_masked.transcripts.fasta files contain the initial SNAP-predicted models that were not further processed by MAKER.

The output directory theVoid.Chr6 contains raw output data from all of the pipeline steps. One useful file found here is the repeat-masked version of the contig, query.masked.fasta.
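
Because every contig has its own subfolder, locating one kind of output file across the whole run means combining the directory column of the index log with the contig name. The following Python sketch shows one way to do that. It assumes the file naming seen in the Chr6 listing (contig name followed by a suffix such as .maker.proteins.fasta). The function name per_contig_files is hypothetical, and not every contig necessarily produces every file, so check that a path exists before opening it.

```python
import os


def per_contig_files(log_text, suffix, base_dir=""):
    """Build the expected path of one output file per FINISHED contig.

    log_text : contents of the datastore index log
    suffix   : e.g. ".maker.proteins.fasta" (naming taken from the listing)
    base_dir : directory that contains test_genome_datastore
    """
    paths = []
    seen = set()
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        contig, rel_dir, status = fields
        if status == "FINISHED" and contig not in seen:
            seen.add(contig)
            paths.append(os.path.join(base_dir, rel_dir, contig + suffix))
    return paths


sample_log = (
    "Chr6\ttest_genome_datastore/13/44/Chr6/\tSTARTED\n"
    "Chr6\ttest_genome_datastore/13/44/Chr6/\tFINISHED\n"
)

protein_files = per_contig_files(sample_log, ".maker.proteins.fasta")
for path in protein_files:
    # Only report files that are really present on disk.
    state = "found" if os.path.exists(path) else "not found"
    print(path, state)
```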

Step 8: Merge the gff files

$ gff3_merge -n -d test_genome_master_datastore_index.log

-d The location of the MAKER datastore index log file.
-n Do not print fasta sequence in footer

By default, the base name of the gff3_merge output is test_genome.all, but you can choose an alternate base name for the output files with the "-o" option.

The final output from gff3_merge is "test_genome.all.gff"

##gff-version 3
Chr6    maker   gene    43764   46139   .   -   .   ID=maker-Chr6-snap-gene-0.3;Name=maker-Chr6-snap-gene-0.3
Chr6    maker   mRNA    43764   46139   .   -   .   ID=maker-Chr6-snap-gene-0.3-mRNA-1;Parent=maker-Chr6-snap-gene-0.3;Name=maker-Chr6-snap-gene-0.3-mRNA-1;_AED=0.12;_eAED=0.50;_QI=64|0|0|1|0|0.33|3|0|76
Chr6    maker   exon    43764   43846   .   -   .   ID=maker-Chr6-snap-gene-0.3-mRNA-1:exon:2;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
Chr6    maker   exon    44833   44896   .   -   .   ID=maker-Chr6-snap-gene-0.3-mRNA-1:exon:1;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
Chr6    maker   exon    45992   46139   .   -   .   ID=maker-Chr6-snap-gene-0.3-mRNA-1:exon:0;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
Chr6    maker   five_prime_UTR  46076   46139   .   -   .   ID=maker-Chr6-snap-gene-0.3-mRNA-1:five_prime_utr;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
Chr6    maker   CDS 45992   46075   .   -   0   ID=maker-Chr6-snap-gene-0.3-mRNA-1:cds;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
Chr6    maker   CDS 44833   44896   .   -   0   ID=maker-Chr6-snap-gene-0.3-mRNA-1:cds;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
Chr6    maker   CDS 43764   43846   .   -   2   ID=maker-Chr6-snap-gene-0.3-mRNA-1:cds;Parent=maker-Chr6-snap-gene-0.3-mRNA-1
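
The mRNA rows carry MAKER's quality values (_AED, _eAED, _QI) in the ninth column. The following Python sketch shows how those attributes could be pulled out of the merged file. It assumes standard tab-delimited GFF3 with semicolon-separated key=value attributes. The function names are illustrative and not part of MAKER.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a {key: value} dictionary."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def mrna_aed(gff_lines):
    """Return {mRNA ID: AED score} for every mRNA row that has an _AED tag."""
    scores = {}
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section, present when -n is not used
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "_AED" in attrs:
            scores[attrs["ID"]] = float(attrs["_AED"])
    return scores


# The mRNA row from the example above, rebuilt with tab separators.
example = "\t".join([
    "Chr6", "maker", "mRNA", "43764", "46139", ".", "-", ".",
    "ID=maker-Chr6-snap-gene-0.3-mRNA-1;Parent=maker-Chr6-snap-gene-0.3;"
    "Name=maker-Chr6-snap-gene-0.3-mRNA-1;_AED=0.12;_eAED=0.50;"
    "_QI=64|0|0|1|0|0.33|3|0|76",
])

scores = mrna_aed(["##gff-version 3\n", example + "\n"])
print(scores)
```

To run it on the real file, pass an open file handle: mrna_aed(open("test_genome.all.gff")).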

We have noticed that some scaffolds are occasionally missing from the annotation output, and we are currently investigating why. In the meantime, you can check your current run for missing scaffolds with the script below. If any are missing, the script extracts them from the genome file and launches a wq_maker run on the MASTER; once the MASTER is ready, you need to run wq_maker on the WORKERS (same as before).

$ cd test_genome.maker.output
$ wget http://de.cyverse.org/dl/d/5A790A05-1E39-4381-AFB7-30A839181B50/missing_sequences_maker_run.sh
 
$ sh missing_sequences_maker_run.sh -g test_genome.all.gff -r ../test_data/test_genome.fasta -i test_data
No missing scaffolds found. WQ-MAKER has run completely

-g name of the merged gff file generated
-r <path/to/genome fasta file>
-i name of the original input folder (do not specify full path)
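
The internals of missing_sequences_maker_run.sh are not shown here, but the idea of the check can be illustrated independently. The following Python sketch is an assumption about one way to do it, not the script's actual logic: it treats a scaffold as missing when its name appears as a header in the genome fasta file but in no row of the merged gff file. All function names are hypothetical.

```python
def fasta_ids(fasta_lines):
    """Sequence names: the first word after '>' on each header line."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def gff_seqids(gff_lines):
    """Set of names found in column 1 of the GFF3 feature rows."""
    seqids = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # stop before any embedded sequence section
        if line.startswith("#") or not line.strip():
            continue
        seqids.add(line.split("\t", 1)[0])
    return seqids


def missing_scaffolds(fasta_lines, gff_lines):
    """Scaffolds present in the genome but absent from the GFF, in order."""
    annotated = gff_seqids(gff_lines)
    return [name for name in fasta_ids(fasta_lines) if name not in annotated]


# Tiny made-up inputs for illustration only.
genome = [">Chr1 example\n", "ACGT\n", ">Chr2\n", "GGCC\n"]
merged = ["##gff-version 3\n", "Chr1\tmaker\tgene\t1\t4\t.\t+\t.\tID=g1\n"]

missing = missing_scaffolds(genome, merged)
print(missing or "No missing scaffolds found")
```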

Moving data from CyVerse Datastore using iCommands

iCommands is a collection of commands for Linux and Mac OS operating systems that are used in the iRODS system to interact with the CyVerse Data Store. Many commands are very similar to Unix utilities. For example, to list files and directories, in Linux you use ls, but in iCommands you use ils.

While iCommands are great for all transfers and for automating tasks via scripts, they are the best choice for large files (2-100 GB each) and for bulk file transfers (many small files). For a comparison of the different methods of uploading and downloading data items, see Downloading and Uploading Data.

iCommands can be used by CyVerse account users to download files that have been shared by other users and to upload files to the Data Store, as well as add metadata, change permissions, and more. Commonly used iCommands are listed below. Follow the instructions on Setting Up iCommands for how to download and configure iCommands for your operating system.

A CyVerse account is not required to download a public data file via iCommands. To see instructions just for public data download with iCommands, see the iCommands section on Downloading Data Files Without a User Account.

Before you begin, you may want to watch a CyVerse video about iCommands.

For configuring iCommands, and for the different commands that can be used to move data in and out of the Data Store, please refer to this link.

You can use a script to back up data from an instance to the Data Store. Follow the instructions here: https://pods.iplantcollaborative.org/wiki/display/atmman/Backing+Up+and+Restoring+Your+Data+to+the+Data+Store

Performance Benchmarking of WQ-MAKER

Using the tutorial data in the WQ-MAKER image, performance was benchmarked with one MASTER and several WORKERS. All instances were medium1 (4 CPUs, 8 GB memory, 80 GB root), and each run used 1 core.

Benchmark run	Data used for benchmarking	Number of cores	Number of workers	Time to completion (Mins)
1	12 Chromosomes, 100K bases per chromosome	1	3	43
2	12 Chromosomes, 100K bases per chromosome	1	4	38
3	12 Chromosomes, 200K bases per chromosome	1	5	31
4	12 Chromosomes, 1M bases per chromosome	1	12	24

As the number of workers increases, the time taken to finish the job decreases. Using more than 12 workers will not improve performance further here, because the tutorial run has only 12 sequences to annotate. Use the benchmarks above to estimate the resources you would need for your project.
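
The 12-worker ceiling follows from simple counting. The following Python sketch assumes a deliberately simplified model in which each worker annotates one sequence at a time and all sequences take about the same time. It does not predict the timings in the table; it only shows how many workers can be busy and how many rounds of work are needed.

```python
import math


def busy_workers(n_sequences, n_workers):
    """Workers that can have a sequence to process at the same time."""
    return min(n_sequences, n_workers)


def rounds_needed(n_sequences, n_workers):
    """Rounds of work if every worker handles one sequence per round."""
    return math.ceil(n_sequences / n_workers)


N_SEQUENCES = 12  # the tutorial genome has 12 chromosomes

for workers in (3, 4, 5, 12, 16):
    print(
        workers, "workers:",
        busy_workers(N_SEQUENCES, workers), "busy,",
        rounds_needed(N_SEQUENCES, workers), "round(s)",
    )
```

Under this model, going from 12 to 16 workers leaves the extra four idle and the number of rounds stays at one.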

Note

Please contact upendra@cyverse.org if you need help determining the optimal amount of resources to annotate your genome.

Requesting Additional Resources on Jetstream

Log in to Jetstream and navigate to the Dashboard
Find the "Resources in Use" figure and select the link for "Need more?"
Request the resources you need
1. Resources that can be requested
  1. Increased maximum
2. Benchmarking can help to determine what you need
Need even more resources? Consider submitting an XSEDE Allocation Proposal

1 Learning Materials

MAKER 2.31.8 with CCTOOLS Jetstream Tutorial
