Training ab initio Gene Predictors for MAKER genome annotation
Rationale:
If you are involved in a genome project for an emerging model organism, you should already have an EST database, or more likely now mRNA-Seq data, which would have been generated as part of the original sequencing project. A protein database can be collected from closely related organism genome databases or by using the UniProt/SwissProt protein database or the NCBI NR protein database. However, a trained ab initio gene predictor is a much more difficult thing to generate. Gene predictors require existing gene models on which to base prediction parameters. However, with emerging model organisms you are not likely to have any pre-existing gene models. So how then are you supposed to train your gene prediction programs?
Ab initio gene predictors perform much better when they have been trained for a particular genome, and those used by Maker are no exception. However, a reliable training set is not always easily accessible, especially for non-model species. In this situation, the Maker documentation suggests running Maker to generate an initial set of predictions using parameters trained for a related species, then using those predictions as the basis for training and subsequent annotation runs (Maker has an automated process for iterative training and re-annotation). MAKER gives the user the option to produce gene annotations directly from the EST evidence. You can then use these imperfect gene models to train gene predictor program. Once you have re-run WQ-MAKER with the newly trained gene predictor, you can use the second set of gene annotations to train the gene predictors yet again. This boot-strap process allows you to iteratively improve the performance of ab initio gene predictors.
SNAP training:
SNAP, gene prediction uses splicing information from different species to find transcript and coding sequences within a genome assembly.
The following steps assumes that you have run the test data using WQ-MAKER
Make sure you have the final merged file in the working directory - test_genome.all.gff. If not go back to the main WQ-MAKER tutorial
$ cd test_genome.maker.output $ ls maker_bopts.log maker_exe.log maker_opts.log mpi_blastdb test_genome.all.gff test_genome_datastore test_genome_master_datastore_index.log
Step 1: Run SNAP gene prediction (Round 1)
1.1 To train SNAP, we need to convert the GFF3 gene models to ZFF format. To do this we need to collect all GFF3 files into a single directory.
$ mkdir snap1 && cd snap1 $ maker2zff ../test_genome.all.gff
There should now be two new files. The first is the ZFF format file and the second is a FASTA file the coordinates can be referenced against. These will be used to train SNAP.
$ ls genome.ann genome.dna
1.2 The basic steps for training SNAP are first to filter the input gene models, then capture genomic sequence immediately surrounding each model locus, and finally uses those captured segments to produce the HMM. You can explore the internal SNAP documentation for more details if you wish.
$ fathom -categorize 1000 genome.ann genome.dna $ fathom -export 1000 -plus uni.ann uni.dna $ forge export.ann export.dna $ hmm-assembler.pl test_snap1 . > test_snap1.hmm $ cd ../../
1.3 The final training parameters file is test_snap1.hmm. We do not expect SNAP to perform that well with this training file because it is based on incomplete gene models; however, this file is a good starting point for further training. We need to run MAKER again with the new HMM file we just built for SNAP. Edit the maker_opt.ctl file to include the snap1 hmm file
$ nano maker_opts.ctl #-----Gene Prediction snaphmm=./test_genome.maker.output/snap1/test_snap1.hmm #SNAP HMM file <= Insert the name of the hmm file here gmhmm= #GeneMark HMM file augustus_species= #Augustus gene prediction species model fgenesh_par_file= #FGENESH parameter file pred_gff= #ab-initio predictions from an external GFF3 file model_gff= #annotated gene models from an external GFF3 file (annotation pass-through) est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no <= Change this to 0 protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no <= Change this to 0 unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
1.4 Run WQ-MAKER
Before running MAKER, check to make sure all worker instances have become active.
On the MASTER instance, make sure you are in the "maker_run" directory and all of your files are in place and then run:
$ nohup wq_maker -contigs-per-split 1 -cores 1 -memory 2048 -disk 4096 -N wq_test_${USER} -d all -o master.dbg -debug_size_limit=0 -stats test_out_stats.txt -base test_genome_snap1 > log_file.txt 2>&1 &
Wait for the MASTER to advertise master status to the catalog server before your run WQ-MAKER on the WORKERS (see below).
Once the log_file show the above output and once your WORKERS are in active state, then either ssh into or use webshell into each of the WORKERS and then run
$ nohup work_queue_worker -N wq_test_${USER} --cores all --debug-rotate-max=0 -d all -o worker.dbg &
Advanced users
$ nohup ansible-playbook -u ${USER} -i maker-hosts worker-launch.yml > log_file_2.txt 2>&1 &
To check the status of the WQ-MAKER job, run the following.
$ work_queue_status -M wq_test_${USER} PROJECT HOST PORT WAITING RUNNING COMPLETE WORKERS wq_test_ud js-157-131.jetstream- 9155 0 12 0 2
The log_file.txt will tell you if the job has been finished or not. The following are the output files from WQ-MAKER
$ ls test_genome.maker.output maker_bopts.log maker_opts.log seen.dbm test_genome_snap1.all.gff test_genome_snap1_master_datastore_index.log maker_exe.log mpi_blastdb snap2 test_genome_snap1_datastore
Step 2: Run SNAP gene prediction (Round 2)
Let's retrain SNAP, and run MAKER again.
$ cd test_genome_snap1.maker.output $ ls maker_exe.ctl maker_opts.ctl task_outputs.txt test_genome.maker.output test_out_stats.txt maker_bopts.ctl master.dbg test_data test_genome_snap1.maker.output
Merging the gff files.
$ gff3_merge -d test_genome_snap1_master_datastore_index.log
Except to see the final merged file in the working directory - test_genome_snap1.all.gff
2.1 Run SNAP gene prediction
To train SNAP, we need to convert the GFF3 gene models to ZFF format. To do this we need to collect all GFF3 files into a single directory.
$ mkdir snap2 && cd snap2 $ maker2zff ../test_genome_snap1.all.gff
There should now be two new files (genome.ann and genome.dna). The first is the ZFF format file and the second is a FASTA file the coordinates can be referenced against. These will be used to train SNAP.
2.2 The basic steps for training SNAP are the same as before
$ fathom -categorize 1000 genome.ann genome.dna $ fathom -export 1000 -plus uni.ann uni.dna $ forge export.ann export.dna $ hmm-assembler.pl test_snap2 . > test_snap2.hmm $ cd ../../
The final training parameters file is test_snap2.hmm. SNAP is now trained and now we will train the Augustus using BUSCO (see below) and then run final WQ-MAKER. If you don't want to run Augustus with SNAP, then you can run WQ-MAKER using test_snap2.hmm file (follow steps 2.3 and 2.4 and replace the test_snap1.hmm with test_snap2.hmm and give a different name to the output directory (-base test_genome_snap2)). Finally, merge to generate final gff file (Add the option `-n` this time).
Augustus training using BUSCO (Optional but recommended):
BUSCO (Benchmarking UniversalSingle-Copy Orthologs) is a tool that provides measures for quantitative assessment of genome assembly, gene set, and transcriptome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB. BUSCO assessments are implemented in open-source software, with comprehensive lineage-specific sets of Benchmarking Universal Single-Copy Orthologs for arthropods, vertebrates, metazoans, fungi, eukaryotes, and bacteria. These conserved orthologs are ideal candidates for large-scale phylogenomics studies, and the annotated BUSCO gene models built during genome assessments provide a comprehensive gene predictor training set for use as part of genome annotation pipelines. BUSCO assessments offer intuitive metrics, based on evolutionarily informed expectations of gene content from hundreds of species, to gauge completeness of rapidly accumulating genomic data and satisfy an Iberian's quest for quality - "Busco calidad/qualidade". The software is freely available to download at (http://busco.ezlab.org/).
Step 1: Run BUSCO Using Docker image (Takes a really long time)
- We have already installed the latest version of BUSCO Docker image in the Jetstream instance. The version that is installed is v3.0 which is the latest version as of 20th Sept 2017.
Create an Augustus directory in the main project folder
$ cd /vol_b/wq_maker_run $ mkdir augustus && cd augustus
Download the lineage file specific for your species of interest. All the lineage data can be found and downloaded from here. Since the test data is from Rice, here we will be downloading the embrophyta dataset
$ wget http://busco.ezlab.org/v2/datasets/embryophyta_odb9.tar.gz $ tar xvf embryophyta_odb9.tar.gz
Copy the test data to this folder and run BUSCO to generate Augustus gene models
$ cp ../test_data/test_genome.fasta .
$ ezd # to update the Docker on your instance $ run_BUSCO -h usage: python BUSCO.py -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS] Welcome to BUSCO 3.0.2: the Benchmarking Universal Single-Copy Ortholog assessment tool. For more detailed usage information, please review the README file provided with this distribution and the BUSCO user guide. optional arguments: -i FASTA FILE, --in FASTA FILE Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set. -c N, --cpu N Specify the number (N=integer) of threads/cores to use. -o OUTPUT, --out OUTPUT Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. WARNING: do not provide a path -e N, --evalue N E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03) -m MODE, --mode MODE Specify which BUSCO analysis mode to run. There are three valid modes: - geno or genome, for genome assemblies (DNA) - tran or transcriptome, for transcriptome assemblies (DNA) - prot or proteins, for annotated gene sets (protein) -l LINEAGE, --lineage_path LINEAGE Specify location of the BUSCO lineage data to be used. Visit http://busco.ezlab.org for available lineages. -f, --force Force rewriting of existing files. Must be used when output files with the provided name already exist. -r, --restart Restart an uncompleted run. Not available for the protein mode -sp SPECIES, --species SPECIES Name of existing Augustus species gene finding parameters. See Augustus documentation for available options. --augustus_parameters AUGUSTUS_PARAMETERS Additional parameters for the fine-tuning of Augustus run. For the species, do not use this option. Use single quotes as follow: '--param1=1 --param2=2', see Augustus documentation for available options. -t PATH, --tmp_path PATH Where to store temporary files (Default: ./tmp/) --limit REGION_LIMIT How many candidate regions (contig or transcript) to consider per BUSCO (default: 3) --long Optimization mode Augustus self-training (Default: Off) adds considerably to the run time, but can improve results for some non-model organisms -q, --quiet Disable the info logs, displays only errors -z, --tarzip Tarzip the output folders likely to contain thousands of files --blast_single_core Force tblastn to run on a single core and ignore the --cpu argument for this step only. Useful if inconsistencies when using multiple threads are noticed -v, --version Show this version and exit -h, --help Show this help message and exit
run_BUSCO -i test_genome.fasta -l embryophyta_odb9 -o run_example -m genome -sp rice -f > log_file.txt 2>&1 &
Once the BUSCO run is finished, rename the retraining_parameters folder which is inside the output folder based on the name of the files inside and change the permissions of the folder
$ cd run_example/augustus_output $ ls retraining_parameters # The ID - 3891028551 change from run to run. So check your run's id BUSCO_augustus_out_3891028551_exon_probs.pbl BUSCO_augustus_out_3891028551_metapars.cfg BUSCO_augustus_out_3891028551_parameters.cfg BUSCO_augustus_out_3891028551_igenic_probs.pbl BUSCO_augustus_out_3891028551_metapars.cgp.cfg BUSCO_augustus_out_3891028551_weightmatrix.txt BUSCO_augustus_out_3891028551_intron_probs.pbl BUSCO_augustus_out_3891028551_metapars.utr.cfg # Rename the directory based on the prefix. For example $ sudo mv retraining_parameters BUSCO_augustus_out_3891028551 # Change the permissions on the renamed folder $ sudo chown -hR ${USER} BUSCO_augustus_out_3891028551 $ sudo chgrp -hR ${USER} BUSCO_augustus_out_3891028551
Copy the augustus config file /opt folder to your current directory and the renamed augustus folder into the species folder. Then continue to Step 2.
$ cd /vol_b/wq_maker_run $ cp -r /opt/augustus/config . $ sudo mv augustus/run_example/augustus_output/BUSCO_augustus_out_3891028551 config/species
Step 2: Adjust the parameters in the maker_opts.ctl file
$ nano maker_opts.ctl #-----Gene Prediction snaphmm=test_genome_snap1.maker.output/snap2/test_snap2.hmm #SNAP HMM file <= Insert the name of the hmm file here gmhmm= #GeneMark HMM file augustus_species=BUSCO_augustus_out_4208967406 #Augustus gene prediction species model fgenesh_par_file= #FGENESH parameter file pred_gff= #ab-initio predictions from an external GFF3 file model_gff= #annotated gene models from an external GFF3 file (annotation pass-through) est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no <= Change this to 0 protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no <= Change this to 0 unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
Step 3: Run WQ-MAKER
Before running MAKER, check to make sure all worker instances have become active.
On the MASTER instance, make sure you are in the "maker_run" directory and all of your files are in place and then run:
$ nohup wq_maker -contigs-per-split 1 -cores 1 -memory 2048 -disk 4096 -N wq_test_${USER} -d all -o master.dbg -debug_size_limit=0 -stats test_out_stats.txt -base test_genome_snap2_augustus > log_file.txt 2>&1 &
Wait for the MASTER to advertise master status to the catalog server before your run WQ-MAKER on the WORKERS (see below).
Once the log_file show the above output and once your WORKERS are in active state, then either ssh into or use webshell into each of the WORKERS and then run
$ nohup work_queue_worker -N wq_test_${USER} --cores all --debug-rotate-max=0 -d all -o worker.dbg &
Advanced users
$ nohup ansible-playbook -u ${USER} -i maker-hosts worker-launch.yml > log_file_2.txt 2>&1 &
Once the job is finished merge gff file
$ cd test_genome_snap2_augustus.maker.output $ gff3_merge -n -d test_genome_snap2_augustus_master_datastore_index.log # All gene models $ gff3_merge -g -n -d test_genome_snap2_augustus_master_datastore_index.log -o test_genome_snap2_augustus.maker.gff # Maker gene models only
This marks the completion of gene annoation. When you are done, look at one of the larger contigs in a viewer like apollo and compare the raw augustus calls, raw snap calls, and the evidence aware augustus and snap calls produced by maker. If SNAP and augustus are properly trained then they will produce similar calls, and they will also be similar to the evidence aware calls from MAKER (this convergence is the result of the training). If one predictor seems to produce calls that are still very divergent, then just drop that predictor from the analysis. A bad predictor will make all results worse