Copy of Training ab initio Gene Predictors for MAKER-P genome annotation

Rationale:

If you are involved in a genome project for an emerging model organism, you should already have an EST database, or more likely now mRANA-Seq data, which would have been generated as part of the original sequencing project. A protein database can be collected from closely related organism genome databases or by using the UniProt/SwissProt protein database or the NCBI NR protein database. However a trained ab initio gene predictor is a much more difficult thing to generate. Gene predictors require existing gene models on which to base prediction parameters. However, with emerging model organisms you are not likely to have any pre-existing gene models. So how then are you supposed to train your gene prediction programs?

Ab initio gene predictors perform much better when they have been trained for a particular genome, and those used by Maker are no exception. However, a reliable training set is not always easily accessible, especially for non-model species. In this situation the Maker documentation suggests running Maker to generate an initial set of predictions using parameters trained for a related species, then using those predictions as the basis for training and subsequent annotation runs (Maker has an automated process for iterative training and re-annotation). MAKER gives the user the option to produce gene annotations directly from the EST evidence. You can then use these imperfect gene models to train gene predictor program. Once you have re-run MAKER with the newly trained gene predictor, you can use the second set of gene annotations to train the gene predictors yet again. This boot-strap process allows you to iteratively improve the performance of ab initio gene predictors.

SNAP training:

SNAP, gene prediction uses splicing information from different species to find transcript and coding sequences within a genome assembly.

The following steps assumes that you have run the test data using Copy of MAKER-P_2.31.9 Atmosphere Tutorial

Make sure you have the final merged file in the working directory - test_genome.all.gff. If not go back to the main Copy of MAKER-P_2.31.9 Atmosphere Tutorial tutorial

$ cd test_genome.maker.output
$ ls
maker_bopts.log  maker_exe.log  maker_opts.log  mpi_blastdb  test_genome.all.gff  test_genome_datastore  test_genome_master_datastore_index.log

Step 1: Run SNAP gene prediction (Round 1)

1.1 To train SNAP, we need to convert the GFF3 gene models to ZFF format. To do this we need to collect all GFF3 files into a single directory.

$ mkdir snap1 && cd snap1
$ maker2zff ../test_genome.all.gff

There should now be two new files. The first is the ZFF format file and the second is a FASTA file the coordinates can be referenced against. These will be used to train SNAP.

$ ls
genome.ann genome.dna

1.2 The basic steps for training SNAP are first to filter the input gene models, then capture genomic sequence immediately surrounding each model locus, and finally uses those captured segments to produce the HMM. You can explore the internal SNAP documentation for more details if you wish.

$ sh /opt/misc_scripts/fathom1.sh
$ cd ../../

1.3 The final training parameters file is test_snap1.hmm. We do not expect SNAP to perform that well with this training file because it is based on incomplete gene models; however, this file is a good starting point for further training. We need to run MAKER again with the new HMM file we just built for SNAP. Edit the maker_opt.ctl file to include the snap1 hmm file

$ nano maker_opts.ctl
 
#-----Gene Prediction
snaphmm=./test_genome.maker.output/snap1/test_snap1.hmm #SNAP HMM file <= Insert the name of the hmm file here
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no <= Change this to 0
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no <= Change this to 0
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

1.4 Run MAKER-P

$ nohup mpiexec -n 4 maker -base test_genome_snap1 &

The log_file.txt will tell you if the job has been finished or not. The following are the output files from WQ-MAKER

$ ls test_genome_snap1.maker.output/
maker_bopts.log  maker_exe.log  maker_opts.log  mpi_blastdb  seen.dbm  test_genome_snap1_datastore  test_genome_snap1.db  test_genome_snap1_master_datastore_index.log

cat test_genome_snap1.maker.output/test_genome_snap1_master_datastore_index.log 
Chr1	test_genome_snap1_datastore/41/30/Chr1/	STARTED
Chr10	test_genome_snap1_datastore/7C/72/Chr10/	STARTED
Chr11	test_genome_snap1_datastore/1E/AA/Chr11/	STARTED
Chr10	test_genome_snap1_datastore/7C/72/Chr10/	FINISHED
Chr12	test_genome_snap1_datastore/1B/FA/Chr12/	STARTED
Chr11	test_genome_snap1_datastore/1E/AA/Chr11/	FINISHED
Chr2	test_genome_snap1_datastore/E9/36/Chr2/	STARTED
Chr1	test_genome_snap1_datastore/41/30/Chr1/	FINISHED
Chr3	test_genome_snap1_datastore/CC/EF/Chr3/	STARTED
Chr12	test_genome_snap1_datastore/1B/FA/Chr12/	FINISHED
Chr4	test_genome_snap1_datastore/A3/11/Chr4/	STARTED
Chr3	test_genome_snap1_datastore/CC/EF/Chr3/	FINISHED
Chr5	test_genome_snap1_datastore/8A/9B/Chr5/	STARTED
Chr2	test_genome_snap1_datastore/E9/36/Chr2/	FINISHED
Chr6	test_genome_snap1_datastore/13/44/Chr6/	STARTED
Chr6	test_genome_snap1_datastore/13/44/Chr6/	FINISHED
Chr7	test_genome_snap1_datastore/91/B7/Chr7/	STARTED
Chr4	test_genome_snap1_datastore/A3/11/Chr4/	FINISHED
Chr8	test_genome_snap1_datastore/9A/9E/Chr8/	STARTED
Chr5	test_genome_snap1_datastore/8A/9B/Chr5/	FINISHED
Chr9	test_genome_snap1_datastore/87/90/Chr9/	STARTED
Chr8	test_genome_snap1_datastore/9A/9E/Chr8/	FINISHED
Chr9	test_genome_snap1_datastore/87/90/Chr9/	FINISHED
Chr7	test_genome_snap1_datastore/91/B7/Chr7/	FINISHED

Step 2: Run SNAP gene prediction (Round 2)

Let's retrain SNAP, and run MAKER again.

$ cd test_genome_snap1.maker.output
$ ls     
maker_bopts.log  maker_exe.log  maker_opts.log  mpi_blastdb  seen.dbm  test_genome_snap1_datastore  test_genome_snap1.db  test_genome_snap1_master_datastore_index.log

Merging the gff files.

$ gff3_merge -d test_genome_snap1_master_datastore_index.log

Except to see the final merged file in the working directory - test_genome_snap1.all.gff

2.1 Run SNAP gene prediction (Round 2)

To train SNAP, we need to convert the GFF3 gene models to ZFF format. To do this we need to collect all GFF3 files into a single directory.

$ mkdir snap2 && cd snap2
$ maker2zff ../test_genome_snap1.all.gff

There should now be two new files (genome.ann and genome.dna). The first is the ZFF format file and the second is a FASTA file the coordinates can be referenced against. These will be used to train SNAP.

2.2 The basic steps for training SNAP are the same as before

$ sh /opt/misc_scripts/fathom2.sh
$ cd ../../

The final training parameters file is test_snap2.hmm. SNAP is now trained and now we will train the Augustus using BUSCO (see below) and then run final MAKER-P. If you don't want to run Augustus with SNAP, then you can run WQ-MAKER using test_snap2.hmm file (follow steps 2.3 and 2.4 and replace the test_snap1.hmm with test_snap2.hmm and give a different name to the output directory (-base test_genome_snap2)). Finally merge to generate final gff file (Add the option `-n` this time).

Augustus training using BUSCO (Optional but recommended):

BUSCO (Benchmarking UniversalSingle-Copy Orthologs) is a tool that provides measures for quantitative assessment of genome assembly, gene set, and transcriptome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB. BUSCO assessments are implemented in open-source software, with comprehensive lineage-specific sets of Benchmarking Universal Single-Copy Orthologs for arthropods, vertebrates, metazoans, fungi, eukaryotes, and bacteria. These conserved orthologs are ideal candidates for large-scale phylogenomics studies, and the annotated BUSCO gene models built during genome assessments provide a comprehensive gene predictor training set for use as part of genome annotation pipelines. BUSCO assessments offer intuitive metrics, based on evolutionarily informed expectations of gene content from hundreds of species, to gauge completeness of rapidly accumulating genomic data and satisfy an Iberian's quest for quality - "Busco calidad/qualidade". The software is freely available to download at (http://busco.ezlab.org/).

Step 1: Run BUSCO Using Docker image (Takes a really long time and so make sure you run it using screen or tmux)

We have already installed the latest version of BUSCO Docker image in the Jetstream instance. The version that is installed is v3.0 which is the latest version as of 20th Sept 2017.

1.1 Create a augustus directory in the main project folder

$ cd ../
$ mkdir augustus && cd augustus

1.2 Download the lineage file specific for your species of interest. All the lineage data can be found and downloaded from here. Since the test data is from Rice, here we will be downloading the embrophyta dataset

$ wget http://busco.ezlab.org/v2/datasets/embryophyta_odb9.tar.gz
$ tar xvf embryophyta_odb9.tar.gz
$ rm embryophyta_odb9.tar.gz

1.3 Copy the test data to this folder and run BUSCO to generate Augustus gene models.

$ cp ../../test_data/test_genome.fasta .

1.4 Update the Docker on the instance

$ ezd
* Updating ez docker and installing docker (this may take a few minutes, coffee break!)
Cloning into '/opt/cyverse-ez-docker'...
remote: Counting objects: 38, done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 38 (delta 1), reused 0 (delta 0), pack-reused 31
Unpacking objects: 100% (38/38), done.
Checking connectivity... done.
* docker was updated successfully

1.5 Run BUSCO now

$ run_BUSCO -h
usage: python BUSCO.py -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
Welcome to BUSCO 3.0.2: the Benchmarking Universal Single-Copy Ortholog assessment tool.
For more detailed usage information, please review the README file provided with this distribution and the BUSCO user guide.
optional arguments:
-i FASTA FILE, --in FASTA FILE
Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set.
-c N, --cpu N Specify the number (N=integer) of threads/cores to use.
-o OUTPUT, --out OUTPUT
Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. WARNING: do not provide a path
-e N, --evalue N E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)
-m MODE, --mode MODE Specify which BUSCO analysis mode to run.
There are three valid modes:
- geno or genome, for genome assemblies (DNA)
- tran or transcriptome, for transcriptome assemblies (DNA)
- prot or proteins, for annotated gene sets (protein)
-l LINEAGE, --lineage_path LINEAGE
Specify location of the BUSCO lineage data to be used.
Visit http://busco.ezlab.org for available lineages.
-f, --force Force rewriting of existing files. Must be used when output files with the provided name already exist.
-r, --restart Restart an uncompleted run. Not available for the protein mode
-sp SPECIES, --species SPECIES
Name of existing Augustus species gene finding parameters. See Augustus documentation for available options.
--augustus_parameters AUGUSTUS_PARAMETERS
Additional parameters for the fine-tuning of Augustus run. For the species, do not use this option.
Use single quotes as follow: '--param1=1 --param2=2', see Augustus documentation for available options.
-t PATH, --tmp_path PATH
Where to store temporary files (Default: ./tmp/)
--limit REGION_LIMIT How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
--long Optimization mode Augustus self-training (Default: Off) adds considerably to the run time, but can improve results for some non-model organisms
-q, --quiet Disable the info logs, displays only errors
-z, --tarzip Tarzip the output folders likely to contain thousands of files
--blast_single_core Force tblastn to run on a single core and ignore the --cpu argument for this step only. Useful if inconsistencies when using multiple threads are noticed
-v, --version Show this version and exit
-h, --help Show this help message and exit

Run Busco with genome fasta file as input and embryophyta as lineage. The test run should take ~15 min on a 4 CPU instance

$ run_BUSCO -i test_genome.fasta -l embryophyta_odb9 -o run_example -m genome -sp rice -f &

1.6 Once the BUSCO run is finished, rename the retraining_parameters folder which is inside the output folder based on the name of the files inside and change the permissions of the folder

$ cd run_run_example/augustus_output

# Extract the prefix of the BUSCO run
$ name=$(ls retraining_parameters | head -n 1 | sed 's/_exon_probs.pbl//')

# Rename the directory based on the prefix. 
$ sudo mv retraining_parameters $name
 
# Change the permissions on the renamed folder
$ sudo chown -hR ${USER} BUSCO_run_example_*
$ sudo chgrp -hR iplant-everyone BUSCO_run_example_*

1.7 Copy the augustus config file /opt folder to your current directory and the renamed augustus folder into the species folder. Then continue to Step 2.

$ sudo mv BUSCO_run_example_* /opt/augustus-3.2.2/config/species
$ cd ../../../../

Step 2: Run MAKER-P

2.1 Adjust the parameters in the maker_opts.ctl file

$ nano maker_opts.ctl
 
#-----Gene Prediction
snaphmm=test_genome_snap1.maker.output/snap2/test_snap2.hmm #SNAP HMM file <= Insert the name of the hmm file here
gmhmm= #GeneMark HMM file
augustus_species=BUSCO_run_example_4173945430 #Augustus gene prediction species model. Type "echo $name" and copy the output to here
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no <= Change this to 0
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no <= Change this to 0
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

2.2 Run MAKER-P now

$ nohup mpiexec -n 4 maker -base test_genome_snap2_augustus &

Once the job is finished merge and rename final gff file

$ cd test_genome_snap2_augustus.maker.output
$ gff3_merge -d test_genome_snap2_augustus_master_datastore_index.log
$ fasta_merge -d test_genome_snap2_augustus_master_datastore_index.log
 
$ maker_map_ids --prefix test_ --justify 8 test_genome_snap2_augustus.all.gff > test.genome.all.id.map
$ map_gff_ids test.genome.all.id.map test_genome_snap2_augustus.all.gff
$ map_fasta_ids test.genome.all.id.map test_genome_snap2_augustus.all.maker.transcripts.fasta
$ map_fasta_ids test.genome.all.id.map test_genome_snap2_augustus.all.maker.proteins.fasta

This marks the completion of gene annoation. When you are done, look at one of the larger contigs in a viewer like apollo and compare the raw augustus calls, raw snap calls, and the evidence aware augustus and snap calls produced by maker. If SNAP and augustus are properly trained then they will produce similar calls, and they will also be similar to the evidence aware calls from MAKER (this convergence is the result of the training). If one predictor seems to produce calls that are still very divergent, then just drop that predictor from the analysis. A bad predictor will make all results worse