Rationale:
If you are involved in a genome project for an emerging model organism, you should already have an EST database, or more likely now mRANA-Seq data, which would have been generated as part of the original sequencing project. A protein database can be collected from closely related organism genome databases or by using the UniProt/SwissProt protein database or the NCBI NR protein database. However a trained ab initio gene predictor is a much more difficult thing to generate. Gene predictors require existing gene models on which to base prediction parameters. However, with emerging model organisms you are not likely to have any pre-existing gene models. So how then are you supposed to train your gene prediction programs?
Ab initio gene predictors perform much better when they have been trained for a particular genome, and those used by Maker are no exception. However, a reliable training set is not always easily accessible, especially for non-model species. In this situation the Maker documentation suggests running Maker to generate an initial set of predictions using parameters trained for a related species, then using those predictions as the basis for training and subsequent annotation runs (Maker has an automated process for iterative training and re-annotation). MAKER gives the user the option to produce gene annotations directly from the EST evidence. You can then use these imperfect gene models to train gene predictor program. Once you have re-run WQ-MAKER with the newly trained gene predictor, you can use the second set of gene annotations to train the gene predictors yet again. This boot-strap process allows you to iteratively improve the performance of ab initio gene predictors.
SNAP training
...
:
SNAP, gene prediction uses splicing information from different species to find transcript and coding sequences within a genome assembly.
...
Step 2: Run SNAP gene prediction (Round 2)
Let's retrain SNAP, and run MAKER again.
Code Block |
---|
$ cd test_genome_snap1.maker.output $ ls maker_bopts.log maker_exe.log maker_opts.log mpi_blastdb seen.dbm test_genome_snap1_datastore test_genome_snap1.db test_genome_snap1_master_datastore_index.log |
Merging the gff files.
Code Block |
---|
$ gff3_merge -d test_genome_snap1_master_datastore_index.log |
Except to see the final merged file in the working directory - test_genome_snap1.all.gff
2.1 Run SNAP gene prediction (Round 2)
To train SNAP, we need to convert the GFF3 gene models to ZFF format. To do this we need to collect all GFF3 files into a single directory.
Code Block |
---|
$ mkdir snap2 && cd snap2 $ maker2zff ../test_genome_snap1.all.gff |
There should now be two new files (genome.ann and genome.dna). The first is the ZFF format file and the second is a FASTA file the coordinates can be referenced against. These will be used to train SNAP.
2.2 The basic steps for training SNAP are the same as before
Code Block |
---|
$ sh /opt/misc_scripts/fathom2.sh $ cd ../../ |
The final training parameters file is test_snap2.hmm. SNAP is now trained and now we will train the Augustus using BUSCO (see below) and then run final MAKER-P. If you don't want to run Augustus with SNAP, then you can run WQ-MAKER using test_snap2.hmm file (follow steps 2.3 and 2.4 and replace the test_snap1.hmm with test_snap2.hmm and give a different name to the output directory (-base test_genome_snap2)). Finally merge to generate final gff file (Add the option `-n` this time).
Augustus training using BUSCO (Optional but recommended):
BUSCO (Benchmarking UniversalSingle-Copy Orthologs) is a tool that provides measures for quantitative assessment of genome assembly, gene set, and transcriptome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB. BUSCO assessments are implemented in open-source software, with comprehensive lineage-specific sets of Benchmarking Universal Single-Copy Orthologs for arthropods, vertebrates, metazoans, fungi, eukaryotes, and bacteria. These conserved orthologs are ideal candidates for large-scale phylogenomics studies, and the annotated BUSCO gene models built during genome assessments provide a comprehensive gene predictor training set for use as part of genome annotation pipelines. BUSCO assessments offer intuitive metrics, based on evolutionarily informed expectations of gene content from hundreds of species, to gauge completeness of rapidly accumulating genomic data and satisfy an Iberian's quest for quality - "Busco calidad/qualidade". The software is freely available to download at (http://busco.ezlab.org/).
Step 1: Run BUSCO Using Docker image (Takes a really long time and so make sure you run it using screen or tmux)
We have already installed the latest version of BUSCO Docker image in the Jetstream instance. The version that is installed is v3.0 which is the latest version as of 20th Sept 2017.
Creat1.1 Create a augustus directory in the main project folder
Code Block |
---|
$ cd ../
$ mkdir augustus && cd augustus |
1.2 Download the lineage file specific for your species of interest. All the lineage data can be found and downloaded from here. Since the test data is from Rice, here we will be downloading the embrophyta dataset
Code Block |
---|
$ wget http://busco.ezlab.org/v2/datasets/embryophyta_odb9.tar.gz
$ tar xvf embryophyta_odb9.tar.gz
$ rm embryophyta_odb9.tar.gz |
1.3 Copy the test data to this folder and run BUSCO to generate Augustus gene models.
Code Block |
---|
$ cp ../../test_data/test_genome.fasta . |
1.4 Update the Docker on the instance
Code Block |
---|
$ ezd
* Updating ez docker and installing docker (this may take a few minutes, coffee break!)
Cloning into '/opt/cyverse-ez-docker'...
remote: Counting objects: 38, done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 38 (delta 1), reused 0 (delta 0), pack-reused 31
Unpacking objects: 100% (38/38), done.
Checking connectivity... done.
* docker was updated successfully |
1.5 Run BUSCO now
Code Block |
---|
$ run_BUSCO -h
usage: python BUSCO.py -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
Welcome to BUSCO 3.0.2: the Benchmarking Universal Single-Copy Ortholog assessment tool.
For more detailed usage information, please review the README file provided with this distribution and the BUSCO user guide.
optional arguments:
-i FASTA FILE, --in FASTA FILE
Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set.
-c N, --cpu N Specify the number (N=integer) of threads/cores to use.
-o OUTPUT, --out OUTPUT
Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. WARNING: do not provide a path
-e N, --evalue N E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)
-m MODE, --mode MODE Specify which BUSCO analysis mode to run.
There are three valid modes:
- geno or genome, for genome assemblies (DNA)
- tran or transcriptome, for transcriptome assemblies (DNA)
- prot or proteins, for annotated gene sets (protein)
-l LINEAGE, --lineage_path LINEAGE
Specify location of the BUSCO lineage data to be used.
Visit http://busco.ezlab.org for available lineages.
-f, --force Force rewriting of existing files. Must be used when output files with the provided name already exist.
-r, --restart Restart an uncompleted run. Not available for the protein mode
-sp SPECIES, --species SPECIES
Name of existing Augustus species gene finding parameters. See Augustus documentation for available options.
--augustus_parameters AUGUSTUS_PARAMETERS
Additional parameters for the fine-tuning of Augustus run. For the species, do not use this option.
Use single quotes as follow: '--param1=1 --param2=2', see Augustus documentation for available options.
-t PATH, --tmp_path PATH
Where to store temporary files (Default: ./tmp/)
--limit REGION_LIMIT How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
--long Optimization mode Augustus self-training (Default: Off) adds considerably to the run time, but can improve results for some non-model organisms
-q, --quiet Disable the info logs, displays only errors
-z, --tarzip Tarzip the output folders likely to contain thousands of files
--blast_single_core Force tblastn to run on a single core and ignore the --cpu argument for this step only. Useful if inconsistencies when using multiple threads are noticed
-v, --version Show this version and exit
-h, --help Show this help message and exit
|
Run Busco with genome fasta file as input and embryophyta as lineage. The test run should take ~15 min on a 4 CPU instance
Code Block |
---|
$ run_BUSCO -i test_genome.fasta -l embryophyta_odb9 -o run_example -m genome -sp rice -f |
|
& |
1.6 Once the BUSCO run is finished, rename the retraining_parameters folder which is inside the output folder based on the name of the files inside and change the permissions of the folder
Code Block |
---|
$ cd run_run_example/augustus_output # Extract the prefix of the BUSCO run $ name=$(ls retraining_parameters |
| |
head -n |
1 |
| |
sed 's/_exon_probs.pbl//') # Rename the directory based on |
the prefix. $ sudo mv |
Copy the augustus config file /opt folder to your current directory and the renamed augustus folder into the species folder. Then continue to Step 2.
Code Block |
---|
$ cd /vol_b/wq_maker_run
$ cp -r /opt/augustus/config .
$ sudo mv augustus/run_example/augustus_output/BUSCO_augustus_out_3891028551 config/species |
Step 2: Adjust the parameters in the maker_opts.ctl file
retraining_parameters $name
# Change the permissions on the renamed folder
$ sudo chown -hR ${USER} BUSCO_run_example_*
$ sudo chgrp -hR iplant-everyone BUSCO_run_example_* |
1.7 Copy the augustus config file /opt folder to your current directory and the renamed augustus folder into the species folder. Then continue to Step 2.
Code Block |
---|
$ sudo mv BUSCO_run_example_* /opt/augustus-3.2.2/config/species
$ cd ../../../../
|
Step 2: Run MAKER-P
2.1 Adjust the parameters in the maker_opts.ctl file
Code Block |
---|
$ nano maker_opts.ctl #-----Gene Prediction snaphmm=test_genome_snap1.maker.output/snap2/test_snap2.hmm #SNAP HMM file <= Insert the name of the hmm file here gmhmm= #GeneMark HMM file augustus_species=BUSCO_run_example_4173945430 #Augustus gene prediction species model. Type "echo $name" and copy the output to here fgenesh_par_file= #FGENESH parameter file pred_gff= #ab-initio predictions from an external GFF3 file model_gff= #annotated gene models from an external GFF3 file (annotation pass-through) est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no <= Change this to 0 protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no <= Change this to 0 unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no <= Change this to 0 unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no |
Step 3: Run WQ-MAKER
Before running MAKER, check to make sure all worker instances have become active.
On the MASTER instance, make sure you are in the "maker_run" directory and all of your files are in place and then run:
Code Block |
---|
$ nohup wq_maker -contigs-per-split 1 -cores 1 -memory 2048 -disk 4096 -N wq_test_${USER} -d all -o master.dbg -debug_size_limit=0 -stats test_out_stats.txt -base test_genome_snap2_augustus > log_file.txt 2>&1 & |
Wait for the MASTER to advertise master status to the catalog server before your run WQ-MAKER on the WORKERS (see below).
Info | ||
---|---|---|
| ||
$ tail log_file.txt Mon Sep 11 15:08:22 2017 :: Submitting file ./test_data/test_genome.fasta_000008 for processing. |
Once the log_file show the above output and once your WORKERS are in active state, then either ssh into or use webshell into each of the WORKERS and then run
Code Block |
---|
$ nohup work_queue_worker -N wq_test_${USER} --cores all --debug-rotate-max=0 -d all -o worker.dbg & |
Note | ||
---|---|---|
| ||
|
Once the job is finished merge gff file
Code Block |
---|
$ cd |
2.2 Run MAKER-P now
Code Block |
---|
$ nohup mpiexec -n 4 maker -base test_genome_snap2_augustus & |
Once the job is finished merge and rename final gff file
Code Block |
---|
$ cd test_genome_snap2_augustus.maker.output $ gff3_merge -d test_genome_snap2_augustus_master_datastore_index.log $ fasta_merge -d test_genome_snap2_augustus_master_datastore_index.log $ maker_map_ids --prefix test_ --justify 8 test_genome_snap2_augustus.maker.output.all.gff > test.genome.all.id.map $ gff3map_gff_merge -n -dids test.genome.all.id.map test_genome_snap2_augustus_master_datastore_index.log # All gene models.all.gff $ gff3map_fasta_merge -g -n -d ids test.genome.all.id.map test_genome_snap2_augustus_master_datastore_index.log -o .all.maker.transcripts.fasta $ map_fasta_ids test.genome.all.id.map test_genome_snap2_augustus.all.maker.gff # Maker gene models onlyproteins.fasta |
Note |
---|
This marks the completion of gene annoation. When you are done, look at one of the larger contigs in a viewer like apollo and compare the raw augustus calls, raw snap calls, and the evidence aware augustus and snap calls produced by maker. If SNAP and augustus are properly trained then they will produce similar calls, and they will also be similar to the evidence aware calls from MAKER (this convergence is the result of the training). If one predictor seems to produce calls that are still very divergent, then just drop that predictor from the analysis. A bad predictor will make all results worse |